Meta :
To propose amendments to this
document, use etherpad
http://10.9.8.7:9001/p/turning_text_into_numbers-amendments
To make comments, use etherpad
http://10.9.8.7:9001/p/turning_text_into_numbers-comments
The contents will evolve.
Turning text into numbers
Text structure
A text has different levels of structure that encode its meaning. One level
is syntax. The syntactic (grammar) rules determine how the different words in
a sentence are assigned to different roles in that sentence.
Writing programs that 'decode' a text by syntactic analysis is possible but
difficult, and recoding it in a format that permits further algorithmic
treatment is even more difficult. (If you limit yourself to sentences with a
very simple syntactic structure, the difficulty is manageable; but if you
take normal, existing, unconstrained texts, the combination of syntactic
(grammar) rules can often overwhelm the decoding ability of a program.)
That approach (syntactic analysis) has been mostly abandoned, and efforts
have been concentrated on the statistical analysis of a great number of texts
(a corpus).
There are different levels of sophistication of statistical analysis. Each
method can be viewed from two complementary perspectives: what is actually
done, that is, which information is extracted from the text for further
treatment; and what is not extracted, and therefore has absolutely no way of
influencing the further treatment.
The simplest level is where no attempt is made to take account of the
structure of a text; even the separation into sentences is ignored. All that
is extracted is the number of occurrences of each word. So you get a:
bag-of-words
Bag of words: each word is treated as a dimension in a mathematical space.
For each text, the number of occurrences of a word is treated as the
coordinate of that text's vector along the dimension of that word.
There are thus as many dimensions in that mathematical space as there are
different words in the text (or texts).
Even with this drastic level of exclusion of structure you will still have
hundreds or thousands of dimensions. That is difficult to illustrate.
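A minimal sketch of the bag-of-words idea in Python. The tokenizer, the function names and the two example lines are assumptions made for illustration; they are not taken from the workshop material.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring order and sentence breaks."""
    # Deliberately crude tokenizer: lowercase, keep runs of a-z and apostrophes.
    # Accented letters would be split off; a real tokenizer needs more care.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def to_vector(bag, vocabulary):
    """One coordinate per vocabulary word; absent words count as 0."""
    return [bag[word] for word in vocabulary]

# Two invented example 'texts'.
texts = ["To be, or not to be", "To sleep, perchance to dream"]
bags = [bag_of_words(t) for t in texts]

# The dimensions of the space: every distinct word seen in any text.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

print(vocabulary)
for v in vectors:
    print(v)
```

Even these two short lines already need 7 dimensions, one per distinct word.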
So we go a step further and eliminate even the structure of words, and just
keep the letters. We have a:
bag-of-letters
If we do not take account of accents (present in French, but not in
English), we have 'just' 26 dimensions, one for each letter.
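A bag-of-letters can be sketched in Python as follows. The way accents are removed here (Unicode decomposition) is one possible choice and an assumption of this sketch, as are the function names and the example sentence.

```python
import string
import unicodedata

ALPHABET = string.ascii_lowercase  # the 26 dimensions, 'a' to 'z'

def strip_accents(text):
    """Replace accented letters by their base letter ('é' becomes 'e')."""
    # NFD splits 'é' into 'e' plus a combining accent; we drop the accent.
    # Ligatures such as 'œ' are not decomposed and are simply not counted.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def bag_of_letters(text):
    """Return 26 counts, one per letter; spaces, digits, punctuation are ignored."""
    cleaned = strip_accents(text).lower()
    return [cleaned.count(letter) for letter in ALPHABET]

vector = bag_of_letters("Être ou ne pas être")
print(len(vector))                   # number of coordinates
print(vector[ALPHABET.index("e")])   # how many e's
```

Whatever the length of the text, the result always has exactly 26 coordinates.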
mathematical dimensions
In order not to be confused by the use of the word 'dimension', it must be
stressed that here we talk of 'mathematical dimensions', not of 'physical
dimensions'.
If we stay at the level of approximation of Newtonian physics, which
corresponds fairly well to the 'common sense' feeling, 'physical space'
can be represented by a 'mathematical space' with 3 dimensions.
But mathematical structures are not limited to those that approximate
'physical space'; and for mathematics a 3-dimensional space has no
privileged status over spaces with any other number of dimensions. Most
mathematical methods are independent of the number of dimensions of a space.
(In fact I think that the use of the word 'dimension' to describe
'physical space' has come through the influence of mathematics and
physics. The 'common sense' 'physical space' is not symmetrical: there is
vertical, lateral, and distance (depth).)
************************************************************************************************************************
PRESENT END OF REWRITE BY FJ, will be continued
************************************************************************************************************************
- "my algorithm is reading Shakespeare like this right now"
- "there are 26 dimensions, mathematically that is no problem"
example --> source: French and English versions of Shakespeare.
Each word an axis in a multidimensional space
Now: bag-of-letters, 'only' 26 axes; ie 'only' 26 dimensions; the vector has
26 coordinates
It is hard to imagine ... if it would be a, b, c you would have a 3D space
22 points (ie texts) with 26 dimensions: how many e's, how many b's
each text has only one point in the 26 dimensions
every text is a point in this multi-dimensional space, having 26 coordinates
for the 26 dimensions
"why does it have to be dimensions (why do we use the term dimensions)?"
--> why this metaphorical form?
if you reduce dimensions, it gets simpler, you lose info
n-dimensions -- a math idea, not related to our 3D space
one of the basic tools in mathematics is thinking in dimensions
we do not talk about physical dimensions, we talk about mathematical
dimensions
datamining: 'trying to get some meaning out of this'
now: translating the texts to a mathematical space. For this you need to
simplify: forget about word order (if you would keep it, it would explode
the number of dimensions)
the points are a very simplified model of the text, which got rid of the
meaning. It is common practice.
an ordered, or unordered text of Hamlet, does not matter for the machine
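A small Python check of this point: shuffling all the characters of a line leaves its letter counts unchanged. The helper name and the fixed random seed are choices made for this sketch only.

```python
import random
from collections import Counter

def letter_counts(text):
    """Count the letters a-z, ignoring case and everything else."""
    return Counter(c for c in text.lower() if "a" <= c <= "z")

line = "To be, or not to be, that is the question"

# Shuffle the characters; the seed only makes the run repeatable.
chars = list(line)
random.Random(0).shuffle(chars)
shuffled = "".join(chars)

same = letter_counts(line) == letter_counts(shuffled)
print(shuffled)
print(same)
```

The shuffled line is unreadable for a human, but for the bag-of-letters model it is the same point.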
your label/annotation is another column/dimensions/coordinate in your
dataset, that changes the location of a point within the 26 (now 27)
dimensions
> what does the algorithm see?
Multidimensional scaling
reveals clusters, in this case: language difference.
(example: from a lot of colors in a picture to grayscale, and so to two
dimensions)
you rotate the axes in different ways until you see the biggest difference.
Metric MDS = multi-dimensional-scaling
--> the act of rotating the axes, in order to find an axis on which you
can make a better differentiation between points
MDS = a program, a mathematical tool, which helps you find the highest
contrast
this step helps you to find a view, and a way to reduce dimensions, and you
can throw the rest out
two-dimensions --> is a plane
three dimensions --> projecting points on a plane
this is also possible with 26 dimensions
and you look for a plane which covers your points ...
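A rough sketch in plain Python of 'rotating until you see the biggest difference'. This is not the MDS program used in the session: it uses a related technique, finding the axis of greatest spread (the first principal axis) by power iteration, and keeps one dimension instead of a plane. Function names and the toy 3-dimensional points are invented for illustration.

```python
import math
import random

def center(points):
    """Subtract the mean of each coordinate so the cloud sits on the origin."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]

def main_axis(points, iterations=200, seed=0):
    """Unit vector along which the centered points are spread out the most."""
    centered = center(points)
    dims = len(centered[0])
    # Random start, so it is very unlikely to be exactly perpendicular
    # to the axis we are looking for.
    rng = random.Random(seed)
    axis = [rng.gauss(0, 1) for _ in range(dims)]
    norm = math.sqrt(sum(v * v for v in axis))
    axis = [v / norm for v in axis]
    for _ in range(iterations):
        # Power iteration: project every point on the axis, then rebuild
        # the axis as the score-weighted sum of the points.
        scores = [sum(a * x for a, x in zip(axis, p)) for p in centered]
        new_axis = [sum(s * p[d] for s, p in zip(scores, centered))
                    for d in range(dims)]
        norm = math.sqrt(sum(v * v for v in new_axis))
        if norm == 0:  # all points identical: no direction stands out
            break
        axis = [v / norm for v in new_axis]
    return axis

def project(points, axis):
    """Coordinate of each centered point along the given axis."""
    return [sum(a * x for a, x in zip(axis, p)) for p in center(points)]

# Invented toy data: two groups that differ mainly in the first coordinate.
points = [[1, 0, 0], [2, 0, 1], [1, 1, 0],
          [9, 0, 1], [10, 1, 0], [9, 1, 1]]
axis = main_axis(points)
print(axis)
print(project(points, axis))
```

The same functions work unchanged on 26 coordinates per point; only the length of the lists changes. To get a plane instead of a line, a second axis perpendicular to the first would be needed.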
in order to decide the point of one text into the 2-dimensional graph, in
this example all the coordinates of the 26 letters are summed up, and
divided by 26 --> average number
these numbers don't have any meaning anymore, it is a way to differentiate
between texts
making a bag-of-letters model is a very simplified model of the texts, on
such a level that you can read another type of information
this step reduces information, not by reducing dimensions
(if you're looking for a model ... you start to make a dataset, with an
amount of dimensions... but in order to get a working model ... you need to
be able to recognize your expectations ...)
modeling the line
from now on, you can throw all the data-points away, because you have a
model
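A sketch of what 'having a model' can mean in the simplest case: one number per text, two labels, and a 'line' that is just a threshold halfway between the two group averages. This is only an illustration, not the method used in the session; the scores are invented, and the labels 'X' and 'O' stand for any two classes.

```python
def fit_threshold(scores, labels):
    """Build the model: (threshold, label below it, label above it)."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    if len(groups) != 2:
        raise ValueError("this sketch handles exactly two labels")
    (label_a, a), (label_b, b) = groups.items()
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    threshold = (mean_a + mean_b) / 2  # the 'line', here a single number
    if mean_a < mean_b:
        return threshold, label_a, label_b
    return threshold, label_b, label_a

def predict(model, score):
    """Apply the model to a new score; no training data is needed here."""
    threshold, below, above = model
    return below if score < threshold else above

# Invented labelled scores, e.g. positions of four texts on one axis.
model = fit_threshold([0.08, 0.09, 0.12, 0.13], ["X", "X", "O", "O"])

print(model)
print(predict(model, 0.07), predict(model, 0.20))
```

After `fit_threshold` has run, `predict` only uses the three values stored in `model`; the original points play no further role.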
this is the moment of truth construction
"and you hope that it has something to do with the reality you want to apply
it to"
"the big leap is to check if your model is able to predict something later"
knowledge discovery implies it can find clusters by itself
what if you find a contrast in a vector space, but you don't know what it
means
> how do you know what it means?
< the only way is to check it against other data
> so you check each regularity, even if you don't know what kind of
correspondence with reality it has
< a kind of myth that is present in data-mining is that an algorithm can
discover this correspondence
the question is if the X's & O's come before the line? or after the
line? when is the data labeled? is that a process?
(is the data supervised?)
a kind of flip-flop process where you look for a differentiation, then there
is a moment where you can create your model, and then your differentiation is
*fixed* and from then on, it applies the model to other data
a hypothesis is always present in creating a model
in traditional statistics, there always is a hypothesis to get what you want
in data-mining, there still seems to be a hypothesis present
the point is: when do you formulate your hypothesis?
- now: when you see a differentiation, you start thinking what it could
mean
- so, there is another moment of formulating your hypothesis
validation phase
--> a validation method is comparing your results with another text
overfitting --> making sure if very specific points are included in your
model ...
there are standard validation procedures for reaching the moment of 'it
works'
- for example the 20% test-data, and 80% building data
- --> the 20% need to be labeled beforehand, in order to check
the model's results
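The 80% / 20% procedure can be sketched like this in Python. The function names, the fixed seed and the toy labelled items are assumptions of this sketch.

```python
import random

def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the labelled items, then cut off the test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (building data, test data)

def accuracy(predict, test_items):
    """Fraction of held-out (value, label) pairs that the model gets right."""
    if not test_items:
        raise ValueError("no test items")
    hits = sum(1 for value, label in test_items if predict(value) == label)
    return hits / len(test_items)

# Invented labelled data: ten items, each with a known label.
labelled = [(i, "en" if i % 2 == 0 else "fr") for i in range(10)]
build, test = split_dataset(labelled)

print(len(build), len(test))
```

The model is built on `build` only; `accuracy` then compares its answers on `test` with the labels that were assigned beforehand.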
Q: What is the name of the dimension to rotate around?
A: The rotating has nothing to do with meaning ... it is just the
possibility to represent in 2D at some point?
A: The average is one point in the process. Relative distribution of letters
... we normalize. Afterwards: algorithm helps you find the plane with the
most extreme difference.
Q: Can you see the process without, before the normalization
A: If I would not normalize, ..
Q: Is there a way of looking at the process of the algorithm? To look at the
moment between
- - there are points in the space, and the algorithm looks at it
- - and the algorithm gives back a result that a human can read
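The normalisation mentioned in the answers above ('relative distribution of letters') could look like this in Python. It is a sketch of one common way to normalise, dividing each count by the total; whether this is exactly what was done in the session is an assumption. The counts are invented and use 3 letters instead of 26 to stay readable.

```python
def relative_frequencies(counts):
    """Divide every count by the total, so the coordinates sum to 1."""
    total = sum(counts)
    if total == 0:  # empty text: nothing to normalise
        return [0.0] * len(counts)
    return [c / total for c in counts]

# Invented counts for a short and a long text with the same letter mix.
short_text = [2, 1, 1]
long_text = [200, 100, 100]

print(relative_frequencies(short_text))
print(relative_frequencies(long_text))
```

Without this step the long text would lie far away from the short one simply because it is longer; after it, both end up on the same point.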
outliers
> your algorithm gets better if you take your outliers out
< is the outlier a spelling mistake? or really an outlier?
> is removing outliers a standard practice?
< well, checking it is. and checking if it is a mistake or not
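A sketch of 'checking' rather than silently removing: flag the values that lie far from the rest, then look at them by hand. The rule used here (more than 2 standard deviations from the mean) is an assumption chosen for illustration, not a method named in the session, and the values are invented.

```python
import statistics

def flag_outliers(values, max_z=2.0):
    """Return the positions of values far from the mean, for manual checking."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:  # all values equal: nothing stands out
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / spread > max_z]

# Invented values, e.g. the relative frequency of 'e' in seven texts.
values = [0.11, 0.12, 0.12, 0.13, 0.11, 0.12, 0.45]
suspects = flag_outliers(values)
print(suspects)
```

The function only returns positions; whether a flagged text is a mistake or a real outlier stays a human decision.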
> does the number of dimensions influence the efficiency of the model, is
there an interest in the number of dimensions?
< if there is a difference between 1000 and 1000000, there is a
difference ... and there is also a difference in type of dimensions
...
> i'm trying to understand the economy of the dimensions...
>> but the economy is in the results of the model?