Meta :
To propose amendments to this document, use etherpad
http://10.9.8.7:9001/p/turning_text_into_numbers-amendments
To make comments, use etherpad
http://10.9.8.7:9001/p/turning_text_into_numbers-comments
The contents will evolve.

Turning text into numbers

context :

This text is a reworking and extension of the notes taken during the presentation by Hans Lammerant in the
http://10.9.8.7:9001/p/training-common-sense-day-2
document.
It tries to fit it more specifically to the task of explaining the processes designated as
step 3A: turning text into numbers
in
http://10.9.8.7:9001/p/training-common-sense


Text structure


A text has different levels of structure that code its meaning. One level is syntax. The syntactic (grammar) rules code the way the different words in a sentence are assigned to fulfill different roles in the sentence.
Writing programs that 'decode' a text by syntactic analysis is possible but difficult, and recoding it in a format that permits further algorithmic treatment is even more difficult. (If you limit yourself to sentences with a very simple syntactic structure, the difficulty is manageable; but if you take normal, existing, unconstrained texts, the combinations of syntactic (grammar) rules can often overwhelm the decoding ability of a program.)

That approach (syntactic analysis) has been mostly abandoned, and efforts have been concentrated on the statistical analysis of
a great number of texts (a corpus).

There are different levels of sophistication of statistical analysis. Each method can be viewed from two complementary perspectives: what is effectively done, i.e. what information is extracted from the text for further treatment; and what is not extracted, and thus has no way to influence the further treatment.
  The simplest level is where no attempt is made to take account of the structure of a text; even the separation into sentences is ignored, and all that is extracted is the number of occurrences of each word. So you get a :

bag-of-words


Bag of words: each word is treated as a dimension in a mathematical space. For each text, the number of occurrences of a word is treated as the length of a vector along the dimension of that word.
There are thus as many dimensions in that mathematical space as there are different words in the text (or texts).

Even with this drastic level of exclusion of structure you will still have hundreds or thousands of dimensions.  That is difficult to illustrate.
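The counting itself is easy to sketch. A minimal Python illustration (the function name and the toy sentence are ours, not from the presentation):

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring sentence structure and word order."""
    return Counter(text.lower().split())

vector = bag_of_words("the cat sat on the mat")
# Each distinct word is one dimension; its count is the coordinate.
print(vector["the"])   # 2
print(len(vector))     # 5 distinct words, so 5 dimensions for this tiny text
```

With real texts the Counter would hold thousands of entries, one per distinct word, which is exactly why the space becomes hard to illustrate.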

So we go a step further and eliminate even the structure of words, and just keep the letters.  We have a :

bag-of-letters


If we do not take account of accents (present in French, but not in English), we have 'just' 26 dimensions, one for each letter.
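A sketch of that reduction in Python, assuming accents are simply dropped and anything outside a-z is ignored:

```python
import string

def bag_of_letters(text):
    """26-dimensional vector: occurrence counts of the letters a..z."""
    counts = [0] * 26
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord('a')] += 1
    return counts

v = bag_of_letters("To be, or not to be")
print(len(v))                    # 26 dimensions, whatever the text
print(v[ord('o') - ord('a')])    # 4 occurrences of 'o'
```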


mathematical dimensions

In order not to be confused by the use of the word 'dimension', it must be stressed that here we talk of 'mathematical dimensions', not of 'physical dimensions'.

If we stay at the level of approximation of Newtonian physics, which corresponds fairly well to the 'common sense' feeling, 'physical space' can be represented by a 'mathematical space' with 3 dimensions.

But mathematical structures are not limited to those that approximate 'physical space'; and for mathematics a 3 dimensional space has no privilege over spaces with any number of dimensions. Most mathematical methods are independent of the number of dimensions of a space.

(In fact I think that the use of the word 'dimension' to describe 'physical space' has come through the influence of mathematics and physics. The 'common sense' 'physical space' is not symmetrical: there is vertical, lateral, and distance (depth).)

************************************************************************************************************************
PRESENT END OF REWRITE BY FJ, will be continued
************************************************************************************************************************
example --> source: French and English versions of Shakespeare.
Each word is an axis in a multidimensional space

Now: bag-of-letters, 'only' 26 axes, i.e. 'only' 26 dimensions; the vector has 26 coordinates
It is hard to imagine ... if there were only a, b, c you would have a 3D space

22 points (i.e. texts) with 26 dimensions: how many e's, how many b's
each text is a single point in the 26 dimensions

every text is a point in this multi-dimensional space, having 26 coordinates for the 26 dimensions

"why does it have to be dimensions (why do we use the term dimensions)?" --> why this metaphorical form?

if you reduce dimensions, it gets simpler, you lose info

n-dimensions -- a math idea, not related to our 3D space
one of the basic tools in mathematics is thinking in dimensions
we do not talk about physical dimensions, we talk about mathematical dimensions

datamining: 'trying to get some meaning out of this'

now: translating the texts to a mathematical space. For this you need to simplify: forget about word order (if you kept it, it would explode the number of dimensions)
the points are a very simplified model of the text, which got rid of the meaning. It is common practice.

an ordered or unordered text of Hamlet: it does not matter to the machine

your label/annotation is another column/dimension/coordinate in your dataset, which changes the location of a point within the 26 (now 27) dimensions


> what does the algorithm see?

Multidimensional scaling
 reveals clusters, in this case: language difference.
(example: from a lot of colors in a picture to grayscale, and so to two dimensions)
you rotate the axes in different ways until you see the biggest difference.
Metric MDS = multi-dimensional scaling

--> the act of rotating the axes, in order to find an axis on which you can make a better differentiation between points
MDS = a program, a mathematical tool, which helps you find the highest contrast
this step helps you find a view, and a way to reduce dimensions, and you can throw the rest out
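What MDS starts from can be sketched without the full machinery (finding the best plane needs an eigendecomposition, which is out of scope here): normalize each text to relative letter frequencies, treat it as a point in 26 dimensions, and measure distances between the points. The two sentences below are toy stand-ins for the Shakespeare corpus:

```python
import math
import string

def letter_freqs(text):
    """Normalized 26-dim point: letter counts divided by the total count."""
    counts = [0] * 26
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord('a')] += 1
    total = sum(counts)
    return [c / total for c in counts]

def distance(p, q):
    """Euclidean distance between two points in the 26-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

english = letter_freqs("To be or not to be that is the question")
french = letter_freqs("Etre ou ne pas etre telle est la question")
print(distance(english, french))
```

On a real corpus, texts in the same language end up closer to each other than to texts in the other language; that is the cluster structure MDS makes visible.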

two dimensions --> is a plane
three dimensions --> projecting points on a plane
this is also possible with 26 dimensions
and you look for a plane which covers your points ...

in order to place the point of one text on the 2-dimensional graph, in this example all the coordinates of the 26 letters are summed up and divided by 26 --> an average number
these numbers don't have any meaning anymore; they are a way to differentiate between texts

making a bag-of-letters model is a very simplified model of the texts, on such a level that you can read another type of information

this step reduces information, not by reducing dimensions


(if you're looking for a model ... you start to make a dataset, with a number of dimensions... but in order to get a working model ... you need to be able to recognize your expectations ...)

modeling the line
from now on, you can throw all the data-points away, because you have a model 
this is the moment of truth construction

"and you hope that it has something to do with the reality you want to apply it to"

"the big leap is to check if your model is able to predict something later"
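A toy version of that leap in Python. The 'model' here is a single threshold on one hypothetical contrast axis: the relative frequency of 'w', a letter frequent in English and rare in French. Once the threshold is fixed, the training texts can be thrown away; the threshold is the model, and the hope is that it predicts correctly on texts it has never seen:

```python
def w_frequency(text):
    """Relative frequency of the letter 'w' among all letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return letters.count('w') / len(letters)

# The threshold plays the role of the fitted line: chosen once from
# labeled training texts (the value here is a hypothetical illustration),
# then fixed and applied to any new text.
THRESHOLD = 0.005

def predict(text):
    return "english" if w_frequency(text) > THRESHOLD else "french"

print(predict("we know what we are but know not what we may be"))  # english
print(predict("tout est bien qui finit bien"))                     # french
```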


knowledge discovery implies it can find clusters by itself
what if you find a contrast in a vector space, but you don't know what it means
> how do you know what it means?
< the only way is to check it against other data
> so you check each regularity, even if you don't know what kind of correspondence with reality it has
< a kind of myth that is present in data-mining is that an algorithm can discover this correspondence

the question is: do the X's & O's come before the line, or after the line? when is the data labeled? is that a process?
(is the data supervised?)

a kind of flip-flop process where you look for a differentiation, then there is a moment where you can create your model, and then your differentiation is *fixed* and from then on, it applies the model to other data

a hypothesis is always present in creating a model 

in traditional statistics, there is always a hypothesis to get what you want
in data-mining, there still seems to be a hypothesis present
the point is: when do you formulate your hypothesis?


validation phase 
--> a validation method is comparing your results with another text

overfitting --> when very specific points of the training data get built into your model ...

there are standard validation procedures for reaching the moment of 'it works'
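A minimal sketch of such a hold-out check in Python (the one-letter model and the two held-out sentences are hypothetical illustrations):

```python
def validate(model, labeled_texts):
    """Hold-out validation: apply a fixed model to texts it has never
    seen, and report the fraction of correct predictions."""
    correct = sum(1 for text, label in labeled_texts if model(text) == label)
    return correct / len(labeled_texts)

# Hypothetical fixed model: English if the text contains a 'w' at all.
def guess_language(text):
    return "english" if "w" in text.lower() else "french"

held_out = [
    ("what do we have here", "english"),
    ("qu'avons nous ici", "french"),
]
print(validate(guess_language, held_out))  # 1.0
```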

Q: What is the name of the dimension to rotate around?
A: The rotating has nothing to do with meaning ... it is just the possibility to represent it in 2D at some point?
A: The average is one point in the process. Relative distribution of letters ... we normalize. Afterwards: the algorithm helps you find the plane with the most extreme difference.
Q: Can you see the process without, or before, the normalization?
A: If I would not normalize, ..

Q: is there a way of looking at the process of the algorithm? to look at the moment between 

outliers
> your algorithm gets better if you take your outliers out
< is the outlier a spelling mistake? or really an outlier?
> is removing outliers a standard practice?
< well, checking them is. and checking if it is a mistake or not
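A common first-pass flagging step can be sketched like this (a sketch only; as the exchange above says, flagged points should be checked by hand, not deleted blindly):

```python
import statistics

def remove_outliers(values, k=2.0):
    """Drop values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= k * sd]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 12, 95]
print(remove_outliers(data))  # 95 is removed; the rest survive
```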

> does the number of dimensions influence the efficiency of the model? is there an interest in the number of dimensions?
< between 1000 and 1000000 there is a difference ... and there is also a difference in the type of dimensions ...
> i'm trying to understand the economy of the dimensions...
>> but the economy is in the results of the model?