training-common-sense day 2


[[quotes]] : "on the circularity of classification"


ways to go

- going through Pattern, following the KDD steps
- attempt to re-do the facebook-messages data-mining
- look at types of visualisation
- two ways of speaking about a KDD process

* [excuse me video] --> contradiction of desired + undesired result
* which is also present in all the KDD steps --> common sense
* interesting --> laughter --> when do we laugh?
* so many layers of signification were present to create such a word cloud ...
* difference between a truth and a cultural pattern that appears in the language in a certain situation
* we know there are patterns in language use, we notice them ... science tries to create models to make predictions ... but what would be the solution to these patterns ...
* if we would create such a word cloud for, for example, race classification ... it could produce a more rigorous result ... but yes, it's the data, right?
* will there always be a biased result of gender classification? even if you perform an adequate research process
* difference between:
* there are analyses of age, gender, health, personality, ... --> global topics that can be connected
* statistics is a tool that is the result of an abstraction
* philosopher, biologist and ethologist (comparative study of animal behaviour) Konrad Lorenz (died in 1989): "you have to swim in observations, before you can start a statistical experiment"
* history of parole (the choice whether a prisoner could go on parole); the decision-making process for this is highly influenced by data-mining prediction tools
* there are recommendation algorithms for dating/products/etc.; one loses one's ability to make a decision
* how did the facebook users react to the results of gender-mining? Look at qualitative elements of the research
* what is the aim, interest of facebook to focus on gender-mining?


going through Pattern, following the KDD steps:


- download Pattern, options: 

- examples: 


----------------------------------------------

amazing polarity average
amazing appears two times in the en-sentiment.xml file:

In the example that Pattern provides (pattern-2.6/examples/03-en/07-sentiment.py), the word amazing gives a polarity of 0.66666:



which is described as the mathematical average of the polarity value 0.8 of the first sense of amazing and the polarity value 0.4 of the second sense. (Note that a plain mean of 0.8 and 0.4 is 0.6, so the 0.66666 presumably involves a weighting or additional entries.)

meaning is mathematically averaged...... (???????)
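A minimal sketch of such sense averaging (the plain-mean mechanism is an assumption here, not Pattern's verified internals):

```python
# Sketch: averaging the sense-level polarity scores of one word.
# The values 0.8 and 0.4 are the two senses of "amazing" quoted above;
# the plain-mean scheme is an assumption, not Pattern's actual code.

def average_polarity(sense_polarities):
    """Arithmetic mean of the polarity values of a word's senses."""
    return sum(sense_polarities) / len(sense_polarities)

amazing_senses = [0.8, 0.4]
print(round(average_polarity(amazing_senses), 4))  # -> 0.6
```

If the word-level sentiment is read as such a mean, meaning is quite literally mathematically averaged.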

Looking through sentiment examples:

# It contains adjectives that occur frequently in customer reviews
i.e. sentiments related to consuming, but consuming what?

en-sentiment.xml

"The reliability specifies if an adjective was hand-tagged (1.0) or inferred (0.7)."
Inferred = decided by the algorithm? Sentiment according to the machine?


it appears only a reliability of 0.9 is to be found; not every term has a reliability score.
Some adjectives have no wordnet id. Is the file a mash-up?
Sometimes the term "confidence" is used (reliability and confidence are never found on the same word, and confidence is either rated 0.8 or 0.9). Could these be the same measure?
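The reliability/confidence observation can be checked mechanically; a sketch with ElementTree, using two illustrative entries modeled on the file (the "nice" entry is invented for the example):

```python
import xml.etree.ElementTree as ET

# Two illustrative <word> entries modeled on en-sentiment.xml;
# the "nice" entry is hypothetical.
excerpt = """
<sentiment>
  <word form="nice" pos="JJ" polarity="0.6" subjectivity="1.0" reliability="0.9"/>
  <word form="affluent" pos="JJ" polarity="0.6" subjectivity="1.0" confidence="0.8"/>
</sentiment>
"""

root = ET.fromstring(excerpt)
for word in root.iter("word"):
    has_rel = "reliability" in word.attrib
    has_conf = "confidence" in word.attrib
    print(word.attrib["form"], "reliability" if has_rel else "confidence")
    # per the observation above: never both on the same word
    assert not (has_rel and has_conf)
```

Run with `ET.parse(path).getroot()` on the real en-sentiment.xml, the same loop would confirm (or refute) the observation for the whole file.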

Comparing en-sentiment.xml to SentiWordNet 3.0

<word form="affluent" cornetto_synset_id="n_a-526762" wordnet_id="a-02022167" pos="JJ" sense="having an abundant supply of money or possessions of value" polarity="0.6" subjectivity="1.0" intensity="1.0" confidence="0.8"/>

02022167        0        0.25        wealthy#1 moneyed#2 loaded#4 flush#2 affluent#1        having an abundant supply of money or possessions of value; "an affluent banker"; "a speculator flush with cash"; "not merely rich but loaded"; "moneyed aristocrats"; "wealthy corporations"

<word form="afloat" cornetto_synset_id="n_a-533320" wordnet_id="a-00076921" pos="JJ" sense="borne on the water" polarity="0.0" subjectivity="0.1" intensity="1.0" confidence="0.8"/>

00076921        0        0        afloat#2        borne on the water; floating

From the SentiWordNet annotation:

"objectivity = 1 - (PosScore + NegScore)"

affluent: objectivity = 1 - (0 + 0.25) = 0.75
afloat: objectivity = 1 - (0 + 0) = 1

http://sentiwordnet.isti.cnr.it/
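The two calculations above, as code (scores taken from the quoted SentiWordNet 3.0 rows):

```python
# SentiWordNet's own formula: objectivity = 1 - (PosScore + NegScore)
def objectivity(pos_score, neg_score):
    return 1.0 - (pos_score + neg_score)

print(objectivity(0.0, 0.25))  # affluent -> 0.75
print(objectivity(0.0, 0.0))   # afloat   -> 1.0
```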

----------------------------------------------

Presentation Hans Lammerant


[[training-common-sense]] -> context
rewritten version by FJ --> ../training-common-sense/Hans_presentation1.html

----------------------------------------------------------------
----------------------------------------------------------------

construction of a certain visibility
text gets simplified, a construction
getting data out of text, to make it 'treatable' with math

step 3A: turning text into numbers

example --> source: French and English versions of Shakespeare. Bag of words: each word becomes its own dimension in a mathematical space.

bag-of-words:

Each word is an axis in a multidimensional space

Now: bag-of-letters, 'only' 26 axes; i.e. 'only' 26 dimensions; the vector has 26 coordinates
It is hard to imagine ... if it were only a, b, c, you would have a 3D space

22 points (i.e. texts) with 26 dimensions: how many e's, how many b's
each text is a single point in the 26 dimensions

every text is a point in this multi-dimensional space, having 26 coordinates for the 26 dimensions
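The bag-of-letters mapping can be sketched in a few lines (an illustration, not the code used in the session):

```python
import string

def letter_vector(text):
    """Map a text to one point in 26-dimensional space:
    one coordinate per letter a-z, counting occurrences."""
    text = text.lower()
    return [text.count(letter) for letter in string.ascii_lowercase]

point = letter_vector("To be, or not to be")
print(len(point))   # 26 coordinates: one per dimension
print(point[string.ascii_lowercase.index("o")])  # -> 4 o's
```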

"why does it have to be dimensions (why do we use the term dimensions)?" --> why this metaphorical form?

if you reduce dimensions, it gets simpler, you lose info

n-dimensions -- a math idea, not related to our 3D space
one of the basic tools in mathematics is thinking in dimensions
we do not talk about physical dimensions, we talk about mathematical dimensions

datamining: 'trying to get some meaning out of this'

now: translating the texts into a mathematical space. For this you need to simplify: forget about word order (if you would keep it, the number of dimensions would explode)
the points are a very simplified model of the text, which got rid of the meaning. It is common practice.

an ordered or a shuffled text of Hamlet makes no difference to the machine
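That indifference to order is easy to demonstrate (a minimal illustration):

```python
from collections import Counter

# A bag-of-words model discards word order: an ordered text and a
# shuffled one map to exactly the same bag, i.e. the same point.
ordered = "to be or not to be".split()
shuffled = "be not to be or to".split()

print(Counter(ordered) == Counter(shuffled))  # -> True
```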

your label/annotation is another column/dimension/coordinate in your dataset, which changes the location of a point within the 26 (now 27) dimensions


> what does the algorithm see?

Multidimensional scaling
 reveals clusters, in this case: language difference.
(example: from a lot of colours in a picture to grayscale, and so on down to two dimensions)
you rotate the axes in different ways until you see the biggest difference.
Metric MDS = metric multidimensional scaling

--> the act of rotating the axes, in order to find an axis on which you can make a better differentiation between points
MDS = a program, a mathematical tool, which helps you find the highest contrast
this step helps you find a view, and a way to reduce dimensions; the rest you can throw out
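A minimal classical (metric) MDS can be sketched with numpy; this is the textbook double-centering + eigendecomposition version, assumed here for illustration, not the actual tool used in the session:

```python
import numpy as np

def classical_mds(points, k=2):
    """Classical (metric) MDS: embed n points into k dimensions while
    preserving pairwise Euclidean distances as well as possible."""
    X = np.asarray(points, dtype=float)
    n = X.shape[0]
    # squared pairwise distances
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # double-centering: B = -1/2 * J @ D2 @ J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # the top-k eigenpairs of B give the k-dimensional embedding
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# toy stand-in for 22 texts as points in 26-dimensional letter space
rng = np.random.default_rng(0)
texts = rng.random((22, 26))
embedding = classical_mds(texts, k=2)
print(embedding.shape)  # -> (22, 2)
```

Plotting `embedding` is what reveals the clusters; the two new axes themselves carry no meaning.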

two dimensions --> a plane
three dimensions --> projecting points onto a plane
this is also possible with 26 dimensions
and you look for a plane which covers your points ...

in order to place one text as a point in the 2-dimensional graph, in this example the coordinates of all 26 letters are summed up and divided by 26 --> an average number
these numbers don't have any meaning anymore; they are a way to differentiate between texts

making a bag-of-letters model is a very simplified model of the texts, at such a level that you can read another type of information

this step reduces information, not by reducing dimensions


(if you're looking for a model ... you start by making a dataset, with a certain number of dimensions ... but in order to get a working model ... you need to be able to recognize your expectations ...)

modeling the line
from now on, you can throw all the data-points away, because you have a model 
this is the moment of truth construction

"and you hope that it has something to do with the reality you want to apply it to"

"the big leap is to check if your model is able to predict something later"


knowledge discovery implies it can find clusters by itself
what if you find a contrast in a vector space, but you don't know what it means?
> how do you know what it means?
< the only way is to check it against other data
> so you check each regularity, even if you don't know what kind of correspondence with reality it has
< a kind of myth that is present in data-mining is that an algorithm can discover this correspondence

the question is whether the X's & O's come before the line or after the line. when is the data labelled? is that a process?
(is the data supervised?)

a kind of flip-flop process where you look for a differentiation; then there is a moment where you can create your model, and then your differentiation is *fixed*, and from then on the model is applied to other data

a hypothesis is always present in creating a model 

in traditional statistics, there is always a hypothesis to get what you want
in data-mining, there still seems to be a hypothesis present
the point is: when do you formulate your hypothesis?


validation phase 
--> a validation method is comparing your results with another text

overfitting --> when very specific points (noise) get included in your model ...

there are standard validation procedures for reaching the moment of 'it works'
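One standard procedure, hold-out validation, sketched with numpy on invented data; it also makes overfitting visible:

```python
import numpy as np

# Hold out part of the data; compare a simple model against one
# flexible enough to chase very specific (noisy) training points.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, size=20)

x_train, y_train = x[::2], y[::2]   # even indices: training
x_test, y_test = x[1::2], y[1::2]   # odd indices: held out

def mse(model, xs, ys):
    return float(np.mean((model(xs) - ys) ** 2))

line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

print("line:    train", mse(line, x_train, y_train), "test", mse(line, x_test, y_test))
print("overfit: train", mse(overfit, x_train, y_train), "test", mse(overfit, x_test, y_test))
# the degree-9 model fits its 10 training points (almost) exactly,
# but tends to do worse than the line on the held-out points
```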

Q: What is the name of the dimension to rotate around? 
A: The rotating has nothing to do with meaning ... it is just what makes it possible to represent it in 2D at some point.
A: The average is one point in the process. Relative distribution of letters ... we normalize. Afterwards: an algorithm helps you find the plane with the most extreme difference.
Q: Can you see the process without, i.e. before, the normalization?
A: If I would not normalize, ..

Q: is there a way of looking at the process of the algorithm? to look at the moment between

outliers
> your algorithm gets better if you take your outliers out
< is the outlier a spelling mistake? or really an outlier?
> is removing outliers a standard practice?
< well, checking them is. and checking whether it is a mistake or not

> does the number of dimensions influence the efficiency of the model? is there an interest in the number of dimensions?
< between 1,000 and 1,000,000 there is a difference ... and there is also a difference in the type of dimensions ...
> i'm trying to understand the economy of the dimensions ...
>> but the economy is in the results of the model?