training-common-sense-day-3

Random notes from Sunday

Seda preparing a session on Machine Learning: http://pad.constantvzw.org/public_pad/touchingCorrelations

Weka: http://www.cs.waikato.ac.nz/~ml/index.html (floss java package for machine learning)

"2 features are redundant if they are highly correlated with each other"

"occam's razor: simpler models are usually better"

Hamid Ekbia, Bonnie Nardi, Heteromation and its (dis)contents: The invisible division of labor between humans and machines
http://firstmonday.org/ojs/index.php/fm/article/view/5331

Alison Adam, Artificial Knowing: Gender and the Thinking Machine (1998)

where to go today (& tomorrow & ... )

* What would be a sentence that would have a sentiment polarity of 0.0?

popular polarity analysis

- sentiment analysis --> only polarity in Pattern that has examples
- - movie review database, star ratings
- - amazon-turk procedure (World Well Being Project?)
- - manually rated dataset (Pattern)
- age
- gender
- personality

* A proposed exercise:
we can change the en.sentiment.xml file in Pattern, with our own adjectives. --> which adjectives?
file used in other programs, elements of pattern?
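
A minimal sketch of what this exercise could look like, assuming the lexicon can be handled as plain XML with Python's standard library. The `<sentiment>` root element, the helper `add_adjective` and the adjective "algorithmic" are placeholders for illustration; the attribute names are the ones seen in the en.sentiment.xml entries quoted in these notes. The real file is not touched.

```python
import xml.etree.ElementTree as ET

# Stand-in for en.sentiment.xml: one entry with values taken from the notes,
# wrapped in an assumed <sentiment> root element.
SAMPLE = '''<sentiment language="en">
<word form="grotesque" pos="JJ" polarity="-1.0" subjectivity="1.0" intensity="1.0" confidence="0.8" />
</sentiment>'''

def add_adjective(root, form, polarity, subjectivity, sense=""):
    """Append a <word> entry using the attribute names seen in the lexicon."""
    return ET.SubElement(root, "word", {
        "form": form,
        "pos": "JJ",
        "sense": sense,
        "polarity": str(polarity),
        "subjectivity": str(subjectivity),
        "intensity": "1.0",
        "confidence": "1.0",
    })

root = ET.fromstring(SAMPLE)
# "algorithmic" and its scores are hypothetical; which adjectives to add is the open question.
add_adjective(root, "algorithmic", polarity=0.0, subjectivity=0.5, sense="hypothetical entry")

forms = [word.get("form") for word in root.iter("word")]
print(forms)
print(ET.tostring(root, encoding="unicode"))
```

Whether Pattern picks up such an entry without further changes would still need to be checked against the forked code.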

list of exercises that could be done

* A map of Pattern structure
listing the Pattern-2.6.zip structure --> [[training-common-sense-Pattern-2.6-structure]]
and adding notes
looking at genealogy; a mapping of where things come from or where else they appear
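
One possible way to produce such a listing, sketched with Python's standard `zipfile` module. The helper `list_structure` and the file names in the demo archive are stand-ins, not the actual content of Pattern-2.6.zip; with the real archive, the path to the zip file would be passed instead of the in-memory buffer.

```python
import io
import zipfile

def list_structure(zip_file, depth=2):
    """Return the sorted set of paths in the archive, truncated to `depth` levels."""
    paths = set()
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            parts = [part for part in name.split("/") if part]
            paths.add("/".join(parts[:depth]))
    return sorted(paths)

# Small in-memory archive with hypothetical file names, so the sketch runs on its own.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("pattern-2.6/README.txt", "readme")
    demo.writestr("pattern-2.6/examples/03-en/example.py", "# example")
    demo.writestr("pattern-2.6/pattern/en/en.sentiment.xml", "<sentiment />")

structure = list_structure(buffer, depth=2)
for path in structure:
    print(path)
```

The printed list can be pasted into the structure page and annotated by hand.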

* Introduction: pattern-readme-plus.md
introduction to the forked Pattern-2.6 [[training-common-sense-pattern-readme-plus.md]]

* A note added to ...
how to name our comments/notes

- weird feelings about the fr.sentiment.xml - add a 'note'
- an attempt to annotate some french files : [[training-common-sense-french-files-analysis]]

* An explanation of vector-projections --> examples:
Principal Component Analysis (PCA) with Animation: https://www.youtube.com/watch?v=9DPiXrN2pEg
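
As a small companion to the video, a sketch of the projection step only: projecting one vector onto another axis. It does not show how PCA chooses its axes (the directions of greatest variance in the data); the function names and the example vectors are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(v, axis):
    """Orthogonal projection of v onto the line spanned by axis."""
    scale = dot(v, axis) / dot(axis, axis)
    return [scale * a for a in axis]

point = [3.0, 4.0]     # hypothetical data point
x_axis = [1.0, 0.0]    # hypothetical axis to project onto

shadow = project(point, x_axis)
print(shadow)
```

The projection keeps only the part of the point that lies along the chosen axis; everything perpendicular to it is discarded, which is the information loss a dimensionality reduction accepts.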

* Some quotes, references to critical resources

* a reflection on the data-mining culture --> where?

an option is ....

our-critical-fork-folder ....
- - pattern-2.6.zip
- - weka-3.0 (?)
- - data-mining-culture

or ....

meta-mining
- - pattern-2.6-critical-fork.zip
- - weka-3.0-critical-fork.zip
- - data-mining-culture
- - ...
- - ...

* how to create a 'critical-fork-method', a critical-issue-tracker

- Magic comments
- Issues
- add FIX-IT-like files (or comments into the code?), following this standard -> ask Michael
- AUTOPSY
- DISSECTION
- DISCUSS
- DEBATE
- DEBATABLE
- STUDY
- REFLECTION
- #CRITICAL-ISSUE
- Performance
- Read-write-execute
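
A rough sketch of how such magic comments could be collected into a critical-issue list. The comment format `# TAG: text` is an assumption (the standard still has to be agreed on); the tag names are taken from the list above, and the sample source is invented.

```python
import re

# Tags proposed in the notes; the "# TAG: text" format is an assumption.
TAGS = ["AUTOPSY", "DISSECTION", "DISCUSS", "DEBATE", "DEBATABLE",
        "STUDY", "REFLECTION", "CRITICAL-ISSUE"]
PATTERN = re.compile(r"#\s*(" + "|".join(re.escape(tag) for tag in TAGS) + r")\b:?\s*(.*)")

def find_notes(source):
    """Return (line number, tag, text) for every tagged comment in the source."""
    notes = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = PATTERN.search(line)
        if match:
            notes.append((number, match.group(1), match.group(2).strip()))
    return notes

# Invented sample source, only to make the sketch runnable.
sample = "\n".join([
    "# DISCUSS: why is the polarity averaged?",
    "def sentiment(s):",
    "    return 0.0  # CRITICAL-ISSUE: placeholder value",
    "# an ordinary comment",
])

notes = find_notes(sample)
for number, tag, text in notes:
    print(number, tag, text)
```

Running this over the files of a fork would give a flat list that could serve as a first version of the issue tracker.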

* create alternative type of tutorials of Pattern
next to the comments and files of the critical fork

- showing where Pattern doesn't 'work' or using different sources for getting data (free software resources, public domain resources, specific lexicons and datasets...)

* write the license for the critical fork (as "relearn" ? something else ?)

* Use criticisms of semantics
(with a focus on how words relate to each other in a sentence) and pragmatics (with a focus on how the meaning of words is always produced in a certain context) to think about the particular failure of assigning numerical values to, e.g., adjectives

Notes from Pattern lexicon en.sentiment.xml

<word form="grotesque" cornetto_synset_id="n_a-503484" wordnet_id="a-00221627" pos="JJ" sense="distorted and unnatural in shape or size" polarity="-1.0" subjectivity="1.0" intensity="1.0" confidence="0.8" />
<word form="grotesque" cornetto_synset_id="n_a-535905" wordnet_id="a-00221627" pos="JJ" sense="distorted and unnatural in shape or size" polarity="-0.1" subjectivity="1.0" intensity="1.0" confidence="0.8" />

for the same word - grotesque - there are two entries with the same definition and the same wordnet_id, but different cornetto_synset_id values and differing polarity values: "-1.0" and "-0.1"

note on averaging a polarity-rate

comment written in the example file (pattern-2.6/examples/03-en), that directly uses the en.sentiment.xml file:

# The sentiment() function returns an averaged (polarity, subjectivity)-tuple for a given string.
--> the act of averaging is already admitted/described
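
To make the effect of averaging concrete, a sketch that reads the two "grotesque" entries quoted in these notes and averages their scores per word form. This is not Pattern's own implementation of sentiment(); the `<sentiment>` wrapper element and the helper `average_by_form` are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two entries from the notes, wrapped in an assumed root element.
ENTRIES = '''<sentiment>
<word form="grotesque" cornetto_synset_id="n_a-503484" wordnet_id="a-00221627" pos="JJ" sense="distorted and unnatural in shape or size" polarity="-1.0" subjectivity="1.0" intensity="1.0" confidence="0.8" />
<word form="grotesque" cornetto_synset_id="n_a-535905" wordnet_id="a-00221627" pos="JJ" sense="distorted and unnatural in shape or size" polarity="-0.1" subjectivity="1.0" intensity="1.0" confidence="0.8" />
</sentiment>'''

def average_by_form(root):
    """Average polarity and subjectivity over all entries sharing a word form."""
    collected = defaultdict(list)
    for word in root.iter("word"):
        scores = (float(word.get("polarity")), float(word.get("subjectivity")))
        collected[word.get("form")].append(scores)
    return {
        form: (sum(p for p, _ in scores) / len(scores),
               sum(s for _, s in scores) / len(scores))
        for form, scores in collected.items()
    }

averaged = average_by_form(ET.fromstring(ENTRIES))
polarity, subjectivity = averaged["grotesque"]
print(polarity, subjectivity)
```

Averaged this way, the two polarities -1.0 and -0.1 collapse into a single value of about -0.55, and the difference between the two entries disappears.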

notes on the construction of sentiment() & en.sentiment.xml

- an issue on the Github page of Pattern asks about the method behind the xx.sentiment.xml file. (see --> https://github.com/clips/pattern/issues/85 )
- Tom de Smedt explaining how the sentiment lexicon is constructed --> http://www.jmlr.org/papers/volume13/desmedt12a/desmedt12a.pdf :

4. Case Study
As a case study, we used PATTERN to create a Dutch sentiment lexicon (De Smedt and Daelemans, 2012). We mined online Dutch book reviews and extracted the 1,000 most frequent adjectives. These were manually annotated with positivity, negativity, and subjectivity scores. We then enlarged the lexicon using distributional expansion. From the TWNC corpus (Ordelman et al., 2007) we extracted the most frequent nouns and the adjectives preceding those nouns. This results in a vector space with approximately 5,750 adjective vectors with nouns as features. For each annotated adjective we then computed k-NN and inherited its scores to neighbor adjectives. The lexicon is bundled into PATTERN 2.3.
- what does 'manually annotated' mean?
- what are similar adjectives?
- --> we spoke about that here --> to compute the semantic similarity/relatedness/mathematical-relation between adjectives --> [[training-common-sense-day-3]]
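
One reading of "computed k-NN and inherited its scores to neighbor adjectives", sketched on toy data: each annotated adjective passes its score to its k nearest unannotated neighbours, where nearness is cosine similarity between noun-count vectors. The vectors, the seed scores, the choice of cosine similarity and the averaging when several seeds reach the same adjective are all assumptions; the paper excerpt does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def expand(seeds, vectors, k=1):
    """Pass each seed's score to its k most similar unannotated adjectives."""
    received = {}
    for seed, score in seeds.items():
        candidates = [adj for adj in vectors if adj not in seeds]
        candidates.sort(key=lambda adj: cosine(vectors[seed], vectors[adj]), reverse=True)
        for neighbor in candidates[:k]:
            received.setdefault(neighbor, []).append(score)
    # Assumption: if several seeds reach the same adjective, their scores are averaged.
    return {adj: sum(scores) / len(scores) for adj, scores in received.items()}

# Toy data: adjectives described by counts of the nouns they precede (invented).
vectors = {
    "good":  {"book": 3, "story": 2},
    "great": {"book": 2, "story": 2},
    "bad":   {"ending": 3, "plot": 2},
    "awful": {"ending": 2, "plot": 3},
}
seeds = {"good": 0.7, "bad": -0.7}   # hypothetical manual annotations

expanded = expand(seeds, vectors, k=1)
print(expanded)
```

In this toy case "great" inherits the score of "good" and "awful" that of "bad", purely because they precede the same nouns; this is where the question "what are similar adjectives?" becomes a question about the corpus.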

- the en.sentiment.xml file is then extended by using the adjectives used in the Pang & Lee dataset v2. This dataset is based on 1000 positive & 1000 negative movie reviews.

Pang & Lee, movie review dataset
developed at Cornell University, NLP department --> https://confluence.cornell.edu/display/NLP/Home
* profiles:

- Lillian Lee --> http://www.cs.cornell.edu/home/llee/
- Bo Pang --> http://www.cs.cornell.edu/people/pabo/ --> https://sites.google.com/site/bopang42/

* links:

- [article] Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of ACL 2005. --> http://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf
- webpage of the Pang & Lee datasets --> http://www.cs.cornell.edu/people/pabo/movie-review-data/
- Pang & Lee, movie review dataset v2.0 --> http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
- Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived. (This file is identical to movie.zip from data release v1.0.) --> http://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip
- 100 papers using the Movie Review Data --> http://www.cs.cornell.edu/people/pabo/movie-review-data/otherexperiments.html