Observations and annotations, comparing en-sentiment.xml and fr-sentiment.xml


Summary
1. "en-sentiment.xml" file
1.1 Description
2. "fr-sentiment.xml" file
2.1 Description
2.2 Rates analysis
2.3 Some selected adjectives analysis
2.3.1 "abracadabrant"
2.3.2 "allemand"
2.3.3 "transalpin"
2.3.4 "waouh"
2.3.5 pos: RB
3. Corpora
3.1 A few words
3.2 "Amazon.fr French book reviews" file
3.3 "Lexique 3 French word forms" file


en-sentiment.xml


SUBJECTIVITY LEXICON FOR ENGLISH ADJECTIVES. 
Adjectives have a polarity (negative/positive, -1.0 to +1.0) 
and a subjectivity (objective/subjective, +0.0 to +1.0). 
The reliability specifies if an adjective was hand-tagged (1.0) or inferred (0.7). 
[Words are tagged per sense, e.g., ridiculous (pitiful) = negative, ridiculous (humorous) = positive.] 
[The Cornetto id (lexical unit id) and Cornetto synset id refer to the Cornetto lexical database for Dutch.]
[The WordNet id refers to the WordNet3 lexical database for English.]
The part-of-speech tags (pos) use the Penn Treebank II tag set: NN = noun, JJ = adjective, ... 
For English movie reviews (Pang & Lee polarity dataset v2.0), the accuracy is 75% (P 0.76, R 0.75, F1 0.75).

--> 2896 adjectives
--> legend: [  ] = is not in the FR version


fr-sentiment.xml


SUBJECTIVITY LEXICON FOR FRENCH ADJECTIVES.
Adjectives have a polarity (negative/positive, -1.0 to +1.0) 
and a subjectivity (objective/subjective, +0.0 to +1.0). 
The reliability specifies if an adjective was hand-tagged (1.0) or inferred (0.7). 
The part-of-speech tags (pos) use the Penn Treebank II tag set: NN = noun, JJ = adjective, ... 
For French book reviews, the accuracy is 75% (P 0.76, R 0.75, F1 0.75).

- the fr-sentiment.xml file is less complete than the en-sentiment.xml one
- 5115 adjectives
- from line 4925 onwards: the a-to-z list starts again for words with an accented first letter
- "sense" is missing --> which meaning of an adjective is rated?
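
A minimal sketch of how the counts above could be re-checked, assuming the lexicon is a flat list of <word .../> elements and that a sense, when present, is stored in an attribute named "sense" (both are assumptions, as is the <sentiment> root element used here). The sample is built from entries quoted in these notes; the "waouh" entry is abbreviated to the attributes mentioned in the notes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample built from entries quoted in these notes.
# The root element name is an assumption; the "waouh" entry is abbreviated.
SAMPLE = """<sentiment>
<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrant"/>
<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemand"/>
<word confidence="0.7" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="transalpin"/>
<word pos="UH" form="waouh"/>
</sentiment>"""


def summarize(xml_text):
    """Return (number of entries, entries per pos tag, entries with a sense)."""
    words = list(ET.fromstring(xml_text).iter("word"))
    pos_counts = Counter(w.get("pos") for w in words)
    # "sense" as an attribute name is an assumption about the file layout.
    with_sense = sum(1 for w in words if w.get("sense") is not None)
    return len(words), pos_counts, with_sense


total, pos_counts, with_sense = summarize(SAMPLE)
print(total, dict(pos_counts), with_sense)
```

To run the same check on the real file, its content can be read as text and passed to summarize().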

Rates analysis:

- confidence (reliability hand-tagged (1.0) or inferred (0.7))
--- +0.7 to +1.0 (to be analysed more precisely)
--- 1.0 = manually annotated adjective, or an automatically annotated adjective selected by both annotators [researchers?] that appears in the same Cornetto synset as its gold1000 parent
--- 0.9 = selected by only one annotator, but also appears in the same Cornetto synset as its parent
--- 0.8 = matches its parent by Cornetto description
--- 0.7 = the adjective inherits its polarity, subjectivity and intensity from the average of all its selected parents
- intensity
--- +2.0 "clairement" and pos="RB" (adverb)
--- +1.3 "plus" and pos="RB"
--- +2.0 "profondément" and pos="RB"
--- +2.0 "si" and pos="RB"
--- +2.0 "simplement" and pos="RB"
--- +2.0 "terriblement" and pos="RB"
--- +2.0 "totalement" and pos="RB"
--- +2.0 "tres" and pos="RB"
--- +2.0 "très" and pos="RB"
--- +2.0 "vraiment" and pos="RB"
--- +2.0 "également" and pos="RB"
--- +2.0 "évidemment" and pos="RB"
--> all are adverbs
- subjectivity (objective/subjective, +0.0 to +1.0)
--- some subjectivity ratings are negative although the documented range is +0.0 to +1.0 > the most negative is -1.0
--- out-of-range value: 35.0 for "plein"
- polarity (negative/positive, -1.0 to +1.0) 
---
- pos (NN = noun, JJ = adjective, ...)
--- "RB" > adverb
--- "VB" > verb (adore, adorer, adoré, aime, aimerait, aimé, apprécié, arrêté (but arrêtée, arrêtées, arrêtés = JJ), conseillé, découverte, découvrir, dévorer, dévoré, ennui, ennuie, ennuyer (ennuyer as parent of ennui and ennuie?), essai, inspiré, moque, pleurer, recommande, recommandé, rigoler, rigolé, rire, soupir, sourire, vomir) 
--- "NN" > noun (atrocité, bijou, bonheur, cadeau, catastrophe, chef-d'oeuvre, coeur, con, connerie, d'émotion, d'émotions, dommage, déception, désastre, favori, foutaise, horreur, l'horreur, plaisir, poubelle, pépite, qualité, richesse, spontanéité, talent, volonté / émotion, émotions)
--- "UH" > interjection (hélas, non, oui, waouh)
- form
--- masculine, feminine, singular, plural
--- how are these variations rated, compared with the English adjectives (one form each)?
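
The range problems noted in this list could be listed systematically. The following is only a sketch, assuming the documented ranges (polarity -1.0 to +1.0, subjectivity +0.0 to +1.0) and a flat list of <word .../> elements; the <sentiment> root element and the function name are assumptions. The sample uses two entries quoted in these notes.

```python
import xml.etree.ElementTree as ET

# Documented ranges, taken from the header of the lexicon.
RANGES = {"polarity": (-1.0, 1.0), "subjectivity": (0.0, 1.0)}

# Two entries quoted in these notes; the root element name is an assumption.
SAMPLE = """<sentiment>
<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrant"/>
<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemand"/>
</sentiment>"""


def out_of_range(xml_text):
    """Return (form, attribute, value) for every rating outside its range."""
    problems = []
    for word in ET.fromstring(xml_text).iter("word"):
        for attr, (low, high) in RANGES.items():
            raw = word.get(attr)
            if raw is None:
                continue  # attribute absent: nothing to check
            value = float(raw)
            if not low <= value <= high:
                problems.append((word.get("form"), attr, value))
    return problems


for form, attr, value in out_of_range(SAMPLE):
    print(form, attr, value)
```

Only "abracadabrant" is reported here, for its subjectivity of -0.15; its polarity of -0.18 is inside the documented range.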

Some selected adjectives analysis

01) "abracadabrant"


<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrant"/>

<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrante"/>

<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrantes"/>

<word confidence="0.7" intensity="1.0" subjectivity="-0.15" polarity="-0.18" pos="JJ" form="abracadabrants"/>

--- is "abracadabrant" a common word in French? 
--- confidence only rated "0.7" > inferred
--- subjectivity should be rated between +0.0 and +1.0, but is rated "-0.15" > does it mean that objectivity is rated between -1.0 and +0.0? or is it an error?
--- pattern-2.6\docs\html : subjectivity (objective/subjective, +0.0 to +1.0) > is the file read anyway?

02) "allemand"


<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemand"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemande"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemandes"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemands"/>

--- different confidence > the masculine singular form has the highest confidence; are the other inflected forms inferred from it?
--- when other forms are built with "-e", "-s", "-es", is it easier to deduce their ratings?

*****

<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglais"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglaise"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglaises"/>

*****

<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="ancien"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="ancienne"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="anciennes"/>

<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="anciens"/>

--- "-ne", "-nes", "-s"
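
To look at the confidence pattern across inflected forms, the entries can be grouped under their shortest form. This is a rough sketch only: treating a form as inflected when it starts with a shorter form is a hypothetical heuristic (not how the lexicon was built), and the <sentiment> root element is an assumption. The sample repeats the entries quoted above.

```python
import xml.etree.ElementTree as ET

# Entries quoted above; the root element name is an assumption.
SAMPLE = """<sentiment>
<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemand"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemande"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemandes"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="allemands"/>
<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglais"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglaise"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.07" pos="JJ" form="anglaises"/>
<word confidence="1.0" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="ancien"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="ancienne"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="anciennes"/>
<word confidence="0.9" intensity="1.0" subjectivity="0.00" polarity="0.04" pos="JJ" form="anciens"/>
</sentiment>"""


def group_by_base(xml_text):
    """Group forms under the shortest form they start with (heuristic)."""
    entries = [(w.get("form"), float(w.get("confidence")))
               for w in ET.fromstring(xml_text).iter("word")]
    entries.sort(key=lambda entry: len(entry[0]))  # shortest forms first
    groups = {}
    for form, confidence in entries:
        # Reuse an existing base if this form extends it, else start a group.
        base = next((b for b in groups if form.startswith(b)), form)
        groups.setdefault(base, {})[form] = confidence
    return groups


for base, forms in group_by_base(SAMPLE).items():
    print(base, forms)
```

In this sample the base form of each group has confidence 1.0 and every other form has 0.9. The prefix heuristic would mis-group unrelated words that happen to share a beginning, so it is only suitable for spot checks.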

03) "transalpin"


<word confidence="0.7" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="transalpin"/>

<word confidence="0.7" intensity="1.0" subjectivity="0.00" polarity="0.00" pos="JJ" form="transalpines"/>

04) Waouh = UH (interjection)


Twitter search
@NintendoFrance
@GossipOnlineFR
What about "Zut"?

05) RB = adverb (Penn Treebank part-of-speech tags)


The file is mainly composed of adjectives (JJ), except for a few adverbs (such as "absolument", "très", "agréablement", "beaucoup").

All are tagged with a confidence of 0.8 or 0.9.


Corpora


pattern-2.6\test\corpora
"The purpose of the corpora is for testing and evaluating the functionality in the Pattern module. These are not the original corpora; but samples that have been reduced in size and/or balanced. The original corpora can be found by following the links below."
- some links are accurate:
3) Clough & Stevenson's plagiarism corpus http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
- others are not:
5) Amazon.fr French book reviews http://www.amazon.fr/

5) Amazon.fr French book reviews
--- "750 "positive" and 750 "negative" movie reviews."
--- "abracadabrant" 
> is not in "wordforms-fr-lexique.csv" 
> "polarity-fr-amazon.csv" > rated 1 or -1
--- line 455 / -1 / "sa mort abracadabrante"
--- line 509 / -1 / "l'idée est abracadabrante"
--- line 654 / ? / "sa mort abracadabrante"
--- line 832 / -1 / "Les scènes sont abracadabrantes"
--- line 1267 / -1 / "histoire abracadabrantesque"
--- line 1474 / -1 / "histoire abracadabrantesque"
--> are there duplicates?
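
A sketch of how the search and the duplicate question could be automated, assuming the reviews are available as (line number, text) pairs; the function names are hypothetical. The sample holds only the fragments quoted above, not the full reviews, so identical fragments here suggest where to look and do not prove that the whole lines are duplicates.

```python
from collections import defaultdict

# Fragments quoted above, with their line numbers in "polarity-fr-amazon.csv".
FRAGMENTS = [
    (455, "sa mort abracadabrante"),
    (509, "l'idée est abracadabrante"),
    (654, "sa mort abracadabrante"),
    (832, "Les scènes sont abracadabrantes"),
    (1267, "histoire abracadabrantesque"),
    (1474, "histoire abracadabrantesque"),
]


def find_occurrences(lines, word):
    """Map each text containing `word` (case-insensitive) to its line numbers."""
    hits = defaultdict(list)
    for number, text in lines:
        normalized = text.strip().lower()
        if word.lower() in normalized:
            hits[normalized].append(number)
    return dict(hits)


def repeated(hits):
    """Keep only the texts that occur on more than one line."""
    return {text: numbers for text, numbers in hits.items() if len(numbers) > 1}


hits = find_occurrences(FRAGMENTS, "abracadabrant")
for text, numbers in repeated(hits).items():
    print(numbers, text)
```

With the fragments above, lines 455 and 654 match each other, as do lines 1267 and 1474.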

+ Exercises

Looking for the words "abracadabrant" and "gogol" in the "polarity-fr-amazon.csv" file, and for "waouh" on Twitter and Google (modifying the twitter.py file)

23) Lexique 3 French word forms
http://www.lexique.org/
--- 2000 word forms with lemma and part-of-speech.
--> Has the "fr-sentiment.xml" file been generated from those 2 files?