TEST CORPORA
============
These corpora are used to test and evaluate the functionality of the Pattern module. They are not the original corpora, but samples that have been reduced in size and/or balanced. The original corpora can be found by following the links below.
#!PATTERN+ : More accuracy in the phrasing and in the information is needed. The compilation seems to have been made hastily, leaving assumptions implicit and giving no information about who assembled it.
The corpora are meant for personal use; they are not covered by the module's BSD license.
#!PATTERN+ : What is personal use of software? What is the legal status of the corpus files?
1) Through the Looking-Glass, written by Lewis Carroll
- carroll-lookingglass.pdf
#!PATTERN+ : The distributed file is actually called carroll-lookingglass.docx, not a PDF.
- http://www.gutenberg.org/
#!PATTERN+ : The text is available at https://www.gutenberg.org/ebooks/12 in many formats, but not as .docx. How and why was it selected as an example of the English language?
- Chapter 1 of Through the Looking-Glass in Office Open XML format.
2) Alice in Wonderland, written by Lewis Carroll
- carroll-wonderland.pdf
- http://www.gutenberg.org/
#!PATTERN+ : The text is available at https://www.gutenberg.org/ebooks/11. How and why was it selected as an example of the English language?
- Full text of Alice in Wonderland in PDF format.
3) Clough & Stevenson's plagiarism corpus
- plagiarism-clough&stevenson.csv
- http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
#!PATTERN+ : The link describes the content and the construction of the corpus. The authors of the corpus are scholars in information science and computer science. The attribution here does not follow the citation requested by the authors (Clough, P. and Stevenson, M. Developing a Corpus of Plagiarised Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, In Press). The corpus license reads: "Corpus of Plagiarised Short Answers by Paul Clough and Mark Stevenson is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License."
- 100 texts: authentic (0), heavy revision (1), light revision (2), cut & paste (3).
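How the labels might be read in practice is sketched below, assuming one text per row with the plagiarism level stored as an integer in a second column; the exact column layout of the CSV is an assumption, not documented here.

    import csv

    # Assumed mapping of the integer labels listed above.
    LEVELS = {0: "authentic", 1: "heavy revision", 2: "light revision", 3: "cut & paste"}

    counts = dict.fromkeys(LEVELS.values(), 0)
    with open("plagiarism-clough&stevenson.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            text, label = row[0], int(row[1])      # assumed column order: text, label
            counts[LEVELS[label]] += 1

    print(counts)                                  # the counts should sum to 100 texts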
4) Amazon.de German book reviews
- polarity-de-amazon.csv
- http://www.amazon.de/gp/bestsellers/books/
#!PATTERN+ : The URL does not give much information, such as the method for capturing the data or the date of capture. How was the selection of the 100 "positive" and 100 "negative" book reviews made? Are the reviews for the 100 best-selling or most-commented books (1 positive / 1 negative per book), or are they all the comments on a selection of "best-sellers"? Was the choice based on the reviewer's rating or on the number of votes a comment received? What categories guided the choice of books? Who assembled the file? Was the same method used for the French and the German reviews (despite the disparity in number and URL)?
- 100 "positive" and 100 "negative" book reviews.
5) Amazon.fr French book reviews
- polarity-fr-amazon.csv
- http://www.amazon.fr/
#!PATTERN+ : The URL does not give much information, such as the method for capturing the data or the date of capture. Are the reviews for the 750 best-selling books (and if so, in physical and/or digital editions?) or for the most-commented ones? Is it 1 positive and 1 negative review per book? Was the choice guided by the number of "stars", by the reviewer's rating, or by the number of votes a comment received? What categories guided the choice of books? Who assembled the file? Was the same method used for the French and the German reviews (despite the disparity in number and URL)? A quick look at the content of the file suggests that the reviews are mostly about novels (are they from the French literature section only?). One comment concerns G. Perec's La vie mode d'emploi, a book which (at the time of writing this note, 25 August 2015) is rated 4.1 "stars" out of 5 by commenters while ranking 24,592nd in sales of French literature books. It has 21 comments, and the one in the CSV file has been rated "useful" by 10 people out of 12 (that reviewer has an 87% usefulness rate and is 82nd in the list of top reviewers). Another comment in the file is about the same book and is negative (rated two stars on Amazon). That comment was written in 2011 (the same year as the positive one above). Two one-star reviews of the book were online on 25/08/2015 but were written in 2013 and 2015, so the two-star comment may have been the most negative one available at the time of capture. Amazon.fr appears to treat 4-5 stars as positive comments and 1-3 stars as negative ones (the website calls them "critiques" - critical - rather than negative).
- 750 "positive" and 750 "negative" book reviews.
6) Pang & Lee's sentence polarity dataset v1.0
- polarity-en-pang&lee1.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 2000 "positive" and 2000 "negative" sentences.
7) Pang & Lee's polarity dataset v2.0
- polarity-en-pang&lee2.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 750 "positive" and 750 "negative" movie reviews.
8) Bol.com Dutch book reviews
- polarity-nl-bol.com.csv
- http://www.bol.com/nl/m/nederlandse-boeken/literatuur/
- 1500 "positive" and 1500 "negative" book reviews.
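As an illustration of how the polarity files above might be used in a test, the sketch below scores the English Pang & Lee sentences with pattern.en.sentiment and compares the sign of the score to the gold label. The two-column layout (text, label) and the "positive"/"negative" label values are assumptions about the CSV, not documented facts.

    import csv
    from pattern.en import sentiment     # returns (polarity, subjectivity), polarity in [-1, +1]

    correct = total = 0
    with open("polarity-en-pang&lee1.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            text, label = row[0], row[-1]            # assumed columns: sentence, gold label
            polarity, subjectivity = sentiment(text)
            predicted = "positive" if polarity >= 0 else "negative"
            correct += (predicted == label)          # assumes labels read "positive"/"negative"
            total += 1

    print("agreement: %.2f" % (correct / float(total)))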
9) German portion of Tiger Treebank (Brants et al.)
- tagged-de-tiger.txt
- http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora
- 250 German sentences with STTS part-of-speech tags.
10) English portion of Open American National Corpus (Ide et al.)
- tagged-en-oanc.txt
- http://www.anc.org/data/oanc/
- 1000 English sentences with Penn Treebank part-of-speech tags.
11) English portion of Penn Treebank (Marcus et al.)
- tagged-en-wsj.txt
- http://www.cis.upenn.edu/~treebank/home.html
- 1000 English sentences with Penn Treebank part-of-speech tags.
12) Spanish portion of Wikicorpus v.1.0 (Reese & Boleda et al.)
- tagged-es-wikicorpus.txt
- http://www.lsi.upc.edu/~nlp/wikicorpus/
- 1000 Spanish sentences with Parole part-of-speech tags.
13) Italian portion of WaCky Corpus (Baroni et al.)
- tagged-it-wacky.txt
- http://wacky.sslmit.unibo.it/doku.php?id=corpora
- 1000 Italian sentences with Penn Treebank II part-of-speech tags.
14) Dutch portion of Twente Nieuws Corpus (Ordelman et al.)
- tagged-nl-twnc.txt
- http://hmi.ewi.utwente.nl/TwNC
- 1000 Dutch sentences with Wotan part-of-speech tags.
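The tagged files above can serve as a reference for the taggers in the language modules. The sketch below assumes a plain word/TAG token format with one sentence per line (the actual file layout is not documented here and may differ) and compares the output of pattern.en.tag with the gold tags of the Penn Treebank sample.

    from pattern.en import tag           # tag(string) returns a list of (word, part-of-speech) tuples

    hits = tokens = 0
    with open("tagged-en-wsj.txt", encoding="utf-8") as f:
        for line in f:
            pairs = [t.rsplit("/", 1) for t in line.split() if "/" in t]   # assumed "word/TAG" tokens
            if not pairs:
                continue
            words = " ".join(w for w, _ in pairs)
            gold = [g for _, g in pairs]
            predicted = [p for _, p in tag(words)]
            if len(predicted) == len(gold):  # only compare when tokenization matches the gold segmentation
                hits += sum(p == g for p, g in zip(predicted, gold))
                tokens += len(gold)

    print("tag accuracy: %.2f" % (hits / float(tokens)))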
15) Apache SpamAssassin public mail corpus
- spam-apache.csv
- http://spamassassin.apache.org/publiccorpus/
- 125 "spam" and 125 (mostly technical) "ham" messages.
16) Birkbeck spelling error corpus
- spelling-birkbeck.csv
- http://www.ota.ox.ac.uk/headers/0643.xml
- 500 words and how they are commonly misspelled.
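One way the spelling corpus might be exercised is to check whether the highest-ranked suggestion from pattern.en.suggest recovers the intended word. Both the column order (correct word, misspelling) and the availability of suggest() in the installed Pattern version are assumptions here.

    import csv
    from pattern.en import suggest       # suggest(word) returns (candidate, confidence) tuples

    recovered = total = 0
    with open("spelling-birkbeck.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            correct, misspelled = row[0], row[1]     # assumed column order
            best = suggest(misspelled)[0][0]         # highest-confidence candidate
            recovered += (best == correct)
            total += 1

    print("top-1 recovery: %.2f" % (recovered / float(total)))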
17) CoNLL 2010 Shared Task 1 - Wikipedia uncertainty
- uncertainty-conll2010.csv
- http://www.inf.u-szeged.hu/rgai/conll2010st/tasks.html#task1
- 1500 "certain" and 1500 "uncertain" Wikipedia sentences.
18) Celex 2.5 German word forms
- wordforms-de-celex.csv
- http://celex.mpi.nl/
- 250 singular nouns and their plural form.
- 250 predicative adjectives and their attributive form.
19) Celex 2.5 English word forms
- wordforms-en-celex.csv
- http://celex.mpi.nl/
- 4000 singular nouns and their plural form.
20) Celex 2.5 Dutch word forms
- wordforms-nl-celex.csv
- http://celex.mpi.nl/
- 1000 singular nouns and their plural form.
- 1000 predicative adjectives and their attributive form.
21) Davies Corpus del Español word forms
- wordforms-es-davies.csv
- http://www.wordfrequency.info/files/spanish/spanish_lemmas20k.txt
- 3000 word forms with lemma, part-of-speech and frequency.
22) Wiktionary Italian word forms
- wordforms-it-wiktionary.csv
- https://en.wiktionary.org/wiki/Category:Italian_language
- 2000 word forms with lemma, part-of-speech and gender.
23) Lexique 3 French word forms
- wordforms-fr-lexique.csv
- http://www.lexique.org/
- 2000 word forms with lemma and part-of-speech.
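The word-form files lend themselves to quick regression checks of the inflection functions. Below is a minimal sketch for the English Celex nouns, assuming each row holds a singular form followed by its plural (the column layout is an assumption).

    import csv
    from pattern.en import pluralize     # rule-based pluralization

    correct = total = 0
    with open("wordforms-en-celex.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            singular, plural = row[0], row[1]        # assumed column order: singular, plural
            correct += (pluralize(singular) == plural)
            total += 1

    print("pluralize() accuracy: %.2f" % (correct / float(total)))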