This archive contains supplementary material for our paper:

Christian M. Meyer and Iryna Gurevych: What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage, in: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), November 2011. Chiang Mai, Thailand.
http://www.ukp.tu-darmstadt.de/data/sense-alignment/

Please cite this paper if you plan to use our datasets.


--------------------------------------------------------------------------------

The following files are available from our homepage:

1) ijcnlp2011-meyer-data-alignment.tsv.bz2

A full alignment of Wiktionary and WordNet in a bzip2 compressed, tab-separated file.

File format:
* WN_SYNSET: the synset id of WordNet 3.0. The id consists of a synset's database offset and its part of speech tag. See WordNet documentation for more information on synset ids: http://wordnet.princeton.edu/.
* WKT_ID: the sense id of the English Wiktionary edition from April 3, 2010 as generated by the Java-based Wiktionary Library (JWKTL) Version 0.15.2. See JWKTL homepage for more information on sense ids: http://www.ukp.tu-darmstadt.de/software/jwktl/. If you do not have a parsed Wiktionary edition for this date at hand, you can use the dataset described in (3).

This file contains all WordNet synsets and all Wiktionary senses. If a certain synset or sense has no counterpart in the respective other resource, "null" is printed instead of the synset or sense id.
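A minimal sketch of how this file could be read in Python, using only the standard library. It is not part of the original distribution. The function name read_alignment is hypothetical, and the sketch assumes that the file is UTF-8 encoded, that the columns appear in the order listed above (WN_SYNSET, WKT_ID), and that there is no header row; if the file does start with a header row, skip the first pair.

```python
import bz2
import csv


def read_alignment(path):
    """Yield (wn_synset, wkt_id) pairs from the compressed alignment file.

    The literal string "null" marks a synset or sense without a counterpart
    and is converted to None. Column order and encoding are assumptions.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            wn_synset, wkt_id = (None if value == "null" else value
                                 for value in row[:2])
            yield wn_synset, wkt_id
```

Pairs in which neither element is None are the actual alignments; the remaining pairs list the synsets and senses that have no counterpart in the other resource.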


--------------------------------------------------------------------------------

2) ijcnlp2011-meyer-data-classification.tsv.bz2

The classification data used to create the alignment described in (1) in a bzip2 compressed, tab-separated file.

File format:
* WN_SYNSET: the WordNet 3.0 synset id, see (1).
* WKT_ID: the Wiktionary sense id, see (1).
* SIM_COS: the similarity score of the COS measure described in the paper.
* SIM_PPR: the similarity score of the PPR measure described in the paper.
* IS_ALIGNED: the automatic judgment of our classifier on whether the WordNet synset and the Wiktionary sense should be aligned (= "1") or not aligned (= "0").
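
The following sketch shows one way to turn a single line of this file into typed values in Python. The names ClassificationRow and parse_classification_line are hypothetical. The sketch assumes that the five columns appear in the order listed above and that both similarity scores are written in a form that Python's float() accepts.

```python
from typing import NamedTuple


class ClassificationRow(NamedTuple):
    wn_synset: str
    wkt_id: str
    sim_cos: float
    sim_ppr: float
    is_aligned: bool


def parse_classification_line(line: str) -> ClassificationRow:
    """Parse one tab-separated line (assumed column order, see above)."""
    wn_synset, wkt_id, sim_cos, sim_ppr, is_aligned = (
        line.rstrip("\r\n").split("\t")
    )
    return ClassificationRow(
        wn_synset,
        wkt_id,
        float(sim_cos),
        float(sim_ppr),
        is_aligned == "1",  # "1" = aligned, "0" = not aligned
    )
```

A line that does not contain exactly five columns raises a ValueError, so a header row, if present, has to be skipped before calling the function.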


--------------------------------------------------------------------------------

3) ijcnlp2011-meyer-data-wiktionary.tsv.bz2

An excerpt of the English Wiktionary edition from April 3, 2010 parsed with the Java-based Wiktionary Library (JWKTL) Version 0.15.2. The data is available as a bzip2 compressed, tab-separated file.

File format:
* WKT_ID: the Wiktionary sense id for this Wiktionary edition. The ids correspond to our alignment results of the files (1) and (2).
* LEXEME: the lexeme that the Wiktionary article describes; e.g. 'plant'.
* POS: the lexeme's part of speech tag (N = noun, V = verb, A = adjective, R = adverb, ? = other).
* GLOSS: the sense gloss; e.g. "an organism that is not an animal [...]" for 'plant'.
* EXAMPLES: an example sentence for this sense; e.g. "The garden had a couple of [...] plants around the border" for 'plant' (might be empty).
* SYNONYMS: synonymous words for this sense; e.g. 'gratis' for 'free' (might be empty; multiple synonyms are separated by semicolons).
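
As a minimal sketch, a line of this file could be parsed in Python as shown below. The function name parse_wiktionary_line and the dictionary keys are hypothetical. The sketch assumes that the six columns appear in the order listed above, and it tolerates lines in which trailing empty columns are missing.

```python
def parse_wiktionary_line(line):
    """Parse one tab-separated line into a dict (assumed column order)."""
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (6 - len(fields))  # pad missing trailing columns
    wkt_id, lexeme, pos, gloss, example, synonyms = fields[:6]
    return {
        "wkt_id": wkt_id,
        "lexeme": lexeme,
        "pos": pos,  # N, V, A, R or ?
        "gloss": gloss,
        "example": example,  # may be the empty string
        # multiple synonyms are separated by semicolons; may be empty
        "synonyms": [s.strip() for s in synonyms.split(";") if s.strip()],
    }
```

An empty SYNONYMS column yields an empty list rather than a list containing one empty string, which keeps later membership tests simple.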


--------------------------------------------------------------------------------

4) ijcnlp2011-meyer-dataset.txt

The annotated dataset used for the evaluation of our work.

File format:
* WN synset offset: the WordNet 3.0 synset offset, which is used to form the synset id.
* pos: the WordNet 3.0 part of speech tag of the synset, which is used to form the synset id.
* lemma: Wiktionary's lexeme.
* WKT id: the sense id in Wiktionary, as described in (1).
* annotation: the gold standard annotation for the sense pair of WordNet synset and Wiktionary sense.


--------------------------------------------------------------------------------

5) ijcnlp2011-meyer-dataset_annotation-guidebook.pdf

The corresponding annotation guidebook that was given to the annotators.


--------------------------------------------------------------------------------

License issues.

The Wiktionary dataset is available under the Creative Commons Attribution/Share-Alike License (CC-BY-SA). See http://creativecommons.org/licenses/by-sa/3.0/ and http://www.wiktionary.org/ for details.

WordNet is a registered trademark of Princeton University. Please refer to http://wordnet.princeton.edu for further details. The data can also be obtained from their homepage.


--------------------------------------------------------------------------------

Contact.

In case of any questions, please don't hesitate to contact the corresponding author Christian M. Meyer: http://www.ukp.tu-darmstadt.de/people/christian-m-meyer/


