Els Lefever (els.lefever@hogent.be)
Véronique Hoste (veronique.hoste@hogent.be)
You can subscribe to our dedicated Google group at Cross-Lingual Word Sense Disambiguation.
Note: If you do not have a Google account, you can also subscribe to this discussion group by sending an email to els DOT lefever AT hogent DOT be, mentioning the email address you would like to use.
There is a general feeling in the WSD community that WSD should not be considered an isolated research task, but should be integrated into real NLP applications such as machine translation or multilingual IR. Using translations from a corpus instead of human-defined sense labels (e.g. from WordNet) makes it easier to integrate WSD into multilingual applications, and solves the granularity problem, which may itself be task-dependent. Furthermore, this type of corpus-based approach is language-independent and can be a valid alternative for languages that lack sufficient sense inventories and sense-tagged corpora.
We propose an Unsupervised Word Sense Disambiguation task for English nouns by means of parallel corpora. The sense label is composed of translations in the different languages, and the sense inventory is built on the basis of the Europarl parallel corpus. All translations of a polysemous word (above a predefined frequency threshold) are grouped into clusters, or "senses", of that word. The sense inventory for all target nouns in the development and test data will be built manually by three annotators using a concordance tool. The translations are clustered by consensus; if the annotators cannot reach consensus, we will apply soft clustering to that particular translation (assigning it to two or more different clusters).
The example below (English noun "paper") shows a number of clusters that are based on translations retrieved from part of the Europarl Corpus in four languages. In case the annotators do not agree on the number of clusters, a soft-clustering approach will be used on the cluster level.
| English: paper | Dutch | French | Italian |
|---|---|---|---|
| Cluster 1 "green paper" | boek, verslag, wetsvoorstel, kaderbesluit | livre, document, paquet | libro |
| Cluster 2 "present a paper" | document, voorstel, paper, nota, stuk, notitie | document, rapport, travail, publication, note, proposition, avis | documento, rapporto, testo, nota |
| Cluster 3 "read a paper" | krant, dagblad, weekblad | journal, quotidien, hebdomadaire | giornale, quotidiano, settimanale, rivista |
| Cluster 4 "reams of paper" | papier | papier | carta, cartina |
| Cluster 5 "of paper", "paper industry", "paper basket" | papieren, papier, prullenmand | papeterie, papetière, papier | cartastraccia, cartaceo, cartiera |
| Cluster 6 "voting paper", "ballot paper" | stembiljet, stembriefje | bulletin, vote | scheda, scheda di voto |
| Cluster 7 "piece of paper" | papiertje | papier volant | foglio, foglietto |
| Cluster 8 "excess of paper", "generate paper" | papier, administratie, administratief | paperasse, paperasserie, papier, administratif, bureaucratie | carta, amministrativo, burocratico, cartaceo |
| Cluster 9 "on paper" | in theorie, op papier, papieren, bij woorden | en théorie, conceptuellement | in teoria, di parole |
| Cluster 10 "on paper" | op papier | écrit, dans les textes, de nature typographique, par voie épistolaire, sur (le) papier | nero su bianco, di natura tipografica, per iscritto, cartaceo |
| Cluster 11 "order paper" | agenda, zittingstuk, stuk | ordre du jour, ordre des votes | ordine del giorno |
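To make the structure of such a sense inventory concrete, the following minimal Python sketch stores the Dutch translations of a few "paper" clusters and looks up which clusters a given translation belongs to. The dictionary layout and the `build_index` helper are illustrative assumptions, not the format in which the inventory is distributed; only the cluster contents are copied from the table. It shows that one translation may legitimately belong to several clusters, which is also what soft clustering produces.

```python
from collections import defaultdict

# Dutch column of four "paper" clusters, copied from the table above.
# The dictionary layout is an assumption made for illustration only.
DUTCH_CLUSTERS = {
    "Cluster 3": ["krant", "dagblad", "weekblad"],
    "Cluster 4": ["papier"],
    "Cluster 5": ["papieren", "papier", "prullenmand"],
    "Cluster 8": ["papier", "administratie", "administratief"],
}


def build_index(clusters):
    """Map each translation to the sorted list of clusters that contain it."""
    index = defaultdict(set)
    for cluster_id, translations in clusters.items():
        for translation in translations:
            index[translation].add(cluster_id)
    return {translation: sorted(ids) for translation, ids in index.items()}


index = build_index(DUTCH_CLUSTERS)
print(index["krant"])   # ['Cluster 3']
print(index["papier"])  # ['Cluster 4', 'Cluster 5', 'Cluster 8']
```

A system that outputs "papier" is therefore compatible with several senses, whereas "krant" identifies a single cluster.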
English
Dutch
French
German
Italian
Spanish
1. Bilingual Evaluation (English - Language X)
Example:
Sense Label = {oever/dijk} [Dutch]
Sense Label = {rives/rivage/bord/bords} [French]
Sense Label = {Ufer} [German]
Sense Label = {riva} [Italian]
Sense Label = {orilla} [Spanish]
2. Multi-lingual Evaluation (English - all target languages)
Example:
As the task is formulated as an unsupervised WSD task, we will not annotate any training material. Participants can use the freely available Europarl corpus. We have extracted the intersection of the five bilingual Europarl corpora (English-French, English-Italian, English-Spanish, English-German and English-Dutch), which results in a sentence-aligned six-language corpus of 884,603 sentences.
The corpus can be downloaded at Europarl_Intersection.
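The idea of intersecting bilingual corpora can be sketched as follows. This description does not say how the intersection was actually computed; the sketch assumes, purely for illustration, that each bilingual corpus is a list of (English sentence, translation) pairs and that an identical English sentence string identifies the same sentence across corpora. The function name `intersect_corpora` and the toy data are hypothetical.

```python
def intersect_corpora(bilingual):
    """Keep only the English sentences that occur in every bilingual corpus.

    `bilingual` maps a language code to a list of (english, translation)
    pairs. Assumption: the English string itself is the alignment key.
    Returns one dict per shared sentence, in the order of the first corpus.
    """
    per_lang = {lang: dict(pairs) for lang, pairs in bilingual.items()}
    shared = set.intersection(*(set(d) for d in per_lang.values()))

    first_corpus = next(iter(bilingual.values()))
    rows, seen = [], set()
    for english, _ in first_corpus:
        if english in shared and english not in seen:
            seen.add(english)
            row = {"en": english}
            for lang, lookup in per_lang.items():
                row[lang] = lookup[english]
            rows.append(row)
    return rows


# Toy placeholder data; "s1".."s4" stand in for English sentences.
toy = {
    "nl": [("s1", "nl-1"), ("s2", "nl-2"), ("s3", "nl-3")],
    "fr": [("s1", "fr-1"), ("s3", "fr-3")],
    "de": [("s3", "de-3"), ("s1", "de-1"), ("s4", "de-4")],
}
rows = intersect_corpora(toy)
print([row["en"] for row in rows])  # ['s1', 's3']
```

Only sentences present in all corpora survive, which is why the intersected corpus is smaller than any individual bilingual corpus.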
Participants are free to use other training corpora, but additional senses/translations (which are not present in Europarl) will not be included in the sense inventory that is used for evaluation.
We will manually annotate development and test data. For the test data, native speakers will decide on the correct translation cluster(s) for each test sentence and give their top-3 translations from the predefined list of Europarl translations (see Evaluation).
We will use an evaluation scheme that is inspired by the English lexical substitution task in SemEval 2007. The evaluation will be done using precision and recall. We will perform both a "best result" evaluation (the first translation returned by a system) and a more relaxed evaluation for the "top five" results (the first five translations returned by a system). In order to assign weights to the candidate translations in the answer cluster(s) for each test sentence, native speakers will pick the three most appropriate translations from the predefined sense inventory.
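The exact scoring formulas are not given in this description. As a rough sketch of how weighted "best result" and "top five" precision and recall could be computed, the Python below assumes that each gold translation carries a weight equal to the number of annotators who picked it, that credit for a test sentence is the matched weight divided by the total gold weight, that precision averages credit over attempted sentences, and that recall averages it over all sentences. The function name `score`, the item identifiers, and the weights are hypothetical.

```python
def score(gold, system, top_n=1):
    """Return (precision, recall) over the first `top_n` system answers.

    gold:   item id -> {translation: weight}; weights assumed positive.
    system: item id -> ordered list of translations; a missing or empty
            list means the item was not attempted.
    """
    credits = []
    for item_id, weights in gold.items():
        answers = system.get(item_id) or []
        if not answers:
            continue  # unattempted items lower recall but not precision
        # Drop duplicate answers while preserving order, then truncate.
        considered = list(dict.fromkeys(answers))[:top_n]
        matched = sum(weights.get(t, 0) for t in considered)
        credits.append(matched / sum(weights.values()))

    precision = sum(credits) / len(credits) if credits else 0.0
    recall = sum(credits) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold weights for two test sentences; t2 is not attempted.
gold = {
    "t1": {"oever": 3, "dijk": 1},
    "t2": {"dijk": 2, "oever": 1},
}
system = {"t1": ["oever", "dijk"]}

print(score(gold, system, top_n=1))  # (0.75, 0.375)
print(score(gold, system, top_n=5))  # (1.0, 0.5)
```

Under these assumptions the relaxed "top five" setting can only raise a system's score, since every additional answer that matches a gold translation adds credit.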
Els Lefever and Véronique Hoste (2009), SemEval-2010 Task 3: Cross-lingual Word Sense Disambiguation, Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), Boulder, Colorado, pp. 82-87.