[open-linguistics] Wiktionary RDF-extraction with DBpedia for en and de
Jonathan Pool
pool at utilika.org
Fri Dec 23 07:18:27 UTC 2011
The University of Washington Turing Center extracted data from Wiktionaries for the TransGraph database around 2006, but did not publish its methods for re-use by others.
If your extractor were extended to cover word classes, definitions, and translations, I could use its output as input to PanLex and thereby better integrate Wiktionary data with data from other resources (http://utilika.org/info/plrefs.shtml).
For word-class categories, it seems to me that the OLIF list (in 3.2.1 on page 14 of http://www.olif.net/documents/NewOLIFstruct&content.pdf) resembles the categories that generally appear in conventional lexicographic resources more closely than the GOLD list does. In PanLex, we have somewhat extended the OLIF list to:
adjv adjective
advb adverb
affx affix
auxv auxiliary verb
conj conjunction
detr determiner
ijec interjection
misc miscellaneous
name proper noun
noun noun
post postposition
prep preposition
pron pronoun
verb verb
vpar verb particle
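As a rough illustration of how an extractor might emit these codes, here is a minimal Python sketch. The code/name pairs are copied from the list above; the helper name word_class_code and the fallback to "misc" for unrecognized labels are assumptions made for the example, not a description of how PanLex actually works.

```python
# Word-class codes exactly as listed in the message above.
WORD_CLASSES = {
    "adjv": "adjective",
    "advb": "adverb",
    "affx": "affix",
    "auxv": "auxiliary verb",
    "conj": "conjunction",
    "detr": "determiner",
    "ijec": "interjection",
    "misc": "miscellaneous",
    "name": "proper noun",
    "noun": "noun",
    "post": "postposition",
    "prep": "preposition",
    "pron": "pronoun",
    "verb": "verb",
    "vpar": "verb particle",
}

# Reverse index: full name -> four-letter code.
_CODE_BY_NAME = {name: code for code, name in WORD_CLASSES.items()}


def word_class_code(label: str) -> str:
    """Return the four-letter code for a word-class label.

    Hypothetical helper: mapping unknown labels to "misc" is an
    assumption for this sketch, not documented PanLex behaviour.
    """
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(label.lower().split())
    return _CODE_BY_NAME.get(key, "misc")
```

An extractor could call word_class_code on each part-of-speech heading it finds in a Wiktionary entry and pass the resulting code downstream; headings that do not match one of the fifteen names would need a mapping of their own.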
For language identifiers, I have found it useful to combine ISO 639-2 collective codes with ISO 639-3 and ISO 639-5 codes, supplemented by differentiators for the varieties that lexicographic resources distinguish (http://panlex.org/u). (Safari 5.1 opens pages like this very slowly.)