[open-linguistics] Wiktionary RDF-extraction with DBpedia for en and de

Jonathan Pool pool at utilika.org
Fri Dec 23 07:18:27 UTC 2011


The University of Washington Turing Center extracted data from Wiktionaries for the TransGraph database in about 2006, but didn't publish its methods for reuse by others.

If your extractor were extended to cover word classes, definitions, and translations, I could use its output as input to PanLex and thereby better integrate Wiktionary data with data from other resources (http://utilika.org/info/plrefs.shtml).

For word-class categories, it seems to me that the OLIF list (in 3.2.1 on page 14 of http://www.olif.net/documents/NewOLIFstruct&content.pdf) resembles the categories that generally appear in conventional lexicographic resources more closely than the GOLD list does. In PanLex, we have somewhat extended the OLIF list to:

adjv	adjective
advb	adverb
affx	affix
auxv	auxiliary verb
conj	conjunction
detr	determiner
ijec	interjection
misc	miscellaneous
name	proper noun
noun	noun
post	postposition
prep	preposition
pron	pronoun
verb	verb
vpar	verb particle
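
As a rough sketch of how an extractor might emit these codes, the list above can be held in a lookup table. The codes and names are copied from the list; the function name word_class_code, the normalization of headings, and the fallback to misc for unrecognized headings are assumptions made for illustration, not part of PanLex or of any existing extractor.

```python
# Word-class codes and names, copied from the list above.
PANLEX_WORD_CLASSES = {
    "adjv": "adjective",
    "advb": "adverb",
    "affx": "affix",
    "auxv": "auxiliary verb",
    "conj": "conjunction",
    "detr": "determiner",
    "ijec": "interjection",
    "misc": "miscellaneous",
    "name": "proper noun",
    "noun": "noun",
    "post": "postposition",
    "prep": "preposition",
    "pron": "pronoun",
    "verb": "verb",
    "vpar": "verb particle",
}

# Reverse lookup: class name -> four-letter code.
NAME_TO_CODE = {name: code for code, name in PANLEX_WORD_CLASSES.items()}


def word_class_code(heading: str) -> str:
    """Map a word-class heading to a four-letter code.

    Hypothetical helper: the heading is lowercased and its whitespace
    collapsed before lookup. Falling back to "misc" for anything not
    in the list is an assumption, not documented PanLex behavior.
    """
    key = " ".join(heading.lower().split())
    return NAME_TO_CODE.get(key, "misc")


if __name__ == "__main__":
    for heading in ("Noun", "Proper noun", "Numeral"):
        print(heading, "->", word_class_code(heading))
```

A real extractor would need a per-language table of the headings each Wiktionary actually uses (the de headings differ from the en ones), which this sketch does not attempt.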

For language identifiers, I have found it useful to combine ISO 639-2 collective codes with ISO 639-3 and ISO 639-5 codes, supplemented by differentiators for varieties that lexicographic resources distinguish (http://panlex.org/u). (Safari 5.1 opens pages like this very slowly.)


