[open-linguistics] [ANN] Wiktionary RDF-extraction with DBpedia for en and de

Jonas Brekle jonas.brekle at gmail.com
Wed Dec 21 14:13:04 UTC 2011


Dear all,

Over the past months, Sebastian and I have worked on a DBpedia-based extractor for
Wiktionary. The main goal was to make it so configurable
that applying it to the different language versions of Wiktionary is
just a matter of configuration, not programming. The
configuration should be doable by someone who has a good
understanding of the wiki syntax (and currently of XML, but we plan to hide
that too, via a web-based frontend similar to the mappings wiki) but who knows no
Scala or RDF.

We now have configurations and dumps for the English and German Wiktionary, to show the state of our development and
to initiate a discussion about design and implementation. If you are not
interested in the technical details, you can skip the detailed
description below and just evaluate the dumps.
The English dump contains 16M triples and took 9 days on my dual-core 2 GHz laptop; that is roughly 4 articles per second.
The German dump contains 1.3M triples and took 3 hours (German has only 300k articles, whereas English has 3M, and the German config is smaller). We know that there is some "noise" in the data (incorrectly parsed content); we will fix that over the next weeks.
The source code is in the "wiktionary" branch of the DBpedia SVN repository; the dump files can be downloaded here:
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2

The idea is somewhat different from DBpedia (although we use its framework):
instead of infoboxes and very specific extractors, we tried to build a meta-extractor that is declarative rather than imperative.
The rationale is that although some scrapers exist, none of them can parse more than 3-5 languages.
So we encode the language-specific characteristics of each Wiktionary in a machine-readable format (e.g. the "config-de.xml").
Top-down, these characteristics are:

* the Entry Layout (EL) 
  e.g. in the German Wiktionary, a given page has the structure: Lexical
Entity -> languages it occurs in -> part of speech it is used as ->
different senses/meanings -> properties like synonyms, example sentences,
pronunciation, translations, etc.
In the English Wiktionary there is an etymology section after the language.
This structure implies the schema of our extracted ontology (an ER diagram).
More about the EL: http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained
We configure the EL with nested XML nodes and define which URIs are to be used (a small sketch of the resulting resource URIs follows below).
The EL differs greatly between Wiktionaries.
The question is: how do we ensure a common schema?
Currently we leave that open - the schema of the resulting RDF is
implicitly inferred from the page structure. Either we come up with a good idea on
how to transform it automatically (how to configure it so it
auto-transforms), or we leave that merging step to specialized tools. Which schema should be the global one? Lemon?
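Just to make the EL configuration a bit more tangible: the nested sections essentially form a path (page title -> language -> part of speech) from which the resource URIs in the example further down, like wiktionary:green-english-adjective, can be derived. The following is only a minimal sketch of that idea in Scala; the namespace and the exact URI scheme are assumptions for illustration, not the extractor's actual code:

  object ElUriSketch {
    // assumed namespace, for illustration only
    val ns = "http://wiktionary.dbpedia.org/resource/"

    // build a resource URI from the page title and the path of nested EL sections
    def buildUri(pageTitle: String, sectionPath: List[String]): String =
      ns + (pageTitle :: sectionPath).map(_.toLowerCase.replace(' ', '_')).mkString("-")

    def main(args: Array[String]): Unit = {
      // e.g. the page "green", section "English", subsection "Adjective"
      println(buildUri("green", List("English", "Adjective")))
      // -> http://wiktionary.dbpedia.org/resource/green-english-adjective
    }
  }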

* Context defining markers (CDM)
  A big word for a simple thing: it just means a label occurring within the EL. A
made-up example:
  == green (Englisch) ==
  === Adjektiv ===
  [1] eine Farbe
  [2] umweltfreundlich
In this wiki snippet there are two CDMs: "Englisch" and "Adjektiv". They
indicate that the following section is about an English word whose part
of speech is adjective. Obviously we need a mapping from these markers to a
shared vocabulary. Such a mapping is simple and is part of the configuration.
But a nice thing to have would be an ontology backing this vocabulary:
some ontology for parts of speech (GOLD?) and for languages (ISO 639-3?) - we
should discuss what to use there. A small sketch of what such a mapping could look like is given below.
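To make the mapping idea concrete, here is a minimal Scala sketch of what such a CDM mapping could look like once the labels are resolved to shared vocabularies. The target URIs used here (lexvo.org for ISO 639-3 language codes, GOLD for parts of speech) are just assumptions for this illustration - exactly the point that is open for discussion:

  object CdmMappingSketch {
    // assumed target vocabularies, for illustration only
    val languages = Map(
      "Englisch" -> "http://lexvo.org/id/iso639-3/eng",
      "Deutsch"  -> "http://lexvo.org/id/iso639-3/deu")

    val partsOfSpeech = Map(
      "Adjektiv"   -> "http://purl.org/linguistics/gold/Adjective",
      "Substantiv" -> "http://purl.org/linguistics/gold/Noun")

    // resolve a CDM label found in the EL to a shared-vocabulary URI, if known
    def resolve(label: String): Option[String] =
      languages.get(label) orElse partsOfSpeech.get(label)
  }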

* wiki templates
  Now we come to the core of the extraction. We built an engine that can
match a given Wiktionary page against several "extraction templates" (ETs). An ET is wiki syntax, but it can contain
placeholders/variables and control symbols that indicate that parts may
repeat (like the regex "(ab)*" matches "ababab"). The engine
then fills the placeholders with information scraped from the page (in
other words, it binds the variables). The configuration declares
what to do with the bound variables; often that is "use it as the literal
object of predicate x", but we envision more complex transformations there,
like "format it into a URI by ..." or "call a static method y".
An example: we take the wiki snippet from above (green) as the page and
define an ET like this:
  <template name="definitions">
    <vars>
      <var name="definition" property="rdfs:comment" />
    </vars>
    <wikiSyntax>([$id] $definition
      )*
    </wikiSyntax>
  </template>
The syntax looks like regular expressions, but we only allow ()*, ()+ and ()?.
Then there are the variables/placeholders: the extractor determines what stands in their place on the actual page.
The engine finds a set of bindings:
  definition -> "eine Farbe"
  definition -> "umweltfreundlich"
and then generates triples according to the config:
  wiktionary:green-english-adjective rdfs:comment "eine Farbe" .
  wiktionary:green-english-adjective rdfs:comment "umweltfreundlich" .

The properties used (rdfs:comment is just a made-up example) and the
namespaces are open to discussion.
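To make the matching idea a bit more tangible, here is a rough, self-contained Scala sketch of what the engine conceptually does for the "definitions" ET above: the repeated part "[$id] $definition" is treated like a regular expression, every match yields one binding of $definition, and every binding produces one triple. This only illustrates the principle; it is not the actual engine code, and the subject URI is taken from the example above:

  import scala.util.matching.Regex

  object EtMatchSketch {
    // the repeated part of the ET, "[$id] $definition", expressed as a regex
    val line: Regex = """\[(\d+)\]\s+(.+)""".r

    def main(args: Array[String]): Unit = {
      val page = "[1] eine Farbe\n[2] umweltfreundlich"
      val subject = "wiktionary:green-english-adjective"

      // every match of the repeated group binds $definition once,
      // and every binding is turned into one triple
      for (m <- line.findAllMatchIn(page)) {
        val definition = m.group(2)
        println(subject + " rdfs:comment \"" + definition + "\" .")
      }
    }
  }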

Our prototype recognizes the EL and thus provides information about the languages and PoS usages of all
words in a Wiktionary, and it has ETs for definitions, hyphenation and example sentences.

The next steps will be either extending it to more languages or first going deeper within the German and English Wiktionary: extracting synonyms (to obtain a community-based wordnet) and translations.

So what do you think? What are important things to keep in mind? Wishes,
comments, etc. are all welcome.

Regards,
Jonas









