[openbiblio-dev] Fwd: Cataloguing Bibliographic Data with Natural Language and RDF

Mon Aug 9 18:54:03 UTC 2010

William
We do a lot of NLP here, so I am happy to give advice.

On Mon, Aug 9, 2010 at 1:21 PM, William Waites <william.waites at okfn.org> wrote:

>
> -------- Original Message --------
> Subject: Cataloguing Bibliographic Data with Natural Language and RDF
> Date: Mon, 09 Aug 2010 13:20:23 +0100
>
> In the grand tradition of W3C IRC bots, I've started some speculative work
> on a robot that tries to understand natural language descriptions of works
> and their authors and generates RDF. It is written in Python and uses ORDF <http://ordf.org/>,
> the NLTK <http://www.nltk.org/> and FuXi <http://code.google.com/p/fuxi>.
>
> Before going into implementation details, here's an example of a session:
>
> 12:41 < ww> biblio forget
> 12:41 < biblio> ww: ok
> 12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
> 12:42 < ww> He was born on December 11th 1918
> 12:42 < ww> He died on August 3rd 2008
> 12:42 < ww> He wrote TFC in 1968
> 12:42 < ww> TFC's title is "The First Circle"
> 12:42 < ww> "YMCA"'s name is "YMCA Press"
> 12:42 < ww> They published TFC in 1978
> 12:42 < ww> biblio think
> 12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
> 12:42 < ww> biblio paste
> 12:42 < biblio> ww: http://pastebin.ca/1913826
>
>
You don't say who generated this. The breadth of language in your discourse
is the key problem. Is this created by YOU, by a set of people you are working
with, or by others? If it's under your control you could try to steer the
community towards a "toy" language; this helps a lot.

> The natural language parsing is somewhat simplistic: the kinds of
> grammatical constructions it can understand are limited (but growing), the
> resolution of pronouns (e.g. he, they) only looks at the previous named
> subject, and it will get confused if there is more than one pronoun referring
> to a different thing in the same sentence. All of these things can be
> improved.
>

This is known as anaphora:
http://en.wikipedia.org/wiki/Anaphora_%28linguistics%29
If it's under your control then try to make the discourse avoid pronouns. If
not, you will struggle with:
"The dog ate part of the book. It is difficult to read it"
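
The failure can be made concrete with a minimal sketch of "previous named
subject" resolution. This is not the bot's code: `resolve_pronouns`, the
pronoun list and the (subject, tokens) input shape are all hypothetical, and
the sentences are assumed to be tokenised with their subject already known.

```python
PRONOUNS = {"he", "she", "it", "they"}  # assumed, deliberately tiny list

def resolve_pronouns(sentences):
    """Replace every pronoun with the most recent named subject.

    `sentences` is a list of (subject, tokens) pairs, where subject is
    None when the sentence has no named subject of its own.
    """
    last_subject = None
    resolved = []
    for subject, tokens in sentences:
        resolved.append([
            last_subject if (tok.lower() in PRONOUNS and last_subject) else tok
            for tok in tokens
        ])
        # Only a named subject updates what later pronouns refer to.
        if subject is not None:
            last_subject = subject
    return resolved

discourse = [
    ("dog", ["The", "dog", "ate", "part", "of", "the", "book"]),
    (None, ["It", "is", "difficult", "to", "read", "it"]),
]
result = resolve_pronouns(discourse)
```

Here both occurrences of "it" in the second sentence are replaced by "dog",
although neither refers to the dog: the first is a dummy subject and the
second means the book.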

>
> Broadly, the process follows these steps:
>
>    - (NLTK) Tokenise the sentence and classify for parts of speech
>    - Create references for named entities (capitalised words, URIs and
>    phrases enclosed in double quotes)
>    - (NLTK) Create a lexicon, the part of a grammar that grounds it to
>    individual words, and append it to the canned grammar that describes the
>    structure of sentences. This is a feature grammar, not a context-free grammar.
>    - (NLTK) Parse the input sentences creating a syntax tree with the root
>    at the main verb in the sentence
>    - The syntax tree is annotated with the logical structure of the
>    sentence (see Analysing the meaning of sentences<http://nltk.googlecode.com/svn/trunk/doc/book/ch10.html>).
>    This logical representation is cunningly constructed so as to also be
>    runnable Python code (with eval<http://docs.python.org/library/functions.html#eval>).
>    Running it transforms the syntax tree into an RDF representation.
>    - (FuXi) the "biblio think" command causes the RDF of the current
>    session to be run through a number of inference rules that encode higher
>    level meaning, for example that if "X wrote Y" then X must be a person, Y
>    must be a work and X must have contributed to Y.
>
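
The second step above (references for capitalised words, URIs and quoted
phrases) can be sketched with the standard `re` module alone. This is only an
illustration, not the bot's implementation: the pattern, the group names and
`find_entities` are assumptions, and NLTK is not used here.

```python
import re

# Alternatives are tried in order, so a quoted phrase or URI wins over
# the capitalised words it may contain.
ENTITY_RE = re.compile(
    r'"(?P<quoted>[^"]+)"'             # phrase enclosed in double quotes
    r'|(?P<uri>https?://\S+)'          # bare http(s) URI
    r'|(?P<cap>\b[A-Z][A-Za-z]*\b)'    # capitalised word
)

def find_entities(sentence):
    """Return (kind, text) pairs for each candidate named entity."""
    entities = []
    for match in ENTITY_RE.finditer(sentence):
        kind = match.lastgroup          # name of the alternative that matched
        entities.append((kind, match.group(kind)))
    return entities

title_entities = find_entities('TFC\'s title is "The First Circle"')
wrote_entities = find_entities("He wrote TFC in 1968")
```

Note that the second call also reports "He" as a capitalised word, so a
sketch like this needs the part-of-speech information from the first step to
filter out sentence-initial pronouns.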
> The neat bit is really the way it generates RDF, translating a logical
> structure that looks like,
>
>   statement(
>     predicate(
>       bnode(
>         rdf_type(umbel("Verb")),
>         label("is"),
>         racine("be"),
>         tense(nlp("Present"))
>       ),
>       named("aHLIkuXm14335") # "The First Circle"
>     ),
>     posessive(
>       bnode(
>         rdf_type(umbel("Noun")),
>         label("title"),
>         racine("title")
>       ),
>       named("aHLIkuXm14333") # "TFC"
>     )
>   )
>

Another difficult topic is negation:
"we found it impossible to locate this book"

Unless the sentence is very deeply parsed the negation is missed.

> and the constituent parts bubble up and return an RDF Graph that looks like
> this:
>
>  entity:aHLIkuXm14333 a nlp:NamedEntity;
>      rdfs:label "TFC".
>
>   entity:aHLIkuXm14335 a nlp:NamedEntity;
>      rdfs:label "The First Circle".
>
>   [ a umbel:Verb;
>      rdfs:label "is";
>      lvo:nearlySameAs lve:be;
>      nlp:directObject entity:aHLIkuXm14335;
>      nlp:subject [ a umbel:Noun;
>                    rdfs:label "title";
>                    lvo:nearlySameAs lve:title;
>                    nlp:owner entity:aHLIkuXm14333];
>      nlp:tense nlp:Present].
>
>
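
One way an evaluable expression like the one above could "bubble up" into a
graph is sketched below. It is a guess at the technique, not the sembot
source: every helper body is hypothetical, triples are plain tuples in a set
rather than an ORDF/rdflib graph, and the entity labels are left out. The
function names (including the spelling `posessive`) follow the quoted
structure.

```python
import itertools

_ids = itertools.count(1)  # source of blank node identifiers

def umbel(name): return "umbel:" + name
def nlp(name): return "nlp:" + name

# Leaf helpers return a (property, value) pair for the enclosing bnode.
def rdf_type(obj): return ("rdf:type", obj)
def label(text): return ("rdfs:label", text)
def racine(stem): return ("lvo:nearlySameAs", "lve:" + stem)
def tense(obj): return ("nlp:tense", obj)

def bnode(*props):
    """Create a blank node; return (node, triples describing it)."""
    node = "_:b%d" % next(_ids)
    return node, {(node, p, o) for p, o in props}

def named(ident):
    return "entity:" + ident, set()

def _link(head, prop, tail):
    """Merge two partial graphs and add one triple joining their roots."""
    (h, h_triples), (t, t_triples) = head, tail
    return h, h_triples | t_triples | {(h, prop, t)}

def predicate(verb, obj): return _link(verb, "nlp:directObject", obj)
def posessive(noun, owner): return _link(noun, "nlp:owner", owner)
def statement(pred, subj): return _link(pred, "nlp:subject", subj)[1]

expr = '''statement(
  predicate(bnode(rdf_type(umbel("Verb")), label("is"), racine("be"),
                  tense(nlp("Present"))),
            named("aHLIkuXm14335")),
  posessive(bnode(rdf_type(umbel("Noun")), label("title"), racine("title")),
            named("aHLIkuXm14333")))'''

graph = eval(expr)  # only safe because expr is generated, never raw user input
```

Each inner call returns its own node plus the triples beneath it, so the
outermost `statement` ends up holding the whole graph.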
> And this sort of structure is the basis for the reasoning step. Provenance
> information, using OPMV <http://open-biomed.sourceforge.net/opmv/ns.html>,
> is also kept, pointing back to the original IRC message that was parsed so
> the entire process should be repeatable.
>
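
The "X wrote Y" rule from the reasoning step can be illustrated as a single
pass over a set of triples. This is a plain-Python stand-in, not FuXi, and
the `ex:` property and class names are placeholders rather than the
vocabulary the bot really uses.

```python
def apply_wrote_rule(triples):
    """If X wrote Y: X is a person, Y is a work, X contributed to Y."""
    inferred = set(triples)
    for subj, prop, obj in triples:
        if prop == "ex:wrote":
            inferred |= {
                (subj, "rdf:type", "ex:Person"),
                (obj, "rdf:type", "ex:Work"),
                (subj, "ex:contributedTo", obj),
            }
    return inferred

facts = {("ex:Solzhenitsyn", "ex:wrote", "ex:TFC")}
closure = apply_wrote_rule(facts)
```

A rule engine would repeat this until no new triples appear; one pass is
enough here because the rule's conclusions never trigger the rule again.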
> I suppose since IRC is not necessarily the most accessible of media --
> though I can't really see why -- the same engine could be easily glued to a
> web server with a simple chat-like interface. Perhaps this is easier or more
> natural than web forms. Perhaps not. More research is needed.
>
> In any case, I'm working on improving the natural language parsing and the
> inference rules as time permits so hopefully the robot will become more and
> more clever.
>

NLP is ALWAYS more difficult than you think. I jumped in about 8 years ago
and thought regexes would solve chemistry. We now use various POS taggers
and other tools.

If you are just sticking to Named Entities it can be a lot easier.

>
> Source code for the IRC bot is available at:
> http://bitbucket.org/ww/sembot
>
> You can play with a live version of the bot by joining irc://irc.oftc.net/
> and joining #okfn or engaging in a private chat with
> *biblio*. It understands the command "sembot help" and I'll try not to
> break it too badly while anyone's playing with it.
>

This is not saying that it's not a good idea. But you have to understand
your domain of discourse, and you have to have a clear idea of what precision
and what recall is acceptable. Unless the language is highly stylised, 60%
would be a good start. There is always a long tail and you can never
eliminate it.
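
Precision and recall for extracted entities can be measured against a
hand-labelled set along these lines. The function name and the example sets
are made up for illustration; they are not results from the bot.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Illustrative data only: what a human marked versus what a tagger found.
gold = {"Solzhenitsyn", "TFC", "YMCA Press"}
predicted = {"Solzhenitsyn", "TFC", "He", "They"}
p, r = precision_recall(predicted, gold)
```

In this made-up case half of the predictions are right (precision 0.5) and
two of the three real entities are found (recall 2/3).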

>
>
-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069