[openbiblio-dev] Fwd: Cataloguing Bibliographic Data with Natural Language and RDF
Peter Murray-Rust
pm286 at cam.ac.uk
Mon Aug 9 18:54:03 UTC 2010
William
We do a lot of NLP here so am happy to give advice.
On Mon, Aug 9, 2010 at 1:21 PM, William Waites <william.waites at okfn.org> wrote:
>
> -------- Original Message --------
> Subject: Cataloguing Bibliographic Data with Natural Language and RDF
> Date: Mon, 09 Aug 2010 13:20:23 +0100
>
> In the grand tradition of W3C IRC bots, I've started some speculative work
> on a robot that tries to understand natural language descriptions of works
> and their authors and generates RDF. It is written in Python and uses ORDF <http://ordf.org/>,
> the NLTK <http://www.nltk.org/> and FuXi <http://code.google.com/p/fuxi>.
>
> Before going into implementation details, here's an example of a session:
>
> 12:41 < ww> biblio forget
> 12:41 < biblio> ww: ok
> 12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
> 12:42 < ww> He was born on December 11th 1918
> 12:42 < ww> He died on August 3rd 2008
> 12:42 < ww> He wrote TFC in 1968
> 12:42 < ww> TFC's title is "The First Circle"
> 12:42 < ww> "YMCA"'s name is "YMCA Press"
> 12:42 < ww> They published TFC in 1978
> 12:42 < ww> biblio think
> 12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
> 12:42 < ww> biblio paste
> 12:42 < biblio> ww: http://pastebin.ca/1913826
>
>
You don't say who generated this. The breadth of language in your discourse
is the key problem. Is this created by YOU, by a set of people you are working
with, or by others? If it's under your control you could try to steer the
community towards a "toy" language - this helps a lot.
> The natural language parsing is somewhat simplistic, the kinds of
> grammatical constructions it can understand are limited (but growing), the
> resolution of pronouns (e.g. he, they) only looks at the previous named
> subject and it will get confused if there is more than one pronoun referring
> to a different thing in the same sentence but all of these things can be
> improved.
>
This is known as anaphora
http://en.wikipedia.org/wiki/Anaphora_%28linguistics%29
If it's under your control then try to make the discourse avoid pronouns. If
not, you will struggle with sentences like:
"The dog ate part of the book. It is difficult to read it"
>
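To make the failure concrete, here is a minimal sketch of the "previous named subject" strategy. It is not the bot's code: the function name, the entity set and the rule that the first known entity in a sentence is its subject are all assumptions made for illustration.

```python
PRONOUNS = {"he", "she", "it", "they"}


def resolve_pronouns(sentences, entities):
    """Replace every pronoun with the subject of the most recent earlier sentence.

    `entities` is a hypothetical set of already-recognised names.
    """
    last_subject = None
    resolved = []
    for sentence in sentences:
        words = sentence.split()
        # Crude assumption: the first known entity in a sentence is its subject.
        subject = next((w for w in words if w in entities), None)
        out = [
            last_subject if w.lower() in PRONOUNS and last_subject else w
            for w in words
        ]
        resolved.append(" ".join(out))
        if subject is not None:
            last_subject = subject
    return resolved


good = resolve_pronouns(
    ["Solzhenitsyn wrote TFC", "He died in 2008"],
    {"Solzhenitsyn", "TFC"},
)
bad = resolve_pronouns(
    ["The dog ate part of the book", "It is difficult to read it"],
    {"dog", "book"},
)
print(good[1])  # Solzhenitsyn died in 2008
print(bad[1])   # dog is difficult to read dog
```

The first case works because there is only one candidate. In the second, both pronouns are mapped to "dog", which is the kind of confusion to expect once a sentence holds pronouns that refer to different things.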
> Broadly, the process follows the following steps:
>
> - (NLTK) Tokenise the sentence and classify for parts of speech
> - Create references for named entities (capitalised words, URIs and
> phrases enclosed in double quotes)
> - (NLTK) Create a lexicon, the part of a grammar that grounds it to
> individual words and append it to the canned grammar that describes the
> structure of sentences. This is a feature grammar not a context-free grammar
> - (NLTK) Parse the input sentences creating a syntax tree with the root
> at the main verb in the sentence
> - The syntax tree is annotated with the logical structure of the
> sentence (see Analysing the meaning of sentences <http://nltk.googlecode.com/svn/trunk/doc/book/ch10.html>).
> This logical representation is cunningly constructed so as to also be
> runnable Python code (with eval <http://docs.python.org/library/functions.html#eval>).
> Running it transforms the syntax tree into an RDF representation.
> - (FuXi) the "biblio think" command causes the RDF of the current
> session to be run through a number of inference rules that encode higher
> level meaning, for example that if "X wrote Y" then X must be a person, Y
> must be a work and X must have contributed to Y.
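The named-entity step in the list above (capitalised words, URIs and phrases in double quotes) can be sketched with a single regular expression. This is only an illustration using the standard library: the pattern, the function name and the "e1", "e2" style references are assumptions, and the bot's real identifiers clearly look different.

```python
import re

# Alternatives are tried in order: quoted phrase, then URI, then capitalised word.
ENTITY_RE = re.compile(r'"([^"]+)"|(https?://\S+)|\b([A-Z][A-Za-z]*)\b')


def find_entities(sentence):
    """Map each entity label found in `sentence` to a made-up reference."""
    refs = {}
    for match in ENTITY_RE.finditer(sentence):
        label = match.group(1) or match.group(2) or match.group(3)
        if label not in refs:
            refs[label] = "e%d" % (len(refs) + 1)
    return refs


print(find_entities('TFC\'s title is "The First Circle"'))
# {'TFC': 'e1', 'The First Circle': 'e2'}
```

A rule this simple also picks up sentence-initial words such as "He", so a real implementation would need to filter pronouns and ordinary capitalised words before creating references.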
>
> The neat bit is really the way it generates RDF, translating a logical
> structure that looks like,
>
> statement(
>     predicate(
>         bnode(
>             rdf_type(umbel("Verb")),
>             label("is"),
>             racine("be"),
>             tense(nlp("Present"))
>         ),
>         named("aHLIkuXm14335") # "The First Circle"
>     ),
>     posessive(
>         bnode(
>             rdf_type(umbel("Noun")),
>             label("title"),
>             racine("title")
>         ),
>         named("aHLIkuXm14333") # "TFC"
>     )
> )
>
Another difficult topic is negation:
"we found it impossible to locate this book"
Unless the sentence is very deeply parsed the negation is missed.
> and the constituent parts bubble up and return an RDF Graph that looks like
> this:
>
> entity:aHLIkuXm14333 a nlp:NamedEntity;
>     rdfs:label "TFC".
>
> entity:aHLIkuXm14335 a nlp:NamedEntity;
>     rdfs:label "The First Circle".
>
> [ a umbel:Verb;
>     rdfs:label "is";
>     lvo:nearlySameAs lve:be;
>     nlp:directObject entity:aHLIkuXm14335;
>     nlp:subject [ a umbel:Noun;
>         rdfs:label "title";
>         lvo:nearlySameAs lve:title;
>         nlp:owner entity:aHLIkuXm14333];
>     nlp:tense nlp:Present].
>
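The idea of a logical form that is also runnable Python can be shown with a cut-down sketch. It assumes plain tuples instead of an RDF library, keeps only label(), bnode(), named() and the three structural functions, and uses stand-in identifiers X1 and X2; none of this is the sembot source, and the helper _link() is invented here.

```python
import itertools

_ids = itertools.count(1)


def named(identifier):
    """A named entity: a node that contributes no triples of its own."""
    return ("entity:" + identifier, [])


def label(text):
    """A property: given a node, return the triples describing it."""
    return lambda node: [(node, "rdfs:label", text)]


def bnode(*properties):
    """A fresh blank node plus the triples produced by its properties."""
    node = "_:b%d" % next(_ids)
    return (node, [t for prop in properties for t in prop(node)])


def _link(head, relation, tail):
    """Merge two (node, triples) pairs and add a triple joining the nodes."""
    (head_node, head_triples), (tail_node, tail_triples) = head, tail
    joined = head_triples + tail_triples + [(head_node, relation, tail_node)]
    return (head_node, joined)


def predicate(verb, obj):
    return _link(verb, "nlp:directObject", obj)


def possessive(noun, owner):
    return _link(noun, "nlp:owner", owner)


def statement(pred, subject):
    # The top level returns only the accumulated triples: the "graph".
    return _link(pred, "nlp:subject", subject)[1]


VOCABULARY = {
    f.__name__: f
    for f in (named, label, bnode, predicate, possessive, statement)
}

LOGICAL_FORM = """
statement(
    predicate(bnode(label("is")), named("X2")),
    possessive(bnode(label("title")), named("X1"))
)
"""

# Evaluate with no builtins and only the vocabulary above in scope.
graph = eval(LOGICAL_FORM.strip(), {"__builtins__": {}}, VOCABULARY)
for triple in graph:
    print(triple)
```

Each inner call returns its node together with the triples collected so far, so the triples bubble up to statement(). Even with builtins removed, eval on text that arrives over IRC deserves caution.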
>
> And this sort of structure is the basis for the reasoning step. Provenance
> information, using OPMV <http://open-biomed.sourceforge.net/opmv/ns.html>,
> is also kept, pointing back to the original IRC message that was parsed so
> the entire process should be repeatable.
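As a rough picture of what the reasoning step does with the "X wrote Y" rule, here is a tiny forward-chaining loop over plain tuples. It does not use FuXi or its API; the property names ("wrote", "contributedTo") and the RULES table are assumptions chosen to mirror the example in the message.

```python
# Each rule: (trigger predicate, function from (x, y) to the implied triples).
RULES = [
    (
        "wrote",
        lambda x, y: [
            (x, "rdf:type", "Person"),
            (y, "rdf:type", "Work"),
            (x, "contributedTo", y),
        ],
    ),
]


def think(facts):
    """Apply the rules repeatedly until no new triples appear."""
    known = set(facts)
    while True:
        derived = set()
        for subject, predicate, obj in known:
            for trigger, conclude in RULES:
                if predicate == trigger:
                    derived.update(conclude(subject, obj))
        if derived <= known:
            return known
        known |= derived


result = think({("Solzhenitsyn", "wrote", "TFC")})
for triple in sorted(result):
    print(triple)
```

Starting from one stated fact the loop ends with four triples: the original plus the three implied ones. A real rule engine works on RDF graphs and far richer rules, but the fixed-point idea is the same.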
>
> I suppose since IRC is not necessarily the most accessible of media --
> though I can't really see why -- the same engine could be easily glued to a
> web server with a simple chat-like interface. Perhaps this is easier or more
> natural than web forms. Perhaps not. More research is needed.
>
> In any case, I'm working on improving the natural language parsing and the
> inference rules as time permits so hopefully the robot will become more and
> more clever.
>
NLP is ALWAYS more difficult than you think. I jumped in about 8 years ago
and thought regexes would solve chemistry. We now use various POS taggers
and other tools.
If you are just sticking to Named Entities it can be a lot easier.
>
> Source code for the IRC bot is available at:
> http://bitbucket.org/ww/sembot
>
> You can play with a live version of the bot by joining irc://irc.oftc.net/
> and joining #okfn or engaging in a private chat with
> *biblio*. It understands the command "sembot help" and I'll try not to
> break it too badly while anyone's playing with it.
>
This is not saying that it's not a good idea. But you have to understand
your domain of discourse, you have to have a clear idea of what precision
and what recall is acceptable. Unless the language is highly stylised 60%
would be a good start. There is always a long tail and you can never
eliminate it.
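For anyone who wants to measure this, precision and recall are easy to compute once a hand-checked set of expected facts exists. The sketch below assumes facts are hashable triples; the function name and the sample data are invented for illustration and are not results from the bot.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of facts."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: 5 facts in the hand-checked set, 4 extracted, 3 of them right.
expected = {("S", "wrote", "TFC"), ("S", "born", "1918"), ("S", "died", "2008"),
            ("TFC", "title", "The First Circle"), ("YMCA", "published", "TFC")}
extracted = {("S", "wrote", "TFC"), ("S", "born", "1918"), ("S", "died", "2008"),
             ("TFC", "published", "YMCA")}

precision, recall = precision_recall(extracted, expected)
print(precision, recall)  # 0.75 0.6
```

Precision is the share of extracted facts that are right; recall is the share of the expected facts that were found. The long tail shows up as the expected facts that never get extracted.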
>
>
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069