[openbiblio-dev] Fwd: Cataloguing Bibliographic Data with Natural Language and RDF

Peter Murray-Rust pm286 at cam.ac.uk
Mon Aug 9 22:41:03 UTC 2010

I won't distract the list but some final comments.

If bibliographica or #jiscopenbib get to parse abstracts then NLP can be
very important. I would much rather have machines annotate abstracts and
use them for Machine Learning than require them to be formally catalogued by
humans. I think machines will be far more cost-effective at assigning
domain classifications. We shall still need humans to assign provenance and
a few other terms.

On Mon, Aug 9, 2010 at 10:34 PM, William Waites <william.waites at okfn.org> wrote:

> On 10-08-09 19:54, Peter Murray-Rust wrote:
>
> Today I was extending the
> grammar and I had a nice surprise. The grammar itself knows nothing about
> the subject domain, it just knows about a very small subset of the English
> language.

In restricted domains you can go a long way with POS tagging. It's much
easier to parse chemistry than human discourse.
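To make the point about restricted domains concrete, here is a minimal sketch of a lookup tagger. It is not our actual tagger: the lexicon, the example sentence and the CHEM and UNIT tags are all made up for illustration, and the other tags only loosely follow Penn Treebank names. It works at all only because, in a narrow sublanguage, most words carry a single tag.

```python
import re

# Hypothetical toy lexicon for a chemistry-like sublanguage (illustrative only).
# In a restricted domain most words have exactly one plausible tag.
LEXICON = {
    "the": "DT", "a": "DT",
    "was": "VBD", "added": "VBN", "stirred": "VBN", "heated": "VBN",
    "to": "IN", "in": "IN", "at": "IN", "for": "IN",
    "solution": "NN", "mixture": "NN",
    "ethanol": "CHEM", "toluene": "CHEM",   # CHEM is an invented tag
    "ml": "UNIT", "h": "UNIT", "g": "UNIT",  # UNIT is an invented tag
}

NUMBER = re.compile(r"^\d+(\.\d+)?$")


def tag(sentence):
    """Tag each whitespace-separated token by dictionary lookup."""
    tagged = []
    for token in sentence.split():
        word = token.lower().strip(".,;")
        if NUMBER.match(word):
            tagged.append((word, "CD"))
        else:
            # Unknown words are flagged rather than guessed.
            tagged.append((word, LEXICON.get(word, "UNK")))
    return tagged


print(tag("The mixture was stirred for 2 h."))
```

The same approach falls over on general English, where a word like "flies" can be a noun or a verb and lookup alone cannot decide.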

> There was a sentence, I forget what it was exactly but it had the same shape as
> the proverbial koala that eats shoots and leaves (at least I think it was a
> koala).

> The sentence resulted in two alternative syntax trees, as it should. But
> the inference rules only fired on one of them and the resulting inferred
> statements were correct. The thing is, the inference rules encode
> information about the subject domain. And it's the inferred meaning that we
> are interested in, not the syntax trees, so this was a very nice surprise.

POS tagging does not create syntax trees. You need a rules-based engine. In
chemistry there is so little ambiguity that we build our own. For more natural
language you have to create a TreeBank of possible parses and get humans to
annotate them:

... Time flies like an arrow
... Fruit flies like a banana

The second requires a semantic grammar that recognises "fruit fly" and knows
it eats fruit and that a banana is a fruit.
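A rough sketch of why syntax alone is not enough: the toy chart (CYK) parser below gives two trees for each of the two sentences, and a crude lexical check then keeps one. Everything here is an assumption made for illustration. The grammar, the lexicon, the KNOWN_COMPOUNDS set and the function names are invented, and the compound check is only a stand-in for a real semantic grammar (it knows that "fruit flies" is a thing, but nothing about what they eat).

```python
from collections import defaultdict

# Toy lexicon: each word lists every category it may take (illustrative only).
LEXICON = {
    "time": {"N", "NP"}, "fruit": {"N", "NP"},
    "flies": {"N", "V"}, "like": {"V", "P"},
    "a": {"Det"}, "an": {"Det"},
    "arrow": {"N"}, "banana": {"N"},
}

# Binary rules (parent, left child, right child); purely syntactic.
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),    # noun compound, e.g. "fruit flies"
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical domain knowledge: noun compounds known to be real things.
KNOWN_COMPOUNDS = {("fruit", "flies")}


def parses(sentence):
    """Return every S tree for the sentence, using a CYK-style chart."""
    words = sentence.lower().split()
    n = len(words)
    chart = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(words):
        for label in LEXICON[word]:
            chart[i, i + 1][label].append((label, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    for lt in chart[i, k].get(left, []):
                        for rt in chart[k, j].get(right, []):
                            chart[i, j][parent].append((parent, lt, rt))
    return chart[0, n].get("S", [])


def compounds(tree):
    """Collect the (noun, noun) pairs that a tree treats as compounds."""
    if len(tree) == 2:  # leaf: (label, word)
        return set()
    label, left, right = tree
    found = compounds(left) | compounds(right)
    if label == "NP" and left[0] == "N" and right[0] == "N":
        found.add((left[1], right[1]))
    return found


def preferred(sentence):
    """Keep only trees whose compounds are exactly the known ones present."""
    words = sentence.lower().split()
    expected = {p for p in zip(words, words[1:]) if p in KNOWN_COMPOUNDS}
    return [t for t in parses(sentence) if compounds(t) == expected]


for s in ("Time flies like an arrow", "Fruit flies like a banana"):
    print(s, "->", len(parses(s)), "syntactic parses,",
          len(preferred(s)), "after the compound check")
```

With these toy rules the surviving tree for the first sentence has "time" as the subject and "like an arrow" as a prepositional phrase, while for the second it has "fruit flies" as the subject and "like" as the verb.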

> But yes, you are quite right. I am under no illusions that the bot is going
> to be able to extract much information out of free-form text. Running the
> text of this email through it would undoubtedly result in a success rate at
> parsing of far less than 60%, never mind drawing inferences from it. It's
> simply intended as a tool to make it easier to create structured data in a
> more natural way -- by people intending to create structured data.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge