[openbiblio-dev] Fwd: Cataloguing Bibliographic Data with Natural Language and RDF

Mon Aug 9 21:34:34 UTC 2010

On 10-08-09 19:54, Peter Murray-Rust wrote:
> We do a lot of NLP here so am happy to give advice.

Thanks! I studied a little bit of this sort of thing at school aeons ago
but my recollections are fuzzy and vague at best so I'm learning all
this anew. I should mention that, at the moment at least, this is a
research project off the critical path of the main bibliographic data
stuff, but it is fun!

>     Before going into implementation details, here's an example of a
>     session:
>
>     [...]
>
>
> You don't say who generated this. The breadth of language in your
> discourse is the key problem. Is this created by YOU, a set of people
> you are working with or others? If it's under your control you could
> try to steer the community towards a "toy" language - this helps a lot.

I'm responsible for the input, but I have the robot listening to IRC
channels and whenever I see a construction that looks interesting and
achievable I try to extend the grammar to cover it. No illusions that it
will ever be a general coverage grammar though. If this were to be a
"sanctioned" interface to bibliographica, I'd imagine the input dialogue
page to contain a bunch of example statements that the machine could
understand and a YMMV caveat about straying too far.

> This is known as anaphora
> http://en.wikipedia.org/wiki/Anaphora_%28linguistics%29
> If it's under your control then try to make the discourse avoid
> pronouns. If not you will struggle with:
> "The dog ate part of the book. It is difficult to read it"

Yes. Even without the second "it" sembot will conclude that "the dog is
difficult to read"...
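
One way this kind of mistake can arise, sketched in Python. The assumption
here is that a pronoun in subject position is simply bound to the subject
of the preceding sentence; this is not necessarily how sembot works, and
the function name and the (subject, predicate) data shape are hypothetical:

```python
def resolve_naive(sentences):
    """Bind 'it' in subject position to the previous sentence's subject.

    `sentences` is a list of (subject, predicate) pairs. This is a
    deliberately naive strategy, shown only to illustrate the failure.
    """
    resolved = []
    last_subject = None
    for subject, predicate in sentences:
        if subject == "it" and last_subject is not None:
            # No check that the antecedent makes sense for the predicate.
            subject = last_subject
        else:
            last_subject = subject
        resolved.append((subject, predicate))
    return resolved


discourse = [
    ("the dog", "ate part of the book"),
    ("it", "is difficult to read"),
]

for statement in resolve_naive(discourse):
    print(statement)
```

The second statement comes out with "the dog" as its subject, because
nothing in the resolver knows that books are read and dogs are not.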

> Another difficult topic is negation:
> "we found it impossible to locate this book"

I wonder how deeply wordnet goes into antonyms. A first pass might be to
check whether a word has only one synset and, if so, have an inference
rule generate the negations from its antonyms.
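
A minimal sketch of that first pass, in Python. The LEXICON table is a
hand-made stand-in for WordNet (its entries are illustrative, not taken
from WordNet itself), and both function names are hypothetical:

```python
# Toy stand-in for WordNet: word -> list of synsets, each synset carrying
# the antonyms of that particular sense. Entries are made up.
LEXICON = {
    "possible": [{"gloss": "capable of happening", "antonyms": ["impossible"]}],
    "impossible": [{"gloss": "not capable of happening", "antonyms": ["possible"]}],
    "light": [
        {"gloss": "of little weight", "antonyms": ["heavy"]},
        {"gloss": "bright", "antonyms": ["dark"]},
    ],
}


def unambiguous_antonyms(word, lexicon=LEXICON):
    """Return antonyms only when the word has exactly one synset."""
    synsets = lexicon.get(word, [])
    if len(synsets) != 1:
        return []  # unknown or ambiguous word: do not guess
    return list(synsets[0]["antonyms"])


def negation_statements(subject, word, lexicon=LEXICON):
    """Inference rule: 'S is W' yields 'S is not A' for each antonym A of W."""
    return [(subject, "is not", a) for a in unambiguous_antonyms(word, lexicon)]
```

With this table, negation_statements("the task", "impossible") yields a
single statement that the task is not possible, while "light" yields
nothing because it has two senses.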

In any event I haven't looked at resolving pronouns within a statement
at all yet, which is obviously necessary for cases like this.

> unless the sentence is very deeply parsed the negation is missed.

Yet harder is that what the speaker probably meant was, "we found it
very difficult to locate this book". I'm not sure where we get to when
we start trying to interpret what the speaker meant as opposed to what
they said.

> NLP is ALWAYS more difficult than you think. I jumped in about 8 years
> ago and thought regexes would solve chemistry. We now use various POS
> taggers and other tools.

The more I think about it the more amazing it is that humans can do this
at all -- even my two-year-old.

> If you are just sticking to Named Entities it can be a lot easier.

I guess for practical purposes the theory is that since RDF is about
statements, writing RDF should be about making statements and the
natural way for humans to make statements is to write them down. N3
tries to be marginally more natural language like than XML but expecting
scholars and librarians to annotate their bibliographies by writing N3
is a non-starter. At the same time the idea of a generic web form for
writing statements seems unnaturally circuitous so we're forced down the
road of hand-coding forms which is tedious both for the programmer and
the user and doesn't scale well...

> This is not saying that it's not a good idea. But you have to
> understand your domain of discourse, you have to have a clear idea of
> what precision and what recall is acceptable. Unless the language is
> highly stylised 60% would be a good start. There is always a long tail
> and you can never eliminate it

Today I was extending the grammar and I had a nice surprise. The grammar
itself knows nothing about the subject domain, it just knows about a
very small subset of the English language. There was a sentence, I
forget what it was exactly, but it had the same shape as the proverbial koala
that eats shoots and leaves (at least I think it was a koala). The
sentence resulted in two alternative syntax trees, as it should. But the
inference rules only fired on one of them and the resulting inferred
statements were correct. The thing is, the inference rules encode
information about the subject domain. And it's the inferred meaning that
we are interested in, not the syntax trees, so this was a very nice surprise.
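
To illustrate the effect, here is a rough Python sketch. Everything in it
is an assumption made for the example: the parses are flattened to
(subject, verb, object) clauses rather than real syntax trees, there is a
single hypothetical domain rule, and a parse only counts if every clause
in it matches some rule. It is not the bot's actual grammar or rule set.

```python
# Two candidate readings of "the koala eats shoots and leaves".
PARSES = [
    # "leaves" read as a noun: the koala eats two things.
    [("koala", "eats", "shoots"), ("koala", "eats", "leaves")],
    # "leaves" read as a verb: the koala eats, then departs.
    [("koala", "eats", "shoots"), ("koala", "leaves", None)],
]

# Domain knowledge lives in the rules, not in the grammar.
EDIBLE = {"shoots", "leaves"}


def diet_rule(clause):
    """Hypothetical rule: 'X eats Y' with an edible Y yields (X, hasDiet, Y)."""
    subject, verb, obj = clause
    if verb == "eats" and obj in EDIBLE:
        return (subject, "hasDiet", obj)
    return None


RULES = [diet_rule]


def infer(parse, rules=RULES):
    """Return inferred statements, or [] if any clause matches no rule."""
    inferred = []
    for clause in parse:
        results = [s for s in (rule(clause) for rule in rules) if s is not None]
        if not results:
            return []  # the rules did not fire on this reading
        inferred.extend(results)
    return inferred


def interpret(parses, rules=RULES):
    """Keep only the inferences from parses on which the rules fired."""
    return [stmts for stmts in (infer(p, rules) for p in parses) if stmts]


print(interpret(PARSES))
```

Only the first reading produces statements. The second is discarded
without the grammar ever having to choose between the two.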

But yes, you are quite right. I am under no illusions that the bot is
going to be able to extract much information out of free-form text.
Running the text of this email through it would undoubtedly result in a
success rate at parsing of far less than 60%, never mind drawing
inferences from it. It's simply intended as a tool to make it easier to
create structured data in a more natural way -- by people intending to
create structured data.

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/
