[openbiblio-dev] Exposing as RDF...

Sat Jun 26 23:58:45 UTC 2010

On 10-06-26 17:00, Peter Murray-Rust wrote:
>
>     subject:  a657df1b-565f-4150-99e1-438d3acc22b5
>     predicate: isbd:title & responsibility area
>     object: "$aContre les valeurs bourgeoises $cpar Gilbert Ganne
>     &aPour les valeurs bourgeoises $cpar Georges Hourdin"@fr
>
>
> My opinion (and I am new to this) is that microsyntaxes like this
> cause downstream problems. We have to have a MARC parser as well as an
> RDF parser. And not all our records will be MARC.

Hi all, there has been some very good news on the list
recently! Congratulations on JISCOBIB! I'm just catching
up on email, as I'm travelling in Canada at the moment.

Before I left, I refactored the MARC parser and RDF
converter in openbiblio [1]. It is built on top of pymarc:
it takes a MARC record and returns an RDFLib Graph
that looks like the attached. There's also a paster command:
"paster load_marc config.ini [options] file.mrc"

I think the process should look something like this:

  step 1: transform MARC into a direct representation
  in RDF of the MARC data, as complete as possible (I
  believe our implementation of this is currently the
  most complete available)

  step 2: take the RDF/MARC and evolve conceptual
  entities like Work, Manifestation and Item (see previous
  threads on the list about dispensing with Expression).

  step 3: take the Work, Manifestation and Item and
  use various deduplication/matching heuristics to
  calculate a congruence closure expressed with owl:sameAs
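
As a rough illustration of step 3, here is a minimal sketch of
turning pairwise matches into owl:sameAs statements with a
union-find structure. It only computes the equivalence classes
(the symmetric/transitive part of the closure), it uses plain
tuples rather than an RDFLib Graph so that it runs without
dependencies, and the function names and example URIs are all
made up for the purpose of the example; none of this is taken
from the openbiblio code.

```python
# Hypothetical sketch: group identifiers into equivalence classes from
# pairwise matches, then emit owl:sameAs statements pointing each member
# at one representative per class. All names here are illustrative.

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def find(parent, x):
    """Return the representative of x, shortening the path as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the classes of a and b; the smaller string becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        lo, hi = sorted((ra, rb))
        parent[hi] = lo


def same_as_closure(matches):
    """Turn (a, b) match pairs into (member, owl:sameAs, representative)."""
    parent = {}
    for a, b in matches:
        union(parent, a, b)
    triples = []
    for node in list(parent):
        root = find(parent, node)
        if root != node:
            triples.append((node, OWL_SAMEAS, root))
    return sorted(triples)
```

The matching heuristics themselves would sit in front of this
and produce the (a, b) pairs; the sketch only shows how their
output could be collapsed into one cluster per entity.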

Throughout the whole process use the opmv vocabulary
so that it is possible to tell where the source data that
makes up a particular entity came from (semantic
provenance). The code uses the generic ordf [2]
library and so has changesets for recording low level
(syntactic) provenance.

I'm working on steps 2 and 3 over the next while, shall
keep the list informed.

I agree that having embedded structure within the object
as in the example isn't the best idea -- you lose the
expressivity of RDF this way and need to implement
custom parsing logic.
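
To make the point concrete, here is a minimal sketch of the
kind of custom parsing a consumer would otherwise need, and of
emitting one statement per subfield instead. It assumes the
"$<code><text>" layout from the quoted example; the predicate
naming scheme (base predicate plus "/" plus subfield code) and
the example.org URI are invented for illustration and are not
what the openbiblio converter does.

```python
import re

# "$" followed by a one-character subfield code, then the value up to
# the next "$" or the end of the string. Assumes values never contain
# a literal "$".
SUBFIELD = re.compile(r"\$([a-z0-9])([^$]*)")


def split_subfields(value):
    """Split a '$a... $c...' string into (code, text) pairs."""
    return [(code, text.strip()) for code, text in SUBFIELD.findall(value)]


def subfield_triples(subject, base_predicate, value, lang=None):
    """Emit one (subject, predicate, literal, lang) tuple per subfield.

    The predicate scheme here is hypothetical.
    """
    return [
        (subject, f"{base_predicate}/{code}", text, lang)
        for code, text in split_subfields(value)
    ]


if __name__ == "__main__":
    rows = subfield_triples(
        "urn:uuid:a657df1b-565f-4150-99e1-438d3acc22b5",
        "http://example.org/title",  # placeholder predicate base
        "$aContre les valeurs bourgeoises $cpar Gilbert Ganne",
        lang="fr",
    )
    for row in rows:
        print(row)
```

With the subfields split out at conversion time, an RDF
consumer can query the title and the responsibility statement
directly and never needs to know about the "$" convention.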

In particular I'm not convinced (assuming we are dealing
with MARC data) that embedding the actual MARC
format stuff in RDF is worthwhile. In the provenance
statements we record the source URI of a MARC file
and the record number within it. So any time someone
is interested in the actual source data they should be
able to simply retrieve it.
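
For instance, given the bytes of the source file and a record
number, pulling out the raw record is cheap. The sketch below
relies only on the fact that an ISO 2709 / MARC record starts
with its own total length as five ASCII digits; the function
name and the choice of 0-based record numbers are assumptions
made for the example, and fetching the file from its source
URI is left out.

```python
def nth_record(data, index):
    """Return the raw bytes of record `index` (0-based, an assumption)
    from the contents of a binary MARC file, or None if there are
    fewer records than that."""
    offset = 0
    for current in range(index + 1):
        head = data[offset:offset + 5]  # leader positions 0-4: record length
        if len(head) < 5:
            return None                 # ran off the end of the file
        length = int(head.decode("ascii"))
        if length < 5:
            raise ValueError("implausible record length: %d" % length)
        record = data[offset:offset + length]
        if len(record) < length:
            return None                 # truncated final record
        if current == index:
            return record
        offset += length
    return None
```

The returned bytes can then be handed to whatever MARC parser
the person asking prefers, which is the reason for not
duplicating the raw format inside the RDF.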

As always, input and suggestions in the form of comments
or patches are more than welcome!

Cheers,
-w

[1] http://knowledgeforge.net/pdw/openbiblio/file/openbiblio/lib/marc.py
[2] http://ordf.org/

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/openbiblio-dev/attachments/20100626/57acc3b9/attachment.html>