[openbiblio-dev] Virtuoso versus 4store

William Waites ww at styx.org
Fri May 13 13:33:05 UTC 2011


* [2011-05-13 11:39:04 +0100] Rufus Pollock <rufus.pollock at okfn.org> writes:

] I think it is :-) I think we will have big problems if we are not
] running off a relatively standard repo like Debian. It's already a bit
] of a nightmare doing dev and production stuff, and having to compile
] from source every time is too much I think ...

YMMV.

] What happens if we want to enrich the data or correct it (and get the
] wikipedia guys involved in this!). I don't really understand how you
] can do that in a separate backend -- you need it to be the same place
] don't you? (I *can* well understand how this separation works for
] collection stuff).

I don't think you need to. You can keep your enrichments local; they
just refer to external resources, and they are yours, and that's fine.
If you want your own corrected version, you import the record in
question and correct it; then it's yours. If you want the corrections
to flow upstream, then it's legwork to get the publishers to do that.

In any event, all of these are piecemeal operations on the data,
which is the point.

] One of the major things about open (biblio) data is that it would
] allow for 'crowd-participation' in data enrichment and cleaning and
] that implies merging that data back in somewhere central I think (or
] some pretty nasty distributed querying ...?)

As above, enrichment and correction are separate things.

What happens if you notice an error on a web page? Do you try to cache
a local copy of the entire Internet to get it corrected?  Same here.
Where it is impractical to get the sources corrected for whatever
reason, it might be worthwhile to have an errata service that contains
corrections.
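Where such an errata service makes sense, the mechanics are simple:
corrections live in their own dataset and are applied over the ground
data at read time. A minimal sketch in Python; the identifiers, the
triple layout, and the `with_errata` helper are all invented for
illustration, not taken from any real service:

```python
# Base data and errata are kept as separate sets of triples.
# All names and URIs here are made up for the example.

BASE = {
    ("urn:entry:42", "dc:title", "Paradise Lsot"),
    ("urn:entry:42", "dc:creator", "John Milton"),
}

ERRATA = [
    # (subject, predicate, wrong value, corrected value)
    ("urn:entry:42", "dc:title", "Paradise Lsot", "Paradise Lost"),
]

def with_errata(triples, errata):
    """Return the triples with each matching erratum applied."""
    fixes = {(s, p, old): new for s, p, old, new in errata}
    return {(s, p, fixes.get((s, p, o), o)) for s, p, o in triples}

corrected = with_errata(BASE, ERRATA)
```

The ground data never changes; consumers who want the corrections
apply the errata dataset, and everyone else reads the source as-is.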

Where enrichments are desired, there are many types of enrichments
that one would want to make. Most of them are not appropriate for
incorporation in the ground data. So you could have a variety of
datasets containing enrichments.

A good example of this is dbpedia. They distribute their links to
other things like Yago and Opencyc separately. These are enrichments
that you may or may not want to incorporate into the infobox data.

Dbpedia is a dataset of roughly the same order of magnitude as the BL
data (about twice as big). With our two major datasets we have about
three times the size of dbpedia. The dbpedia folks have more computing
resources than we do, and in any case only expose a read-only set of
reference data. Improvements and corrections wander the long way
around through Wikipedia and eventually into the next release.

] Take 4 very concrete but simple use cases:
] 
] * I want to correct a typo in a text title

Add this to the errata service, or maintain a local copy while
waiting for the upstream data to be fixed.

] * I want to link items in the catalogue to material available online
] (e.g. gutenberg text of that book, related wikipedia article). Do I do
] this somewhere completely separate (in which case do I need 2 virtuoso
] instances?)

Yes, completely separate. And it can be done however you want: you
might use appengine and keep these links in bigtable, or couchdb, or
whatever, as long as you are nice and expose them as RDF.
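To make that last point concrete: whatever the backend, exposing the
links as RDF can be as little as emitting N-Triples. A sketch with
invented URIs; the `LINKS` structure and the `as_ntriples` helper are
assumptions for illustration, not an actual API:

```python
# Links from catalogue entries to external resources, keyed by entry
# URI. These could equally live in bigtable, couchdb, or a flat file;
# the URIs below are made up.

LINKS = {
    "http://bibliographica.org/entry/1": [
        ("http://www.w3.org/2000/01/rdf-schema#seeAlso",
         "http://www.gutenberg.org/ebooks/20"),
    ],
}

def as_ntriples(links):
    """Serialize the link store as N-Triples, one triple per line."""
    lines = []
    for subject, pairs in links.items():
        for predicate, obj in pairs:
            lines.append("<%s> <%s> <%s> ." % (subject, predicate, obj))
    return "\n".join(lines)
```

Anything that can speak this format is a first-class citizen of the
linked data web, regardless of what sits behind it.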

] * I want to merge duplicates of authors or entries

So this is what something like sameAs.org is for. You could maintain
a crowdsourced service of similar character.
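Under the hood, merging duplicates amounts to computing the
transitive, symmetric closure of owl:sameAs assertions. A sketch
using union-find; the `SameAs` class and the identifiers are made up
for illustration and are not sameAs.org's actual implementation:

```python
# Union-find over co-reference assertions: asserting a sameAs link
# merges two bundles of identifiers into one.

class SameAs:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Representative identifier for x's bundle."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def assert_same(self, a, b):
        """Record that a and b denote the same thing."""
        self.parent[self.find(a)] = self.find(b)

    def bundle(self, x):
        """All identifiers known to co-refer with x."""
        root = self.find(x)
        return {y for y in self.parent if self.find(y) == root}

s = SameAs()
s.assert_same("urn:author:milton-1", "urn:author:milton-2")
s.assert_same("urn:author:milton-2", "urn:author:milton-3")
```

Crucially this is, again, a dataset about the ground data, maintained
and queried separately from it.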

] * I want to generate "Works" (to complement the 'manifestations' that
] are Entries)

This is a controversial enrichment that will not necessarily be
appropriate to incorporate into the source data, which is about
catalogue records (which is what libraries have). So this would be
another dataset.

] * I want to show public domain status of entries in the catalogue

You maintain a service that retrieves information from the relevant
place and does the calculation. You could display this when browsing
various catalogues themselves with a plugin or bookmarklet kind of
thing.
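For illustration, the core of the calculation such a service might
perform, under a life-plus-70-years assumption. The rule, the
function name, and the signature are all sketched for this example;
terms vary by jurisdiction and a real service would need per-country
rules and much better metadata:

```python
# Crude public-domain check: did the author die more than `term`
# years ago? One jurisdiction's rule only; not legal advice.

def public_domain(death_year, current_year=2011, term=70):
    """True/False where decidable, None when the death date is unknown."""
    if death_year is None:
        return None  # unknown death date: cannot decide
    return current_year - death_year > term
```

The service would run something like this against author records
fetched from the catalogue, and the plugin or bookmarklet would
overlay the result while browsing.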

] Yes and no. I understand that recently there have been some write
] issues that have e.g. taken down the sparql endpoint, but the data
] was made available in bulk very soon after we got it and was in
] virtuoso pretty quickly. Most of the dev work on the application has
] not blocked pure availability but has been the necessary work to
] build apps, develop usable json apis etc (in some sense orthogonal
] to pure availability).

I'm saying that, in my opinion, most of the post-bibliographica
development path, trying to turn things into a LAMP application, has
been mistaken. Note that I don't say it hasn't been useful; it has
been a learning experience.

Requiring developers to have commit access to our repository and to
be able to update our running service in order to do anything is too
much of a burden on us and them.

] 1. Use virtuoso (it worked and we don't need 4store).
] 2. Reload data (using http://bibliographica.org/ uri base) and index in solr
] 3. Disable sparql write permission and reopen sparql endpoint.
] Generally optimize for read.

Yes, more from momentum than any other reason.

] 1. Start working to port our existing stuff into a separate simple app
] that uses its own, separate, triple store or even just sql ...
]  * We can reuse most of our work here
] 
] [2. Look at bulk data analysis]

Agreed.

Cheers,
-w

-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45



