[openbiblio-dev] Virtuoso versus 4store

Fri May 13 11:36:52 UTC 2011

Just a few things that may be relevant. In terms of thinking about how we
build applications on top of linked data, I think this summary by Tom Heath
and Christian Bizer is helpful:
http://linkeddatabook.com/editions/1.0/#htoc84

The 'Discovery' (was RDTF) http://rdtf.mimas.ac.uk/ programme is all about
infrastructure and aggregation of resource descriptions, so it may well be
worth talking to the JISC/MIMAS people about this aspect.

I'm not sure what the criteria are for the choice of triplestore. Is there any
reason to be talking only about Virtuoso and 4store? If it's of any interest,
we've been using SwiftOWLIM for the Lucero project, although we are not dealing
with the volume of data that bibliographica is, and we have been experimenting
with BigOWLIM in the background (both are freely available, although the
latter is charged for 'real world' use; it is free for research, evaluation
and development purposes). One of the attractions was the support for Sesame,
which (we hope) will make our interactions with the triplestore more
portable should we change stores in the future.
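
To illustrate the portability point, here is a minimal sketch, assuming each
store exposes a standard SPARQL protocol endpoint over HTTP (as far as I know
Virtuoso, 4store and Sesame-based stores all do). The endpoint URLs, the
repository name and the function name are placeholders I've made up, not real
deployments. The same read-only query is turned into a request URL for each
store, and only the endpoint address changes:

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL carrying a read-only query.

    The SPARQL protocol passes the query text in the `query` parameter.
    """
    # Use "&" if the endpoint URL already has a query string, else "?".
    separator = "&" if urlsplit(endpoint).query else "?"
    return endpoint + separator + urlencode({"query": query})


# A store-agnostic query: nothing in it is specific to one triplestore.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# Hypothetical local endpoints (assumptions, adjust to your own setup).
ENDPOINTS = {
    "virtuoso": "http://localhost:8890/sparql",
    "4store": "http://localhost:8000/sparql/",
    "sesame": "http://localhost:8080/openrdf-sesame/repositories/example",
}

urls = {name: build_sparql_get_url(ep, QUERY) for name, ep in ENDPOINTS.items()}

for name, url in urls.items():
    print(name, url)
```

Fetching one of these URLs with an Accept header such as
application/sparql-results+json should return the result set, so application
code written this way would not need to change if the store behind it does.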

Owen

On Fri, May 13, 2011 at 11:39 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> On 12 May 2011 17:32, William Waites <ww at styx.org> wrote:
> > Copying Soeren since you were just discussing this with him.
> >
> [...]
>
> > OTOH, it is very easy to make a public endpoint available with 4store,
> > it is easy to work with the code and fix bugs when they are
> > encountered.  We have found bugs in both, and fixing them in Virtuoso
> > means asking their support whereas with 4store we can do it
> > ourselves. And you will remember well the extended period we spent
> > with Virtuoso's support people getting bugs fixed and running
> > snapshots - so the "installable from debian/ubuntu" is not really
> > relevant, it is unlikely for us to run those versions anyways.
>
> I think it is :-) I think we will have big problems if we are not
> running out of a relatively standard repo like Debian. It's already a bit
> of a nightmare doing dev and production stuff, and having to compile
> from source every time is too much, I think ...
>
> > But more importantly, this business of trying to build a silo with the
> > world's library catalogue in it and then do stuff is wrong and is not
> > the way that linked data is meant to work. What we *should* be doing
> > is making sure that the basic ground data is available and
> > queryable. That requires big iron and few moving parts and is not
> > something that OKF is particularly in a position to do. Then on top of
> > those data sources you make apps. So an app like bibliographica would
> > pull in records from different places *as needed*. It should not
> > preemptively try to have its own copy of everything. This means that
> > complex fiddly application software is separate in terms of
> > infrastructure and requires for itself few resources.
>
> I like this logic but ...
>
> What happens if we want to enrich the data or correct it (and get the
> Wikipedia guys involved in this!)? I don't really understand how you
> can do that in a separate backend -- you need it to be in the same place,
> don't you? (I *can* well understand how this separation works for
> collection stuff).
>
> One of the major things about open (biblio) data is that it would
> allow for 'crowd-participation' in data enrichment and cleaning and
> that implies merging that data back in somewhere central I think (or
> some pretty nasty distributed querying ...?)
>
> Take 5 very concrete but simple use cases:
>
> * I want to correct a typo in a text title
>
> * I want to link items in the catalogue to material available online
> (e.g. the Gutenberg text of that book, a related Wikipedia article). Do I
> do this somewhere completely separate (in which case, do I need 2 Virtuoso
> instances)?
>
> * I want to merge duplicates of authors or entries
>
> * I want to generate "Works" (to complement the 'manifestations' that
> are Entries)
>
> * I want to show public domain status of entries in the catalogue
>
> > Thinking in terms of a "web application" that uses a "back end" is the
> > mental straightjacket that is causing this pain. Forcing complicated
> > application code between the data and the API makes it hard to work
> > with. This was never supposed to be a LAMP stack that uses a
> > triplestore instead of RDBMS.
>
> OK. Can you explain, though, how we would build the features
> described above in this model?
>
> > In terms of deliverables for the jiscobib project, we have a large
> > corpus of open linked data that we can now make available. What is
> > stopping us from making it available is fiddly application stuff. What
>
> Yes and no. I understand that recently it is some write issues that
> have, e.g., removed the SPARQL endpoint, but the data was made available in
> bulk very soon after we got it and was in Virtuoso pretty quickly. Most of
> the dev work on the application has not blocked pure availability but has
> been the necessary work to build apps, develop usable JSON APIs, etc.
> (in some sense orthogonal to pure availability).
>
> > we should be doing is making it available and having a clean
> > separation between it and the fiddly application stuff.
> >
> > That this is what we should be doing is what we have learned from the
> > project.
> >
> > So I have no objection to making the BL data available using either
> > 4store or Virtuoso. In fact I want that to be the BL's decision and
> > responsibility. In the meantime we can do it for them and in this case
> > most likely use Virtuoso because it is already there. But I want to go
> > back to the *original* idea of making it available in a consistent and
> > standard way and not muddy the waters with application code. That way
> > you or anyone else can write what applications they like using
> > whatever local data stores they like.
> >
> > Make sense?
>
> Very much so -- I think this has been a very useful email. While I
> have questions above I think I'm in broad agreement with this
> separation.
>
> To summarize:
>
> 1. Use virtuoso (it worked and we don't need 4store).
> 2. Reload data (using http://bibliographica.org/ uri base) and index in
> solr
> 3. Disable sparql write permission and reopen sparql endpoint.
> Generally optimize for read.
>
> Separately:
>
> 1. Start working to port our existing stuff into a separate simple app
> that uses its own, separate, triple store or even just sql ...
>  * We can reuse most of our work here
>
> [2. Look at bulk data analysis]
>
> Rufus
>

-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: owen at ostephens.com