[openbiblio-dev] Virtuoso versus 4store

Rufus Pollock rufus.pollock at okfn.org
Fri May 13 10:39:04 UTC 2011


On 12 May 2011 17:32, William Waites <ww at styx.org> wrote:
> Copying Soeren since you were just discussing this with him.
>
[...]

> OTOH, it is very easy to make a public endpoint available with 4store,
> it is easy to work with the code and fix bugs when they are
> encountered.  We have found bugs in both, and fixing them in Virtuoso
> means asking their support whereas with 4store we can do it
> ourselves. And you will remember well the extended period we spent
> with Virtuoso's support people getting bugs fixed and running
> snapshots - so the "installable from debian/ubuntu" is not really
> relevant; we are unlikely to run those versions anyway.

I think it is :-) I think we will have big problems if we are not
running out of a relatively standard repo like Debian. It's already a
bit of a nightmare doing dev and production stuff, and having to
compile from source every time is too much, I think ...

> But more importantly, this business of trying to build a silo with the
> world's library catalogue in it and then do stuff is wrong and is not
> the way that linked data is meant to work. What we *should* be doing
> is making sure that the basic ground data is available and
> queriable. That requires big iron and few moving parts and is not
> something that OKF is particularly in a position to do. Then on top of
> those data sources you make apps. So an app like bibliographica would
> pull in records from different places *as needed*. It should not
> preemptively try to have its own copy of everything. This means that
> complex fiddly application software is separate in terms of
> infrastructure and requires for itself few resources.

I like this logic but ...

What happens if we want to enrich the data or correct it (and get the
Wikipedia guys involved in this!)? I don't really understand how you
can do that in a separate backend -- you need it to be in the same
place, don't you? (I *can* well understand how this separation works
for the collection side of things.)

One of the major things about open (biblio) data is that it would
allow 'crowd participation' in data enrichment and cleaning, and that
implies merging that data back in somewhere central, I think (or some
pretty nasty distributed querying ...?)
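
To make the distributed-querying alternative a bit more concrete, here
is a rough sketch of the kind of federated SPARQL 1.1 query an app
could run to join our catalogue against a remote enrichment store at
query time. The SERVICE endpoint and the wikipediaArticle property are
made up for the example, not anything we actually run:

    # Illustrative only: the remote endpoint and the enrichment
    # property are hypothetical placeholders.
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?entry ?title ?wikipedia
    WHERE {
      ?entry dct:title ?title .
      SERVICE <http://example.org/enrichment/sparql> {
        ?entry <http://example.org/ns/wikipediaArticle> ?wikipedia .
      }
    }
    LIMIT 20

Whether the stores we use cope with SERVICE well enough under real
load is exactly where the "pretty nasty" part comes in.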

Take five very concrete but simple use cases (see the sketch after
the list):

* I want to correct a typo in a text title

* I want to link items in the catalogue to material available online
(e.g. the Gutenberg text of that book, or a related Wikipedia
article). Do I do this somewhere completely separate (in which case do
I need two Virtuoso instances)?

* I want to merge duplicates of authors or entries

* I want to generate "Works" (to complement the 'manifestations' that
are Entries)

* I want to show the public domain status of entries in the catalogue
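
At the triple level the first couple of cases are small,
well-understood operations -- roughly the SPARQL Update below, where
the entry URI, the title strings and the Gutenberg link are invented
purely for illustration. The real question is which store these writes
land in and who runs it:

    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # (1) correct a typo in a title (entry URI and strings are placeholders)
    DELETE { <http://bibliographica.org/entry/123> dct:title "Teh Origin of Species" }
    INSERT { <http://bibliographica.org/entry/123> dct:title "The Origin of Species" }
    WHERE  { <http://bibliographica.org/entry/123> dct:title "Teh Origin of Species" } ;

    # (2) link the same entry to material available online (placeholder id)
    INSERT DATA {
      <http://bibliographica.org/entry/123>
        rdfs:seeAlso <http://www.gutenberg.org/ebooks/12345>
    }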

> Thinking in terms of a "web application" that uses a "back end" is the
> mental straightjacket that is causing this pain. Forcing complicated
> application code between the data and the API makes it hard to work
> with. This was never supposed to be a LAMP stack that uses a
> triplestore instead of RDBMS.

OK. Can you explain, though, how we would build the features described
above in this model?

> In terms of deliverables for the jiscobib project, we have a large
> corpus of open linked data that we can now make available. What is
> stopping us from making it available is fiddly application stuff. What

Yes and no. I understand that recently some write issues have, e.g.,
taken the SPARQL endpoint down, but the data was made available in
bulk very soon after we got it, and it was in Virtuoso pretty quickly.
Most of the dev work on the application has not blocked pure
availability; it has been the necessary work to build apps, develop
usable JSON APIs, etc. (in some sense orthogonal to pure
availability).

> we should be doing is making it available and having a clean
> separation between it and the fiddly application stuff.
>
> That this is what we should be doing is what we have learned from the
> project.
>
> So I have no objection to making the BL data available using either
> 4store or Virtuoso. In fact I want that to be the BL's decision and
> responsibility. In the meantime we can do it for them and in this case
> most likely use Virtuoso because it is already there. But I want to go
> back to the *original* idea of making it available in a consistent and
> standard way and not muddy the waters with application code. That way
> you or anyone else can write what applications they like using
> whatever local data stores they like.
>
> Make sense?

Very much so -- I think this has been a very useful email. While I
have questions above, I think I'm in broad agreement with this
separation.

To summarize:

1. Use Virtuoso (it worked and we don't need 4store).
2. Reload the data (using the http://bibliographica.org/ URI base) and
index it in Solr.
3. Disable SPARQL write permission and reopen the SPARQL endpoint.
Generally optimize for read (see the example query below).
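
Once (3) is done the endpoint is just a plain read-only SPARQL
service, so anyone should be able to run something along these lines
against it (the use of dct:title is an assumption about how the loaded
records are modelled):

    # Read-only query against the reopened endpoint; dct:title is an
    # assumed modelling choice for titles in the loaded records.
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?entry ?title
    WHERE {
      ?entry dct:title ?title .
      FILTER regex(?title, "origin of species", "i")
    }
    LIMIT 10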

Separately:

1. Start working to port our existing stuff into a separate simple app
that uses its own, separate, triple store or even just SQL ...
 * We can reuse most of our work here

[2. Look at bulk data analysis]

Rufus
