[openbiblio-dev] Virtuoso versus 4store

Thu May 12 16:32:30 UTC 2011

Copying Soeren since you were just discussing this with him.

So there are two things here. The problem with Virtuoso is not the
python bindings but its ODBC interface, where it is extremely easy to
cause deadlocks. The other problem with Virtuoso is that if you use
its SPARQL interface there is no easy way to have both a read-only and
a read-write endpoint (it is possible, but only by jumping through a
bunch of hoops).
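
To give an idea of the kind of hoop involved: one option is a small
filter in front of the SPARQL endpoint that looks at the first
operation keyword of each request and only lets queries through on the
public side. The sketch below is an illustration under assumptions,
not a description of the deployed setup. The name `classify` and the
keyword lists are hypothetical; the lists follow SPARQL 1.1 plus the
older SPARUL MODIFY. Anything not recognised (including Virtuoso's
DEFINE pragmas) comes back as "unknown" and would be refused on a
read-only endpoint.

```python
import re

# Operation keywords, following SPARQL 1.1 Query and Update.
# MODIFY is from the older SPARUL syntax. These lists are an assumption
# made for this sketch, not taken from any particular store.
QUERY_KEYWORDS = {"SELECT", "ASK", "CONSTRUCT", "DESCRIBE"}
UPDATE_KEYWORDS = {"INSERT", "DELETE", "LOAD", "CLEAR", "CREATE", "DROP",
                   "COPY", "MOVE", "ADD", "WITH", "MODIFY"}

# Only the prologue has to be scanned: IRIs, comments, and bare words
# (a trailing colon marks a prefix name such as "dc:").
_TOKEN = re.compile(
    r"(?P<iri><[^<>\s]*>)|(?P<comment>#[^\n]*)|(?P<word>[A-Za-z_][\w.-]*:?)"
)


def classify(sparql):
    """Return 'query', 'update' or 'unknown' from the first operation keyword."""
    for match in _TOKEN.finditer(sparql):
        if match.lastgroup != "word":
            continue  # skip IRIs and comments
        word = match.group("word")
        upper = word.upper()
        if word.endswith(":") or upper in ("PREFIX", "BASE"):
            continue  # still inside the prologue
        if upper in QUERY_KEYWORDS:
            return "query"
        if upper in UPDATE_KEYWORDS:
            return "update"
        return "unknown"
    return "unknown"
```

A read-only front end would forward a request only when classify()
returns "query" and reject everything else. This checks the first
keyword only, so it is a gatekeeper for well-formed requests and not a
full SPARQL parser.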

Virtuoso does need less RAM, but you will remember how painful the
import process was, even before turning on the FTS for some
predicates. I threw RAM at 4store for the import process but then,
once that was done, cut back too much (to 4GB total for the back-end
machines). That is why it was halted. So your comparison is not fair.

OTOH, it is very easy to make a public endpoint available with 4store,
and it is easy to work with the code and fix bugs when they are
encountered. We have found bugs in both, and fixing them in Virtuoso
means asking their support, whereas with 4store we can do it
ourselves. And you will remember well the extended period we spent
with Virtuoso's support people getting bugs fixed and running
snapshots - so the "installable from debian/ubuntu" point is not
really relevant, since we are unlikely to run those versions anyway.

But more importantly, this business of trying to build a silo with the
world's library catalogue in it and then do stuff is wrong and is not
the way that linked data is meant to work. What we *should* be doing
is making sure that the basic ground data is available and
queryable. That requires big iron and few moving parts and is not
something that OKF is particularly in a position to do. Then on top of
those data sources you make apps. So an app like bibliographica would
pull in records from different places *as needed*. It should not
preemptively try to have its own copy of everything. This means that
the complex, fiddly application software runs on separate
infrastructure and itself needs few resources.
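
As a rough sketch of what "as needed" could look like in application
code: the app asks a list of SPARQL endpoints for a description of a
record only when that record is first requested, and keeps the result
locally. Everything named here is hypothetical (`describe_url`,
`LazyRecords`, the endpoint URLs); the only thing assumed from the
standard is that a SPARQL protocol endpoint accepts a query in the
`query` parameter of a GET request. The actual HTTP call is passed in
as `fetch`, so the sketch makes no claim about how the app does its
networking.

```python
import re
from urllib.parse import urlencode


def describe_url(endpoint, resource_uri):
    """Build a SPARQL protocol GET URL asking the endpoint to DESCRIBE a resource."""
    # Refuse characters that would break out of the <...> IRI in the query.
    if re.search(r'[<>"\s]', resource_uri):
        raise ValueError("not a usable IRI: %r" % resource_uri)
    return endpoint + "?" + urlencode({"query": "DESCRIBE <%s>" % resource_uri})


class LazyRecords:
    """Fetch records on first use from a list of endpoints; no bulk copy."""

    def __init__(self, endpoints, fetch):
        self.endpoints = list(endpoints)  # hypothetical endpoint URLs
        self.fetch = fetch                # callable: url -> body, or None if nothing found
        self.cache = {}

    def get(self, resource_uri):
        if resource_uri not in self.cache:
            for endpoint in self.endpoints:
                body = self.fetch(describe_url(endpoint, resource_uri))
                if body:
                    self.cache[resource_uri] = body
                    break
            else:
                return None  # no endpoint knew about this record
        return self.cache[resource_uri]
```

The point of the design is that the application holds only the records
somebody has actually asked for, while the big data sets stay with
whoever publishes them.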

Thinking in terms of a "web application" that uses a "back end" is the
mental straitjacket that is causing this pain. Forcing complicated
application code between the data and the API makes it hard to work
with. This was never supposed to be a LAMP stack that uses a
triplestore instead of an RDBMS.

In terms of deliverables for the jiscobib project, we have a large
corpus of open linked data that we can now make available. What is
stopping us from making it available is fiddly application stuff. What
we should be doing is making it available and having a clean
separation between it and the fiddly application stuff.

That is the lesson we have learned from the project.

So I have no objection to making the BL data available using either
4store or Virtuoso. In fact I want that to be the BL's decision and
responsibility. In the meantime we can do it for them and in this case
most likely use Virtuoso because it is already there. But I want to go
back to the *original* idea of making it available in a consistent and
standard way and not muddy the waters with application code. That way
you or anyone else can write what applications they like using
whatever local data stores they like.

Make sense?

-w

* [2011-05-12 16:20:56 +0100] Rufus Pollock <rufus.pollock at okfn.org> wrote:

] There has been recent discussion with Will Waites about what we use as
] our backend for openbiblio.
] 
] Rough summary (Will knows more so I am sure he can add):
] 
] 1. We have used 4store and virtuoso. Both have been quite painful to install.
] 2. We switched to virtuoso as default ~ 6 months ago
] 3. We encountered a show-stopper bug in the virtuoso python bindings
] 3 months ago. This is still not resolved AFAIK.
] 4. This necessitated rewriting code to use Virtuoso sparql interface.
] The problem with this is there is no way in Virtuoso to distinguish
] GET from UPDATE/DELETE ops in sparql. We therefore had to shut down
] the sparql API.
] 5. Will experimented with a migration back to 4store a few weeks ago.
] We started a production deployment 2 weeks ago but this was halted
] because resource usage seemed very high (2x16GB store plus api machine
] plus web app machine compared to previous 8GB virtuoso store + 1
] machine for webapp). 4store does appear to require a more complex
] production environment and to be more demanding of resources.
] 
] Question: what do we do?
] 
] Secondary question: can we abstract the code so it doesn't care which
] backend it is using?
] 
] In my opinion we should be cautious about switching away again to 4store:
] 
]  * Virtuoso is working
]  * We now have extensive (and tested) documentation on installation
] and deployment
]  * We have experience of Virtuoso working ok.
] 
] That said we don't currently have a SPARQL endpoint (if we could
] somehow restrict write ops via SPARQL we'd be ok again ...). IMO this
] isn't a huge deal *if* we have a working solr instance and APIs are
] operational but it would be interesting to know what others thought
] here.
] 
] Rufus
] 

-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45