[openbiblio-dev] New instance: eu11.okfn.org

Rufus Pollock rufus.pollock at okfn.org
Mon Nov 15 15:29:47 UTC 2010

On 15 November 2010 12:09, William Waites <ww at eris.okfn.org> wrote:
> * [2010-11-15 11:43:39 +0000] Rufus Pollock <rufus.pollock at okfn.org> wrote:
> ]
> ] That's still very slow (i.e. 59h to do the whole lot!). Can one turn
> ] off transactions or the like for bulk uploading to speed it up?
> I could turn the indexing off (particularly the FTS index)
> but then it would just have to be built after the load
> anyways. Might be marginally faster.

May be worth doing, though I realize I may have an order-of-10 error in my
time estimates, so this may not be such a big deal.

> Also note that the data that we have is about 3 million
> records. We don't have the entire 30 million. So, modulo
> the odd record that stops the import (about one every
> million records, so not a big deal to handle manually)
> it should take about 5 hours for the whole import. I
> think it's safe enough to say that all the data will
> be loaded by tomorrow.

Yes, that's good. I got an extra 0 in my original calculation in a
python shell so got 60h!
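
For reference, a rough sketch of the arithmetic. The rate is only what
is implied by "3 million records in about 5 hours"; it is not a measured
figure, and the function name is just for illustration.

```python
def estimate_hours(records, records_per_second):
    """Return the estimated import time in hours at a constant rate."""
    return records / records_per_second / 3600.0

# Assumed rate, back-derived from "3 million records in about 5 hours".
ASSUMED_RATE = 3_000_000 / (5 * 3600.0)

hours_for_data_we_have = estimate_hours(3_000_000, ASSUMED_RATE)
# One extra zero on the record count scales the estimate tenfold.
hours_with_extra_zero = estimate_hours(30_000_000, ASSUMED_RATE)

print(round(hours_for_data_we_have, 1), round(hours_with_extra_zero, 1))
```

So a single stray 0 in the record count is enough to turn a few hours
into a few days.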

> ] Also are you doing any de-duping (at least on entities)? (Since we're
> ] creating them may be sensible to dedupe as part of upload ...)
> Entities are given http://bibliographica.org/entity/hash(name)
> as URIs, this is what Ben did for them, and presumes that
> the authority records are unambiguous. A more detailed
> analysis of this (and other properties of the dataset)
> is the next step once the data is loaded and queriable.

Isn't it hash(name + birth_date + death_date)? I thought that was what
was guaranteed to be unique.
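
To make the question concrete, here is a minimal sketch of what I have
in mind. The SHA-1 digest, the "|" separator and the function name are
my assumptions; I do not know which hash function or field encoding
Ben's code actually uses. Only the base URI comes from the above.

```python
import hashlib

ENTITY_BASE = "http://bibliographica.org/entity/"

def entity_uri(name, birth_date="", death_date=""):
    """Build an entity URI from name plus birth and death dates.

    Hypothetical helper: the real loader may hash different fields
    or use a different hash function.
    """
    # Join with a separator so that ("Smith", "1900") and
    # ("Smith1", "900") cannot produce the same key.
    key = "|".join(part.strip() for part in (name, birth_date, death_date))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return ENTITY_BASE + digest

# Same name, different dates: two distinct entity URIs.
a = entity_uri("Smith, John", "1850", "1910")
b = entity_uri("Smith, John", "1920", "1990")
print(a != b)
```

With name alone, both of those would collapse into one entity.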

> In general we do *not* want to do any invasive operations
> like trying to dedup as part of the upload. Far better
> to keep things separate and simple. The only modifications
> to the data so far are some basic vocabulary cleanup and
> the addition of some owl:sameAs links for the entities
> and for ISBNs.

I agree we do not want to dedup actual catalogue entries -- just the
entities we are creating.

