[okfn-labs] data reconciliation tool

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Sat Sep 29 08:46:27 UTC 2012


Hey Martin,

On Fri, Sep 28, 2012 at 8:12 PM, Martin Keegan <martin.keegan at okfn.org> wrote:
> Thanks - it looks as good as it did in Helsinki

Happy to hear. After discussions with rgrp, I'm strongly considering
changing the names of a few domain objects (especially in the UI, where
they are confusing), e.g. Link -> Alias and Value -> Reference, Lemma or
Canonical -- what do you think?

> how does it fit in with PyBossa and Transformer, if at all?
>
> Is it a complement, substitute, neither, or both?

Good question. The reason nomenklatura doesn't plug into pyBossa is
that the user-facing recon only makes up a small part of the app (and
relies on server-side infrastructure such as distance calc and
memcache). Most of the work is providing the API, managing code lists
and allowing for dedup etc.
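
To make the "distance calc" part concrete, here is a minimal sketch of
ranking recon candidates by edit distance. It is an illustration only,
not nomenklatura's actual code: the choice of Levenshtein distance, the
names levenshtein and rank_candidates, and the use of lru_cache as an
in-process stand-in for memcache are all assumptions.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)  # in-process stand-in for a shared cache such as memcache
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def rank_candidates(query, values, limit=5):
    """Return up to `limit` (value, distance) pairs, closest first."""
    normalized = query.strip().lower()
    scored = [(v, levenshtein(normalized, v.strip().lower())) for v in values]
    # Sort by distance, then alphabetically so that ties are deterministic.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:limit]
```

Calling rank_candidates("Untied Kingdom", ["Germany", "United Kingdom"])
would put "United Kingdom" first; a reviewer could then confirm or
reject the suggested match.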

I think we should use pyBossa for small, self-contained tasks (such as
transcriptions, content extraction etc.). nomenklatura is for data
integration, which, by definition, isn't self-contained. In that sense
I see them as very complementary.

As for Transformer, I don't get the app yet - I think you need to
start somewhere more specific (e.g. "date/number format fixers",
"jquery scraping", ...). I'm also not sold on the use of GitHub as a
data store; my sense is still that some couch update feed-like
thing with CKAN's datastore might work better...

> My actual interest is really in having a recon tool which consults
> *remote* sources - I assume it won't be too hard to adapt nomenklatura
> for that.

So the next functional addition I want to make to nomenklatura is
CSV import/export. There are several types of import: import of the
authoritative list, import of recon candidates and import of
pre-defined alias mappings. There is no reason this should not accept
an HTTP URL as input - it's going to be an offline process anyway.
What do you think?
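
As a rough illustration of the alias-mapping import, the sketch below
reads a CSV from either a local path or an HTTP URL. Nothing here is
nomenklatura's actual code: the function names and the "alias" and
"value" column headers are hypothetical, and UTF-8 input is assumed.

```python
import csv
import io
from urllib.request import urlopen


def open_source(source):
    """Open a local path or an http(s) URL as a text stream."""
    if source.startswith(("http://", "https://")):
        # Offline batch import, so reading the whole body into memory is fine.
        with urlopen(source) as response:
            return io.StringIO(response.read().decode("utf-8"))
    return open(source, newline="", encoding="utf-8")


def read_alias_mappings(stream):
    """Read rows with hypothetical 'alias' and 'value' columns into a dict."""
    mappings = {}
    for row in csv.DictReader(stream):
        alias = (row.get("alias") or "").strip()
        value = (row.get("value") or "").strip()
        if alias and value:  # skip incomplete rows
            mappings[alias] = value
    return mappings
```

Usage would be along the lines of
read_alias_mappings(open_source(path_or_url)); the same open_source
helper could feed the other two import types.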

- Friedrich



