[okfn-labs] The future of Nomenklatura
Friedrich Lindenberg
friedrich.lindenberg at okfn.org
Mon Nov 11 19:04:05 UTC 2013
Thanks, all, for these helpful comments!
Agreed. In fact, that's why I want to move from Postgres to Mongo: to allow
a semi-controlled set of non-name attributes, which I'm hoping to base on
the Popolo specification (http://popoloproject.com/).
There's some cool open source stuff around SILK
(https://www.assembla.com/wiki/show/silk/Silk_Link_Discovery_Engine) that
does multi-attribute integration/matching, but that would mean opting into
the Java/RDF side of things. For now, dedupe
(https://github.com/open-city/dedupe/) might already do this well enough -
and if not, we could teach it new tricks!
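
To make "multi-attribute matching" a bit more concrete, here is a rough
sketch in plain Python. It is not how SILK or dedupe work and uses neither
of their APIs; it only compares names with the standard library's difflib
and refuses to match across countries. The record layout (`name`,
`country`), the function names, the 0.8 threshold and the sample records
are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(record, known, threshold=0.8):
    """Return (score, entity) pairs for known entities in the same country.

    Only entities whose name similarity reaches the threshold are kept,
    best match first. The threshold is an arbitrary illustrative value.
    """
    scored = [
        (name_score(record["name"], entity["name"]), entity)
        for entity in known
        if entity["country"] == record["country"]
    ]
    hits = [pair for pair in scored if pair[0] >= threshold]
    # Sort on the score only, so the dicts themselves are never compared.
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Made-up sample data, loosely following the examples in this thread.
known = [
    {"name": "Department for Education", "country": "GB"},
    {"name": "Department of Education", "country": "US"},
    {"name": "Ministry of Justice", "country": "GB"},
]
record = {"name": "Department of Education", "country": "GB"}

for score, entity in match(record, known):
    print(round(score, 2), entity["name"], entity["country"])
```

With the country attribute in play, the identically named US body is never
offered as a candidate for the GB record; the near-miss "Department for
Education" is. A user would still have to confirm the suggestion.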
In any case, the key will probably remain a good UI: we can do a lot of
legwork in the background, but a user still needs to verify and adopt
the results. I'm only hoping they get to do it in nice, big chunks.
Cheers,
- Friedrich
p.s. Anyone want to hack? https://github.com/pudo/odis/issues
On Mon, Nov 11, 2013 at 7:31 PM, Adi Eyal <adi at code4sa.org> wrote:
> Entity resolution is tough but organisations are probably the most
> difficult. Regardless, you always need more context than just the
> name. David, in your example, you would need country at the very least
> to get any sane matches. Also, before you attempt matching, you almost
> always need to clean your data, e.g. "uk" => "United Kingdom", and
> +44 xxx xxx xxxx might become +44xxxxxxxxxx. Note that in the country
> example, naive fuzzy matching usually doesn't work.
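
A minimal sketch of the kind of cleaning described above, in plain Python.
The alias table, the function names and the sample phone number are made
up for illustration; a real alias table would come from a maintained
country list. The last function shows why naive fuzzy matching is not a
substitute: "uk" and "United Kingdom" barely share any characters.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias table, keyed by lower-cased input.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "gb": "United Kingdom",
    "united kingdom": "United Kingdom",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}


def clean_country(raw):
    """Map a free-text country value to a canonical name.

    Unknown values are returned with surrounding whitespace stripped.
    """
    value = raw.strip()
    return COUNTRY_ALIASES.get(value.lower(), value)


def clean_phone(raw):
    """Keep only the digits of a phone number, plus a leading '+' if present."""
    value = raw.strip()
    digits = re.sub(r"\D", "", value)
    return "+" + digits if value.startswith("+") else digits


def similarity(a, b):
    """Naive character-level similarity, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(clean_country(" UK "))            # canonical name via the alias table
print(clean_phone("+44 20 7946 0000"))  # made-up number, spaces removed
print(similarity("uk", "United Kingdom") < 0.5)  # fuzzy score stays low
```

The lookup table does the work here, not the string distance, which is why
the cleaning step has to run before any fuzzy comparison.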
>
> Adi
>
> On 11 November 2013 20:06, David Read <david.read at hackneyworkshop.com>
> wrote:
> > Friedrich,
> >
> > Although pooling sounds good for some sorts of entities, it might not
> > be so simple for public bodies. If a dataset refers to "Department of
> > Education" or "Department of Justice" then they may mean the US Bodies
> > of those names, or the Northern Ireland bodies also of exactly those
> > names, or may simply have actually meant the UK bodies "Department FOR
> > Education" or "Ministry of Justice". I even have a metadata provider
> > that shortens the publisher to simply "Education" and "Justice". So
> > actually the country that a public body is attached to is crucial to
> > reconciliation.
> >
> > It suggests the value in segmenting by country.
> >
> > David
> >
> > _______________________________________________
> > okfn-labs mailing list
> > okfn-labs at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/okfn-labs
> > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
>
>
> --
> Adi Eyal
> Director
> Code for South Africa
> Promoting informed decision-making
>
> phone: +27 78 014 2469
> skype: adieyalcas
> linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
> web: http://www.code4sa.org
> twitter: @soapsudtycoon
>
> For more information on how to participate in the open data community
> in South Africa, go to: http://www.code4sa.org/#community
>