[okfn-labs] The future of Nomenklatura
Adi Eyal
adi at code4sa.org
Mon Nov 11 18:31:57 UTC 2013
Entity resolution is tough in general, and organisations are probably
the most difficult kind of entity. Regardless, you always need more
context than just the name. David, in your example you would need the
country at the very least to get any sane matches. Also, before you
attempt matching, you almost always need to clean your data, e.g.
uk => United Kingdom, and +44 xxx xxx xxxx might become +44xxxxxxxxxx.
Note from the country example that naive fuzzy matching usually doesn't
work: "uk" and "United Kingdom" are far too different as strings for a
similarity score to link them.
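
To make the cleaning and context points concrete, here is a rough sketch
in Python. It is only an illustration, not how Nomenklatura works: the
alias table, the function names and the 0.85 threshold are all made up
for the example, and difflib's ratio stands in for whatever similarity
measure you would actually use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias table; a real one would be far larger.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "gb": "United Kingdom",
    "united kingdom": "United Kingdom",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}


def clean_country(raw):
    """Map known aliases to one canonical name; pass unknown values through."""
    key = raw.strip().lower().replace(".", "")
    return COUNTRY_ALIASES.get(key, raw.strip())


def clean_phone(raw):
    """Keep a leading '+' (if any) and the digits, dropping spaces and punctuation."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    return "+" + digits if raw.startswith("+") else digits


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, country, candidates, threshold=0.85):
    """Return the closest candidate name within the same cleaned country.

    candidates is a list of (name, country) tuples. The threshold is an
    arbitrary assumption and would need tuning on real data.
    """
    country = clean_country(country)
    scored = [
        (similarity(name, cand_name), cand_name)
        for cand_name, cand_country in candidates
        if clean_country(cand_country) == country
    ]
    if not scored:
        return None
    score, match = max(scored)
    return match if score >= threshold else None


bodies = [
    ("Department of Education", "US"),
    ("Department for Education", "United Kingdom"),
]
# Cleaning first means "uk" and "United Kingdom" land in the same segment,
# and the US body of nearly the same name is never considered.
print(best_match("Department for Education", "uk", bodies))
print(clean_phone("+44 123 456 7890"))
```

Comparing only within the cleaned country is the segmenting David
suggests; the fuzzy score is then used just to rank names inside that
segment.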
Adi
On 11 November 2013 20:06, David Read <david.read at hackneyworkshop.com> wrote:
> Friedrich,
>
> Although pooling sounds good for some sorts of entities, it might not
> be so simple for public bodies. If a dataset refers to "Department of
> Education" or "Department of Justice" then they may mean the US Bodies
> of those names, or the Northern Ireland bodies also of exactly those
> names, or may simply have actually meant the UK bodies "Department FOR
> Education" or "Ministry of Justice". I even have a metadata provider
> that shortens the publisher to simply "Education" and "Justice". So
> actually the country that a public body is attached to is crucial to
> reconciliation.
>
> It suggests the value in segmenting by country.
>
> David
--
Adi Eyal
Director
Code for South Africa
Promoting informed decision-making