[okfn-labs] The future of Nomenklatura

Adi Eyal adi at code4sa.org
Mon Nov 11 14:44:16 UTC 2013


I have never used Nomenklatura so ignore this message if it is out of
place. I have however had a lot of experience with entity resolution.
There is a problem that you need to watch out for. I call it the
transitivity problem.

Let's say that you have 3 entities, A, B and C. Also, let's say that
you have some scoring threshold, call it x, at or above which you
classify two entities as referring to the same person, and similarly
another threshold, call it y, at or below which you decide that two
entities are definitely not the same.

Note that x >= y, i.e. x - y >= 0. In other words, you might have an
area of uncertainty between y and x.

The transitivity problem arises when

similarity(A, B) >= x
similarity(B, C) >= x
similarity(A, C) <= y

In other words, A = B and B = C, but A != C.

In this case you need to figure out which of the three predicates to believe.
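
As a rough sketch of how you might at least detect such cases (Python;
the function name, the scores and the threshold values are all made up
for illustration and have nothing to do with Nomenklatura's code):

```python
from itertools import combinations


def find_intransitive_triples(entities, similarity, x, y):
    """Return (a, b, c) where a~b and b~c score >= x but a~c scores <= y."""
    found = []
    for b in entities:  # b is the "bridge" entity in the middle
        others = [e for e in entities if e != b]
        for a, c in combinations(others, 2):
            if (similarity(a, b) >= x
                    and similarity(b, c) >= x
                    and similarity(a, c) <= y):
                found.append((a, b, c))
    return found


# Hypothetical symmetric pairwise scores, for illustration only.
scores = {
    frozenset(("A", "B")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("A", "C")): 0.20,
}


def similarity(p, q):
    return scores[frozenset((p, q))]


triples = find_intransitive_triples(["A", "B", "C"], similarity, x=0.8, y=0.3)
print(triples)
```

This checks every triple, so it is only practical for small sets of
candidates, e.g. within one cluster.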

I don't think that there is a hard and fast rule to resolve this. It
depends on how you implement your entity matching algorithm and how
you define entity equivalence.

Adi

On 11 November 2013 12:38, Michael Bauer <michael.bauer at okfn.org> wrote:
> Friedrich,
>
> As one of the users of Nomenklatura in the past (and sometimes having my
> fingers in the code), I do think this is a good plan. One of the hardest
> things to do was the never-ending reconciliation one by one, so a cluster
> feature definitely helps.
>
> Also, having uniform URIs for entities is an often-wanted feature. Totally
> agree.
>
> Michael
>
> On Sat, Nov 09, 2013 at 03:04:32PM +0100, Friedrich Lindenberg wrote:
>> Hi all,
>>
>> while I've been using nomenklatura successfully in a variety of services
>> for the past couple of months, it hasn't really spread and found more
>> users. At the same time, I'm beginning to meet its limitations with larger
>> datasets.
>>
>>
>> Problems with nomenklatura
>> --------------------------
>>
>> Some of the problems that people have reported have been about
>> understanding what the service does in the first place, as well as the
>> quality of the current implementation (e.g. the upload has been partially
>> broken and its UI cryptic).
>>
>> Beyond that, there are several limitations to nomenklatura. One is the lack
>> of a clustering mechanism. The tool only compares entity labels one-to-one,
>> rather than trying to create larger groups - like, for example, Refine does
>> in its "Cluster & Edit" mode. This makes it harder to crunch large datasets
>> effectively.
>>
>> At the same time, nomenklatura's notion of datasets prevents the service
>> from helping users to discover links across datasets - e.g. a list of all
>> EU lobbyists might overlap with those companies competing for EU tenders.
>>
>>
>> Proposed approach
>> -----------------
>>
>> To tackle these issues and to make nomenklatura more attractive for new
>> users, I'm considering a fairly radical re-framing of the service. This
>> would include the following changes:
>>
>> * Limit the semantics of the services to only recognize social entities,
>> i.e. people, companies, public bodies and similar items. This should help
>> clarify the use case and make the service easier to understand.
>> * Create a global ID space and generate one URI per entity, independent of
>> its source dataset.
>> * Replace datasets with "contexts", where one entity can be part of
>> multiple contexts.
>> * Build out a clustering mode inspired by Refine that can work either
>> within a context or globally.
>> * Use Popolo-inspired microformats to store further attributes for each
>> entity.
>>
>> Technically, this would be accomplished by:
>>
>> * Switching to MongoDB for storage
>> * Re-building the UI in AngularJS
>>
>> The advantages of this approach would be:
>>
>> * Creates links between datasets, aiming towards a flexible, re-usable
>> entity namespace.
>> * Provide a richer set of entities to cluster with, thus hopefully better
>> data integration.
>> * Could more easily serve as a backend to publicbodies.org
>>
>> I'm keen to hear what people think about this kind of plan, and if anyone
>> wants to contribute to such an effort - or knows about existing efforts
>> that this could pair up with!
>>
>> Cheers,
>>
>> - Friedrich
>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
>
> --
> Data Diva | skype: mihi_tr | @mihi_tr
> The Open Knowledge Foundation | School of Data
> http://okfn.org | http://schoolofdata.org
> GPG/PGP key: http://tentacleriot.eu/mihi.asc



-- 
Adi Eyal
Director
Code for South Africa
Promoting informed decision-making

phone: +27 78 014 2469
skype: adieyalcas
linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
web: http://www.code4sa.org
twitter: @soapsudtycoon

For more information on how to participate in the open data community
in South Africa, go to: http://www.code4sa.org/#community
