[okfn-labs] The future of Nomenklatura

Mon Nov 11 10:38:03 UTC 2013

Friedrich,

As a past user of Nomenklatura (who sometimes had my fingers in the
code), I do think this is a good plan. One of the hardest things to do was
the never-ending one-by-one reconciliation, so a cluster feature
definitely helps.
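
To illustrate what such a cluster feature could look like, here is a
minimal sketch of key-collision clustering, one of the methods behind
Refine's "Cluster & Edit". It is only an assumption about the approach,
not nomenklatura code: the names fingerprint and cluster and the sample
labels are made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(label):
    """Normalise a label to a key; labels sharing a key are merge candidates."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove ASCII punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Keep unique tokens in sorted order so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(labels):
    """Group labels by fingerprint and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for label in labels:
        groups[fingerprint(label)].append(label)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample labels, for illustration only.
labels = [
    "Deutsche Bank AG",
    "deutsche bank ag.",
    "AG Deutsche Bank",
    "Société Générale",
    "Societe Generale",
    "European Commission",
]
clusters = cluster(labels)
for group in clusters:
    print(group)
```

Each printed group could then be reviewed and merged in one step instead
of one pair at a time.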

Also, having uniform URIs for entities is an often-requested feature.
Totally agree.

Michael

On Sat, Nov 09, 2013 at 03:04:32PM +0100, Friedrich Lindenberg wrote:
> Hi all,
> 
> while I've been using nomenklatura successfully in a variety of services
> for the past couple of months, it hasn't really spread and found more
> users. At the same time, I'm beginning to meet its limitations with larger
> datasets.
> 
> 
> Problems with nomenklatura
> --------------------------
> 
> Some of the problems that people have reported have been about
> understanding what the service does in the first place, as well as the
> quality of the current implementation (e.g. the upload has been partially
> broken and its UI cryptic).
> 
> Beyond that, there are several limitations to nomenklatura. One is the lack
> of a clustering mechanism. The tool only compares entity labels one-to-one,
> rather than trying to create larger groups - like, for example, Refine does
> in its "Cluster & Edit" mode. This makes it harder to crunch large datasets
> effectively.
> 
> At the same time, nomenklatura's notion of datasets prevents the service
> from helping users to discover links across datasets - e.g. a list of all
> EU lobbyists might overlap with those companies competing for EU tenders.
> 
> 
> Proposed approach
> -----------------
> 
> To tackle these issues and to make nomenklatura more attractive for new
> users, I'm considering a fairly radical re-framing of the service. This
> would include the following changes:
> 
> * Limit the semantics of the service to only recognize social entities,
> i.e. people, companies, public bodies and similar items. This should help
> clarify the use case and make the service easier to understand.
> * Create a global ID space and generate one URI per entity, independent of
> its source dataset.
> * Replace datasets with "contexts", where one entity can be part of
> multiple contexts.
> * Build out a clustering mode inspired by Refine that can work either
> within a context or globally.
> * Use Popolo-inspired microformats to store further attributes for each
> entity.
> 
> Technically, this would be accomplished by:
> 
> * Switching to MongoDB for storage
> * Re-building the UI in AngularJS
> 
> The advantages of this approach would be:
> 
> * Creates links between datasets, aiming towards a flexible, re-usable
> entity namespace.
> * Provides a richer set of entities to cluster with, and thus hopefully
> better data integration.
> * Could more easily serve as a backend to publicbodies.org
> 
> I'm keen to hear what people think about this kind of plan, and if anyone
> wants to contribute to such an effort - or knows about existing efforts
> that this could pair up with!
> 
> Cheers,
> 
> - Friedrich

-- 
Data Diva | skype: mihi_tr | @mihi_tr
The Open Knowledge Foundation | School of Data
http://okfn.org | http://schoolofdata.org 
GPG/PGP key: http://tentacleriot.eu/mihi.asc