[okfn-labs] The future of Nomenklatura

Friedrich Lindenberg friedrich at pudo.org
Sat Nov 9 14:04:32 UTC 2013


Hi all,

While I've been using nomenklatura successfully in a variety of services
for the past couple of months, it hasn't really spread or found more
users. At the same time, I'm beginning to hit its limitations with larger
datasets.


Problems with nomenklatura
--------------------------

Some of the problems that people have reported have been about
understanding what the service does in the first place, as well as the
quality of the current implementation (e.g. the upload has been partially
broken and its UI is cryptic).

Beyond that, there are several limitations to nomenklatura. One is the lack
of a clustering mechanism. The tool only compares entity labels one-to-one,
rather than trying to create larger groups - like, for example, Refine does
in its "Cluster & Edit" mode. This makes it harder to crunch large datasets
effectively.
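
To make the difference concrete, here is a minimal sketch of key-collision
clustering, loosely modelled on the "fingerprint" method in Refine. It is
not code from nomenklatura; the function names and the sample labels are
made up for illustration, and the normalisation is simpler than what
Refine actually does.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(label):
    """Reduce a label to a normalised key (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(labels):
    """Group labels that share a fingerprint; skip singletons."""
    groups = defaultdict(list)
    for label in labels:
        groups[fingerprint(label)].append(label)
    return [group for group in groups.values() if len(group) > 1]


labels = ["Siemens AG", "siemens ag.", "AG Siemens", "Deutsche Bank"]
clusters = cluster(labels)
```

Each label is touched once, so this avoids comparing every pair of labels
one-to-one, which is where the current approach gets slow.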

At the same time, nomenklatura's notion of datasets prevents the service
from helping users to discover links across datasets - e.g. a list of all
EU lobbyists might overlap with those companies competing for EU tenders.


Proposed approach
-----------------

To tackle these issues and to make nomenklatura more attractive for new
users, I'm considering a fairly radical re-framing of the service. This
would include the following changes:

* Limit the semantics of the service to only recognize social entities,
i.e. people, companies, public bodies and similar items. This should help
clarify the use case and make the service easier to understand.
* Create a global ID space and generate one URI per entity, independent of
its source dataset.
* Replace datasets with "contexts", where one entity can be part of
multiple contexts.
* Build out a clustering mode inspired by Refine that can work either
within a context or globally.
* Use Popolo-inspired microformats to store further attributes for each
entity.
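
As a rough sketch of what such a data model could look like (nothing here
exists yet: the class, the field names, the base URI and the context names
are all assumptions for the sake of illustration):

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical base for entity URIs; the real namespace is undecided.
BASE_URI = "http://example.org/entities/"


@dataclass
class Entity:
    label: str
    type: str  # e.g. "person", "company", "public_body"
    # An entity can belong to any number of contexts.
    contexts: set = field(default_factory=set)
    # Free-form, Popolo-style attributes (key names are assumptions).
    attributes: dict = field(default_factory=dict)
    # Global ID, independent of any source dataset.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def uri(self):
        return BASE_URI + self.id


def shared_entities(entities, context_a, context_b):
    """Return the entities that appear in both contexts."""
    return [e for e in entities if {context_a, context_b} <= e.contexts]


acme = Entity("ACME Ltd.", "company", contexts={"eu-lobbyists", "eu-tenders"})
jane = Entity("Jane Doe", "person", contexts={"eu-lobbyists"})
overlap = shared_entities([acme, jane], "eu-lobbyists", "eu-tenders")
```

Because the ID is not tied to a dataset, finding lobbyists that also
compete for tenders becomes a simple membership check.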

Technically, this would be accomplished by:

* Switching to MongoDB for storage
* Re-building the UI in AngularJS

The advantages of this approach would be:

* It creates links between datasets, aiming towards a flexible, re-usable
entity namespace.
* It provides a richer set of entities to cluster with, and thus hopefully
better data integration.
* It could more easily serve as a backend to publicbodies.org

I'm keen to hear what people think about this kind of plan, and if anyone
wants to contribute to such an effort - or knows about existing efforts
that this could pair up with!

Cheers,

- Friedrich