[okfn-labs] The future of Nomenklatura

Wed Nov 13 11:07:38 UTC 2013

Thanks for the feedback, Rufus - comments inline.

On Wed, Nov 13, 2013 at 9:23 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> On 9 November 2013 14:04, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>
>> Hi all,
>>
>> while I've been using nomenklatura successfully in a variety of services
>> for the past couple of months, it hasn't really spread and found more
>> users. At the same time, I'm beginning to meet its limitations with larger
>> datasets.
>>
>>
>> Problems with nomenklatura
>> --------------------------
>>
>> Some of the problems that people have reported have been about
>> understanding what the service does in the first place, as well as the
>> quality of the current implementation (e.g. the upload has been partially
>> broken and its UI cryptic).
>>
>
> I do think the UI / tutorial is possibly the bigger blocker - not the
> actual functionality (but I say that as someone who has only used it to a
> limited extent). I've certainly struggled when pointing it out to others to
> find the obvious "getting started" manual.
>

I guess part of it is the UI/lack of tutorials, but also the fact that data
integration is a genuinely hard thing to talk about in the abstract. I
agree with you, though, that both the UX and the tutorials need to be
drastically extended.

> I wonder if it would be worth, before doing much new coding, to write out
> what a perfect tutorial would look like (focused on nomenklatura to start
> with but perhaps then adding comments about places you'd want to modify).
>

Hm, the train on the "before doing much new coding" thing may have left
already: https://github.com/pudo/odis/graphs/code-frequency

> You could perhaps think of 2 tutorials one for a coder setting up and the
> other for a less technical refine user.
>

I think Michael wrote up the simple API-based tutorial on School Of Data,
but I can't seem to find it any longer - has it been archived anywhere?

>> Beyond that, there are several limitations to nomenklatura. One is the
>> lack of a clustering mechanism. The tool only compares entity labels
>> one-to-one, rather than trying to create larger groups - like, for example,
>> Refine does in its "Cluster & Edit" mode. This makes it harder to crunch
>> large datasets effectively.
>>
>> At the same time, nomenklatura's notion of datasets prevents the service
>> from helping users to discover links across datasets - e.g. a list of all
>> EU lobbyists might overlap with those companies competing for EU tenders.
>>
>
> My sense is that the link problem may be something different (though
> important) - and being another big chunk might want to be kept separate to
> start with.
>

This, to be honest, is the problem I have right now - so I want to solve
it. I can live happily ever after without having journalists use
nomenklatura, but if a) it stays a pain to actually clean up the data that
has been submitted and b) I can't do cross-dataset matches, then I'm stuck
on the TED data and similar places.

> This also makes me wonder: do you have a list of your key user stories -
> that might help clarify which things go into the minimal viable
> enhancement and which don't.
>

Hm, let me have a go. I'm going to assume just one user group, which is
advanced data users - other groups like journalists still seem less
interested in this problem, and also it's fun to do something for experts
once in a while :) Doesn't preclude us from unlocking new user groups later...

* As a data expert, I want to submit my data to the service, either through
an API, in bulk (CSV) or, in extreme cases, manually through a UI so that I
can represent my data.

* As a data expert, I want to submit data through the Refine API, so that I
have a local UI from which I can upload data as well.

* As a data expert, I want to run my ETL scripts against the service and
retrieve either definitive matches or match suggestions so that I can
deduplicate and disambiguate entities in my operational data store.

* As a data expert, I want the service to keep track of unmatched queries
from my ETL process so that I can process them later.

* As a data expert, I want to merge identical entities into a common form
individually based on their title and a visual comparison of their
attributes so that I have more unique entities.

* As a data expert, I want to find overlaps between entities in different
datasets so that I can integrate the datasets either for ad-hoc analysis or
to show them in a single application.

* As a data expert, I want the system to suggest likely clusters of
entities so that I can merge them in bulk.

* As a data expert, I want to instruct clustering and similarity rankings
to consider more than merely the label of an entity and to compare along
other attributes so that I will get better match suggestions.
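
To make the last two stories a bit more concrete, here is a rough sketch of
key-collision clustering, loosely modelled on the fingerprint method in
Refine's "Cluster & Edit" mode. Everything in it is an assumption for
illustration: the function names, the entity dicts and the sample data are
made up, and it is not how nomenklatura currently works.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(label):
    """Normalise a label into a key, loosely following Refine's fingerprint keyer."""
    # Fold accented characters to their ASCII base form.
    text = unicodedata.normalize("NFKD", label)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Sort and de-duplicate tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def suggest_clusters(entities, attributes=()):
    """Group entity IDs whose label fingerprint and chosen attributes collide."""
    groups = defaultdict(list)
    for entity in entities:
        # The key is the label fingerprint plus any extra attributes
        # the user asked to compare along (hypothetical field names).
        key = (fingerprint(entity["label"]),) + tuple(
            str(entity.get(attr, "")).strip().lower() for attr in attributes
        )
        groups[key].append(entity["id"])
    # Only groups with more than one member are merge candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample data, purely for illustration.
entities = [
    {"id": 1, "label": "Siemens AG", "country": "DE"},
    {"id": 2, "label": "siemens, ag.", "country": "de"},
    {"id": 3, "label": "AG Siemens", "country": "AT"},
    {"id": 4, "label": "Capgemini", "country": "FR"},
]
print(suggest_clusters(entities))                           # [[1, 2, 3]]
print(suggest_clusters(entities, attributes=("country",)))  # [[1, 2]]
```

Adding "country" to the key splits off the Austrian record, which is the
kind of effect I'd expect from comparing along more than just the label.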

My sense is that these are reasonably clear and coherent, but I'm eager to
hear other opinions or see more use cases :)
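
For the two ETL stories above, the behaviour I have in mind could be sketched
roughly as follows. This is a toy, in-memory stand-in: the Matcher class, its
method names and the similarity threshold are all hypothetical and not the
actual nomenklatura API. It only uses difflib from the Python standard
library for the fuzzy comparison.

```python
from difflib import SequenceMatcher


class Matcher:
    """Hypothetical in-memory stand-in for the matching service."""

    def __init__(self, aliases, threshold=0.8):
        # aliases maps a known spelling to its canonical entity name.
        self.aliases = {k.strip().lower(): v for k, v in aliases.items()}
        self.threshold = threshold  # assumed cut-off for suggestions
        self.unmatched = []  # queries kept for later review

    def lookup(self, query):
        key = query.strip().lower()
        # A known spelling gives a definitive match.
        if key in self.aliases:
            return {"match": self.aliases[key], "suggestions": []}
        # Otherwise, offer canonical names of sufficiently similar spellings.
        scored = [
            (SequenceMatcher(None, key, known).ratio(), canonical)
            for known, canonical in self.aliases.items()
        ]
        suggestions = sorted(
            {canonical for score, canonical in scored if score >= self.threshold}
        )
        # Remember the query so that it can be processed later.
        if query not in self.unmatched:
            self.unmatched.append(query)
        return {"match": None, "suggestions": suggestions}


# Made-up sample data, purely for illustration.
matcher = Matcher({
    "Siemens AG": "Siemens AG",
    "Siemens Aktiengesellschaft": "Siemens AG",
})
print(matcher.lookup("siemens ag "))
print(matcher.lookup("Siemens A.G."))
print(matcher.unmatched)
```

An ETL script would call lookup() for each raw name, use the definitive match
where there is one, and leave the rest in the unmatched list for a human to
review.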

Cheers,

 - Friedrich

>> Proposed approach
>> -----------------
>>
>> To tackle these issues and to make nomenklatura more attractive for new
>> users, I'm considering a fairly radical re-framing of the service. This
>> would include the following changes:
>>
>
> As above I think writing a proper tutorial for nomenklatura as it stands
> today would be a really valuable use of time before you get into coding but
> I know coding is more fun ;-)
>
>
>> * Limit the semantics of the service to only recognize social entities,
>> i.e. people, companies, public bodies and similar items. This should help
>> clarify the use case and make the service easier to understand.
>> * Create a global ID space and generate one URI per entity, independent
>> of its source dataset.
>> * Replace datasets with "contexts", where one entity can be part of
>> multiple contexts.
>> * Build out a clustering mode inspired by Refine that can work either
>> within a context or globally.
>> * Use Popolo-inspired microformats to store further attributes for each
>> entity.
>>
>
> May be useful to articulate the user stories behind each of these (a bit).
> This seems quite "meaty" and it may be useful to prioritize in some way.
>
>> Technically, this would be accomplished by:
>>
>> * Switching to MongoDB for storage
>> * Re-building the UI in AngularJS
>>
>> The advantages of this approach would be:
>>
>> * Creates links between datasets, aiming towards a flexible, re-usable
>> entity namespace.
>> * Provides a richer set of entities to cluster with, thus hopefully better
>> data integration.
>> * Could more easily serve as a backend to publicbodies.org
>>
>
>> I'm keen to hear what people think about this kind of plan, and if anyone
>> wants to contribute to such an effort - or knows about existing efforts
>> that this could pair up with!
>>
>
> Very excited to see these new developments and will aim to contribute
> where I can :-)
>
> Rufus
>
>
>>
>> Cheers,
>>
>> - Friedrich
>>