[okfn-labs] Entity reconciliation services

Paul Walsh paulywalsh at gmail.com
Sun Jan 24 16:02:51 UTC 2016


Hi Tom,

It took me a while to respond to this one, but thanks very much for your input here. Some comments inline.

> On 18 Jan 2016, at 6:30 PM, Tom Morris <tfmorris at gmail.com> wrote:
> 
> Once you start talking about write support, you've opened a whole 'nother can of worms.  If you just offer reconciliation services to someone else's curated read-only data, you can avoid that whole mess.

I understand that, but the type of use cases we have (and I’m positive we are not alone) calls for some kind of curation of custom lists. We can’t work from the assumption that the work has already been done by experts and is already out there. I agree on the can of worms, but it is something we have to address in some way regardless.

> 
> If you look at the (former) Freebase ecosystem, the reconciliation API was just a small piece.  There was also:
> 
> - a query API to extend a data set by adding new columns based on properties of reconciled entities after reconciliation has taken place
> - a namespaced identifier reconciliation service (somewhat) separate from the main service which did strict ID lookup in almost 100 different identifier namespaces here https://www.freebase.com/authority
> - a backend data loading service called Refinery http://wiki.freebase.com/wiki/Refinery which allowed users to both add new entities and add property values to reconciled entities
> - a web interface to flag pairs of entities for merge and single entities to be split, along with a voting pipeline to allow people to vote on the flags, and an escalation process for admins to resolve conflicting votes.
> 
> The data loading process was roughly:
> - reconcile spreadsheet against Freebase
> - map columns of spreadsheet to graph-based schema of Freebase
> - do a trial upload to debug the first two steps
> - when everything is good, upload to QA queue
> - QA process samples the data set (at 95% confidence interval) and creates queue of items for multiple voters to assess the quality of the sample items
> - If the sample meets the quality threshold (>99% good), the whole dataset is loaded into Freebase
> - as part of the load, each triple is tagged with its provenance including data source, tool chain used, operator of tool, etc.
> 
> After the bulk data load, any individual problems found are cleaned up using the web interface (typically merging unreconciled duplicates).  There was a ton of machinery, backed by a significant pool of paid microtask labor to make that whole system work.

Excellent overview! Definitely a ton of machinery involved, and some real food for thought. I’m sure we can simplify that somewhat by focusing on particular domains of knowledge and engaging the right people. I still don’t see any clear reason *not* to:

* Expand on Nomenklatura (I still see no clear alternative that we can hack on towards the things we need)
* Host instance(s) of it that help us create new curated datasets focused specifically on the areas we are working in (fiscal and medical for OKI now; Friedrich and others have additional use cases)
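
To make that more concrete, here is a rough sketch of what the client side might look like for an app (say OpenSpending) talking to a hosted instance. The endpoint URL and helper function are made up for illustration; the request/response shape assumed here is just the standard Freebase/OpenRefine reconciliation API that we would expect such a service to speak:

    # Sketch only: the endpoint URL below is hypothetical, and the wire format
    # assumed is the Freebase/OpenRefine Reconciliation Service API
    # (a "queries" parameter holding a JSON object of keyed query objects).
    import json
    import requests

    RECON_ENDPOINT = "https://nomenklatura.example.org/api/reconcile"

    def reconcile(names, entity_type=None, limit=3):
        """Send a batch of names to the service; return the best match per name."""
        queries = {"q%d" % i: {"query": name, "limit": limit}
                   for i, name in enumerate(names)}
        if entity_type:
            for q in queries.values():
                q["type"] = entity_type
        resp = requests.get(RECON_ENDPOINT,
                            params={"queries": json.dumps(queries)})
        resp.raise_for_status()
        results = resp.json()
        matches = {}
        for i, name in enumerate(names):
            candidates = results.get("q%d" % i, {}).get("result", [])
            # Only accept candidates the service itself marks as confident matches.
            matches[name] = next((c for c in candidates if c.get("match")), None)
        return matches

    if __name__ == "__main__":
        print(reconcile(["Ministry of Finance", "Dept. of Health"]))

The point being that the client stays tiny as long as the service carries the matching and scoring logic, which is exactly the part several of our apps could share.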

> 
> Wikidata has a much simpler mechanism which relies more on having a large quantity of human labor available, but I predict that it's going to result in lower quality until they implement a more robust infrastructure.
> 
> Maintainers of other registries such as Companies House <https://www.gov.uk/government/organisations/companies-house> and the International Plant Names Index (IPNI) <http://www.ipni.org/> manage by having a small, focused domain which is curated by subject matter experts.
> 
> I suppose you could model a use case for software which could be used by IPNI et al, but it seems like a huge piece of work to bite off.

We need to find a balance between building the mother of all reconciliation apps, which we can’t do, and building something that meets our actual needs right now. Did you get a chance to see the doc that Friedrich prepared here: https://github.com/pudo/nomenklatura/blob/master/DESIGN.md ?
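
And for the service side, here is a minimal sketch of the kind of endpoint a Nomenklatura-style app would need to expose. The dataset, type names and naive scoring below are placeholders for illustration, not anything taken from Friedrich’s doc:

    # Sketch only: in-memory data and trivial scoring stand in for a real
    # curated dataset and proper matching (edit distance, aliases, etc.).
    import json
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    ENTITIES = {
        "gb-mof": "Ministry of Finance",
        "gb-moh": "Ministry of Health",
    }

    SERVICE_METADATA = {
        "name": "Example reconciliation service (sketch)",
        "identifierSpace": "http://example.org/ns/entity",
        "schemaSpace": "http://example.org/ns/type",
    }

    def match(query, limit=3):
        """Naive scoring: exact match scores 100, substring match 50."""
        q = query.strip().lower()
        results = []
        for id_, name in ENTITIES.items():
            n = name.lower()
            if q == n:
                score = 100.0
            elif q and (q in n or n in q):
                score = 50.0
            else:
                continue
            results.append({
                "id": id_,
                "name": name,
                "type": [{"id": "/example/entity", "name": "Entity"}],
                "score": score,
                "match": score >= 100.0,
            })
        results.sort(key=lambda r: r["score"], reverse=True)
        return results[:limit]

    @app.route("/api/reconcile", methods=["GET", "POST"])
    def reconcile():
        queries = request.values.get("queries")
        if not queries:
            # With no queries, the protocol expects the service metadata document.
            return jsonify(SERVICE_METADATA)
        parsed = json.loads(queries)
        return jsonify({key: {"result": match(spec.get("query", ""),
                                              spec.get("limit", 3))}
                        for key, spec in parsed.items()})

    if __name__ == "__main__":
        app.run(port=5000)

The real work is obviously in the scoring, aliases and the datastore behind it, but the protocol surface itself is small.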

And, regarding the earlier comments on other efforts in this space, particularly in the medical domain: thanks for the links, I’m going through them now.

Paul

> 
> Tom
> 
> On Sun, Jan 17, 2016 at 7:24 AM, Friedrich Lindenberg <friedrich.lindenberg at gmail.com> wrote:
> Hey Tom, 
> 
> many thanks for chiming in, really appreciate your input on this.
> 
> The discussion we had was very much around providing a stable and reusable implementation of the Recon API. Apps like OpenSpending would then implement a recon API client against their internal data store (e.g. a budget dataset), and then allow matching against various API endpoints (e.g. for budget taxonomies, but also OC for companies). I'd imagine that we’ll have to extend it to add some basic write support (i.e. “nothing exists to reconcile against, so go create a new entry”).
> 
> So the question is also: who’s done really great work implementing this, or is it worth dusting off and polishing up nomenklatura a whole lot? Looking at the (amazing!) list on the OpenRefine repo, I’d actually say it may be worth putting some time into.
> 
> On hosted vs. software, I’ve learned my lesson: unless there’s some super-solid institutional commitment, I feel hosted stuff is just weekend suicide. Thanks to Docker & co, running stuff has become easier, too. 
> 
> Finally, I’m trying to think through different domain models for this: 
> 
> https://github.com/pudo/nomenklatura/blob/master/DESIGN.md
> 
> Would love any feedback from the list! 
> 
> - Friedrich 
> 
> 
> 
>> On 15 Jan 2016, at 20:45, Tom Morris <tfmorris at gmail.com> wrote:
>> 
>> On Fri, Jan 15, 2016 at 2:13 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>>> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>>> 
>>> Is it safe to assume that you've already reviewed:
>>> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
>> 
>> Hasn’t this been replaced by the read-only Knowledge Graph API? 
>> 
>> https://developers.google.com/knowledge-graph/
>> 
>> If that’s not the case, please let me know. But it seems like it could be a good data source for the type of thing we are seeking, though not a candidate for the thing itself.
>> 
>> As far as I know, Google hasn't announced any plans to produce a KG version of the Reconciliation service.  The only two Freebase APIs being carried forward (that I've heard of) are Search and Suggest.
>> 
>> The pointer was more for functionality, form, and features of the service than as a potential candidate to use. As an aside, there's also no good Wikidata-based alternative available or planned, that I've heard of.
>> 
>> I’m not sure what distinguishes a tool and service here for you, but I’d say we are thinking in terms of Nomenklatura as a starting point, so:
>> 
>> 1. An open source app
>> 2. A hosted service of the same
>> 
>> Using the same name for two different things leads to confusion and fuzziness (cf. CKAN).  An app and a hosted service are two distinct things with different characteristics and requirements.  For example, a piece of software doesn't have an SLA or a list of available datasets in the way a hosted service does.
>>> Some things like brokered reconciliation to existing reconciliation services (e.g. OpenCorporates) sound needlessly complex.
>> I think that is a matter of managing some complexity in a service like this, so that the several apps consuming such a service do not each have to write similar code to manage it.
>> 
>> Without knowing what value the broker adds, it's hard to comment, so I'll reserve judgement, but suffice it to say, I'm skeptical that this will turn out to be a good idea.
>>  
>>> There are plenty of open source implementations of reconciliation services, but the problem with all of the ones that I'm familiar with is that they have very primitive/simple scoring mechanisms (prefix match, edit distance, etc). 
>> If there are plenty that match a decent number of things we’ve listed here, it would be great to hear about them.
>> 
>> There are close to twenty services listed here: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
>>  
>>> Another thing to consider is tabular vs textual entity identification.  In the medical domain it's not uncommon to have textual notes that you'd like to identify drugs, procedures, etc in.  The surrounding textual context in these cases provides useful information to help identify entities.
>> Yes, this is clearly a need we have in OpenTrials, and for part of the work there we’ll be using ContentMine ( http://contentmine.org ). In terms of how textual content can relate to a more general entity reconciliation service as we are describing here, I’m just not sure yet.
>> 
>> I would encourage you to survey available options before settling on something just because it's familiar and nearby.  The ContentMine folks are keen, but there are a lot of other very knowledgeable folks working in this space.  As just one example, the i2b2 community has hosted a half dozen NLP challenges in the medical domain with twenty teams <http://www.j-biomed-inform.com/article/S1532-0464(15)00140-9/fulltext> competing in the 2014 incarnation <https://www.i2b2.org/NLP/HeartDisease/>.
>> 
>> Tom
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
> 
> 
