[okfn-labs] Entity reconciliation services

Sun Jan 17 12:27:47 UTC 2016

Hey Tom, 

many thanks for chiming in, really appreciate your input on this.

The discussion we had was very much around providing a stable and reusable implementation of the Recon API. Apps like OpenSpending would then implement a recon API client against their internal data store (e.g. a budget dataset), and then allow matching against various API endpoints (e.g. for budget taxonomies, but also OC for companies). I'd imagine that we’ll have to extend it to add some basic write support (i.e. “nothing exists to reconcile against, so go create a new entry”).
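To make the client side a bit more concrete, here is a rough sketch in Python of what such a client might do. It assumes the batch query shape used by OpenRefine-style reconciliation endpoints (a form-encoded "queries" parameter keyed q0, q1, ..., with results carrying "score" and "match"). The function names, the score threshold of 80 and the "create" fallback are my own assumptions for illustration; write support is not part of the existing API.

```python
import json
from urllib.parse import urlencode


def build_queries(names, type_id=None, limit=3):
    """Build a batch of reconciliation queries keyed q0, q1, ..."""
    queries = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        queries["q%d" % i] = query
    return queries


def encode_request(queries):
    """Form-encode the batch as a single 'queries' parameter."""
    return urlencode({"queries": json.dumps(queries)})


def decide(result, min_score=80.0):
    """Return ('match', candidate) or ('create', None) for one query result.

    min_score is a hypothetical threshold; real services define their
    own score ranges, so it would need tuning per endpoint.
    """
    candidates = result.get("result", [])
    # Trust the service when it flags a candidate as a confident match.
    for candidate in candidates:
        if candidate.get("match"):
            return ("match", candidate)
    # Otherwise fall back to the best-scoring candidate, if good enough.
    best = max(candidates, key=lambda c: c.get("score", 0), default=None)
    if best is not None and best.get("score", 0) >= min_score:
        return ("match", best)
    # Nothing to reconcile against: the caller would create a new entry.
    return ("create", None)
```

An app would POST the output of encode_request() to whichever endpoint it is matching against, then run decide() on each entry of the JSON response. The "create" branch is where the proposed write extension would hook in.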

So the question is also: who’s done really great work implementing this, or is it worth dusting off nomenklatura and polishing it up a whole lot? Looking at the (amazing!) list on the OpenRefine repo, I’d actually say that may be worth putting some time into. 

On hosted vs. software, I’ve learned my lesson: unless there’s some super-solid institutional commitment, I feel hosted stuff is just weekend suicide. Thanks to Docker & co, running stuff has become easier, too. 

Finally, I’m trying to think through different domain models for this: 

https://github.com/pudo/nomenklatura/blob/master/DESIGN.md

Would love any feedback from the list! 

- Friedrich 

> On 15 Jan 2016, at 20:45, Tom Morris <tfmorris at gmail.com> wrote:
> 
> On Fri, Jan 15, 2016 at 2:13 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>> 
>> Is it safe to assume that you've already reviewed:
>> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
> 
> Hasn’t this been replaced by the read-only Knowledge Graph API? 
> 
> https://developers.google.com/knowledge-graph/
> 
> If not, please let me know. In any case, it seems like it could be a good data source for the type of thing we are seeking, but not a candidate for the thing itself.
> 
> As far as I know, Google hasn't announced any plans to produce a KG version of the Reconciliation service.  The only two Freebase APIs being carried forward (that I've heard of) are Search and Suggest.
> 
> The pointer was more for functionality, form, and features of the service than as a potential candidate to use. As an aside, there's also no good Wikidata-based alternative available or planned, that I've heard of.
> 
> I’m not sure what distinguishes a tool and service here for you, but I’d say we are thinking in terms of Nomenklatura as a starting point, so:
> 
> 1. An open source app
> 2. A hosted service of the same
> 
> Using the same name for two different things leads to confusion and fuzziness (vis CKAN).  An app and a hosted service are two distinct things with different characteristics and requirements.  For example, a piece of software doesn't have an SLA or a list of available datasets in the way a hosted service does.
>> Some things like brokered reconciliation to existing reconciliation services (e.g. OpenCorporates) sound needlessly complex.
> I think that is a matter of managing some complexity in a service like this, so that several apps consuming such a service do not have to write similar code to manage complexity.
> 
> Without knowing what value the broker adds, it's hard to comment, so I'll reserve judgement, but suffice it to say, I'm skeptical that this will turn out to be a good idea.
>  
>> There are plenty of open source implementations of reconciliation services, but the problem with all of the ones that I'm familiar with is that they have very primitive/simple scoring mechanisms (prefix match, edit distance, etc). 
> If there are plenty that match a decent number of things we’ve listed here, it would be great to hear about them.
> 
> There are close to twenty services listed here: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
>  
>> Another thing to consider is tabular vs textual entity identification.  In the medical domain it's not uncommon to have textual notes that you'd like to identify drugs, procedures, etc in.  The surrounding textual context in these cases provides useful information to help identify entities.
> Yes, this is clearly a need we have in OpenTrials, and for part of the work there we’ll be using ContentMine (http://contentmine.org). In terms of how textual content can relate to a more general entity reconciliation service as we are describing here, I’m just not sure yet.
> 
> I would encourage you to survey available options before settling on something just because it's familiar and nearby.  The ContentMine folks are keen, but there are a lot of other very knowledgeable folks working in this space.  As just one example, the i2b2 community has hosted a half dozen NLP challenges in the medical domain with twenty teams <http://www.j-biomed-inform.com/article/S1532-0464(15)00140-9/fulltext> competing in the 2014 incarnation <https://www.i2b2.org/NLP/HeartDisease/>.
> 
> Tom
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
