[okfn-labs] Entity reconciliation services

Tom Morris tfmorris at gmail.com
Fri Jan 15 19:45:14 UTC 2016


On Fri, Jan 15, 2016 at 2:13 AM, Paul Walsh <paulywalsh at gmail.com> wrote:

> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>
> Is it safe to assume that you've already reviewed:
> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
>
> Hasn’t this been replaced by the read-only Knowledge Graph API?
>
> https://developers.google.com/knowledge-graph/
>
> If not, please let me know. It seems like it could be a good data source
> for the type of thing we are seeking, but not a candidate for the thing
> itself.
>

As far as I know, Google hasn't announced any plans to produce a KG version
of the Reconciliation service.  The only two Freebase APIs being carried
forward (that I've heard of) are Search and Suggest.

The pointer was more for the functionality, form, and features of the
service than as a potential candidate to use. As an aside, I haven't heard
of any good Wikidata-based alternative, either available or planned.

> I’m not sure what distinguishes a tool and service here for you, but I’d
> say we are thinking in terms of Nomenklatura as a starting point, so:
>
> 1. An open source app
> 2. A hosted service of the same
>

Using the same name for two different things leads to confusion and
fuzziness (cf. CKAN).  An app and a hosted service are two distinct things
with different characteristics and requirements.  For example, a piece of
software doesn't have an SLA or a list of available datasets in the way a
hosted service does.

> Some things like brokered reconciliation to existing reconciliation
> services (e.g. OpenCorporates) sound needlessly complex.
>
> I think that is a matter of managing some complexity in a service like
> this, so that several apps consuming such a service do not have to write
> similar code to manage complexity.
>

Without knowing what value the broker adds, it's hard to comment, so I'll
reserve judgement.  Suffice it to say, I'm skeptical that this will turn
out to be a good idea.


> There are plenty of open source implementations of reconciliation
> services, but the problem with all of the ones that I'm familiar with is
> that they have very primitive/simple scoring mechanisms (prefix match, edit
> distance, etc).
>
> If there are plenty that match a decent number of things we’ve listed
> here, it would be great to hear about them.
>

There are close to twenty services listed here:
https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
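
To make "primitive/simple scoring" concrete, here is a rough Python sketch of
what prefix-match and edit-distance scoring typically amount to. It is not
taken from any of the services listed above; the function names, the 0-100
score scale, and the fixed 90 for a prefix hit are all assumptions made for
illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def score(query: str, candidate: str) -> float:
    """Hypothetical 0-100 score: exact, then prefix, then edit distance."""
    q, c = query.strip().lower(), candidate.strip().lower()
    if not q or not c:
        return 0.0
    if q == c:
        return 100.0
    if c.startswith(q):
        return 90.0  # arbitrary value chosen for this sketch
    dist = edit_distance(q, c)
    return 100.0 * (1 - dist / max(len(q), len(c)))


def rank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Return (score, candidate) pairs, best match first."""
    return sorted(((score(query, c), c) for c in candidates), reverse=True)


if __name__ == "__main__":
    for s, name in rank("Acme Corp",
                        ["Apex Corp", "Acme Corporation", "Acme Corp"]):
        print(f"{s:6.1f}  {name}")
```

Note that nothing here looks at the entity's type, its other properties, or
the surrounding context, which is the limitation being described.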


> Another thing to consider is tabular vs textual entity identification.  In
> the medical domain it's not uncommon to have textual notes that you'd like
> to identify drugs, procedures, etc in.  The surrounding textual context in
> these cases provides useful information to help identify entities.
>
> Yes, this is clearly a need we have in OpenTrials, and for part of the
> work there we’ll be using ContentMine ( http://contentmine.org ). In
> terms of how textual content can relate to a more general entity
> reconciliation service as we are describing here, I’m just not sure yet.
>

I would encourage you to survey available options before settling on
something just because it's familiar and nearby.  The ContentMine folks are
keen, but there are a lot of other very knowledgeable folks working in this
space.  As just one example, the i2b2 community has hosted a half dozen NLP
challenges in the medical domain with twenty teams
<http://www.j-biomed-inform.com/article/S1532-0464(15)00140-9/fulltext>
competing in the 2014 incarnation <https://www.i2b2.org/NLP/HeartDisease/>.

Tom