[okfn-labs] Entity reconciliation services

Fri Jan 15 07:13:35 UTC 2016

Hi Tom,

> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
> 
> Is it safe to assume that you've already reviewed:
> 
> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-Api

Yes.

> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en

Hasn’t this been replaced by the read-only Knowledge Graph API? 

https://developers.google.com/knowledge-graph/

If not, please let me know. Either way, it seems like it could be a good data source for the kind of service we are seeking, but not a candidate for the service itself.

> 
> The feature list is a bit terse to decipher without the context of the discussion that generated it.  Will the bullet points be expanded?  Is it a tool, a service, or both? 

We can expand on it together, in this thread. The idea here is to get input from others who have more expertise in this area than some or all of us who started the discussion.

I’m not sure what distinguishes a tool from a service here for you, but I’d say we are thinking in terms of Nomenklatura as a starting point, so:

1. An open source app
2. A hosted service of the same

> Some things like brokered reconciliation to existing reconciliation services (e.g. OpenCorporates) sound needlessly complex.

I think that is a matter of managing some complexity in a service like this, so that the several apps consuming it do not each have to write similar code to manage that complexity.
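
As an illustration only, here is a minimal Python sketch of what brokering could look like. Everything in it is hypothetical: the `Broker` class, the `Candidate` type and the in-memory `list_backend` stand in for real adapters (for example one wrapping OpenCorporates), and none of it reflects Nomenklatura's actual code. The point is just that consuming apps call a single `reconcile()` method, and the per-source handling lives in one place.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List


@dataclass
class Candidate:
    id: str
    name: str
    score: float  # 0.0 to 1.0, higher is a closer match


# A backend is any callable that takes a query string and returns candidates.
Backend = Callable[[str], List[Candidate]]


def list_backend(entries: Dict[str, str]) -> Backend:
    """Hypothetical in-memory backend; a real one might call a web service."""
    def lookup(query: str) -> List[Candidate]:
        q = query.strip().lower()
        return [
            Candidate(id=key, name=name,
                      score=SequenceMatcher(None, q, name.lower()).ratio())
            for key, name in entries.items()
        ]
    return lookup


class Broker:
    """Routes a query to the backend registered for a collection."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, collection: str, backend: Backend) -> None:
        self._backends[collection] = backend

    def reconcile(self, collection: str, query: str, limit: int = 3) -> List[Candidate]:
        if collection not in self._backends:
            raise KeyError(f"no backend registered for collection {collection!r}")
        candidates = self._backends[collection](query)
        # Sort and truncate here so every consuming app gets the same contract.
        return sorted(candidates, key=lambda c: c.score, reverse=True)[:limit]


broker = Broker()
broker.register("countries", list_backend({"DE": "Germany", "GE": "Georgia", "GH": "Ghana"}))
best = broker.reconcile("countries", "germany", limit=1)[0]
print(best.id, best.name)
```

A backend that calls a remote service would plug in the same way, as long as it returns `Candidate` objects with scores on a comparable scale; making scores comparable across sources is the hard part and is not addressed here.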

> 
> There are plenty of open source implementations of reconciliation services, but the problem with all of the ones that I'm familiar with is that they have very primitive/simple scoring mechanisms (prefix match, edit distance, etc). 

If there are plenty that match a decent number of things we’ve listed here, it would be great to hear about them.

> They also typically only take a single attribute (i.e. column in your spreadsheet) when you can often get much more powerful scoring using multiple columns (e.g. name, occupation, birth date, nationality, etc).

Yes, this is pretty crucial.
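
A rough Python sketch of the idea, under stated assumptions: the field names, the weights and the use of `difflib` string similarity are all illustrative choices, not a description of any existing reconciliation service. It shows how a second and third column can overturn a ranking that the name alone would get wrong.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Illustrative weights only; real values would need tuning per collection.
WEIGHTS = {"name": 0.6, "birth_date": 0.3, "nationality": 0.1}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def score(row: Dict[str, Optional[str]], candidate: Dict[str, Optional[str]]) -> float:
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    weight_used = 0.0
    for field, weight in WEIGHTS.items():
        a, b = row.get(field), candidate.get(field)
        if not a or not b:
            continue  # a missing value neither helps nor hurts
        total += weight * similarity(a, b)
        weight_used += weight
    return total / weight_used if weight_used else 0.0


row = {"name": "John Smith", "birth_date": "1950-03-01", "nationality": "GB"}
candidates = [
    {"id": "p1", "name": "John Smith", "birth_date": "1982-11-23", "nationality": "US"},
    {"id": "p2", "name": "Jon Smith", "birth_date": "1950-03-01", "nationality": "GB"},
]
ranked = sorted(candidates, key=lambda c: score(row, c), reverse=True)
print([c["id"] for c in ranked])
```

On the name column alone, `p1` is the exact match; once birth date and nationality are weighed in, `p2` ranks first.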

> 
> Another thing to consider is tabular vs textual entity identification.  In the medical domain it's not uncommon to have textual notes that you'd like to identify drugs, procedures, etc in.  The surrounding textual context in these cases provides useful information to help identify entities.

Yes, this is clearly a need we have in OpenTrials, and for part of the work there we’ll be using ContentMine ( http://contentmine.org ). In terms of how textual content can relate to a more general entity reconciliation service as we are describing here, I’m just not sure yet.

Best,

Paul

> Data curation is a key component, so I'm a little dubious about the Nomenklatura "dump your data here" approach.
> I think it's much more successful to have a dedicated curated data source whether it be domain-specific like MusicBrainz, IMDB, OpenCorporates (which is actually an aggregator of individually curated data sets from various registration authorities), etc or general like WikiData, Freebase, etc.
> 
> Tom
> 
> On Thu, Jan 14, 2016 at 10:21 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
> There has been recent discussion around OpenSpending and OpenTrials (two projects at Open Knowledge International) on the need for a solid, well-featured entity reconciliation service.
> 
> The service would help applications which depend on reference data, from country lists and company lists to budget classifications. Examples would be messy source data about party donations, procurement awards, or medicine names.
> 
> The service would provide support for de-duplication and re-classification of source data dimensions against the canonical reference data; and it would allow the construction of canonical lists from messy source data.
> 
> Such a service would be generally useful to the wider open data community, and in initial discussion between Friedrich Lindenberg, Mark Brough and Paul Walsh, we came to some shared understanding of what a service might look like at a high level.
> 
> 
> To learn more about how others have approached this problem, we're putting out a call: We are looking for existing work to build on, open-source tools for reference data. Is there open source code out there that meets many or all of our criteria? If no existing solution can be found, we will hack on Nomenklatura (https://github.com/pudo/nomenklatura) to push it in this direction.
> 
> Features:
> 
>         • Reconciliation endpoints for particular "collections"
>           • Geographical
>           • Budget taxonomies
>           • Companies
>         • Namespacing of data
>           • "collections" is a type of namespacing
>           • but collections need (?) additional context: such as geographical context for company names
>         • Distinct reconciliation strategies (possibly exposed as distinct methods of the API)
>           • Fuzzy, cross field matching
>           • Primary identifier matching
>           • Other?
>         • Read and write against "collections"
>         • Create the code list based on the data being reconciled ("get or create")
>         • Confidence level for matches
>         • Some control over confidence level ("give me the first match over 80% confidence")
>         • Hook into an array of data stores to match against, possibly mapped to "collections"
>           • web services (example: opencorporates)
>           • CSV (hosted somewhere)
>           • Other databases (connection with credentials)?
>         • Make higher level abstractions out of multiple data sources
>           • Example: automate the creation of a geo lookup service by mapping ocd division ids (https://github.com/opencivicdata/ocd-division-ids) onto data from genomes (??)
>         • Simple, modern web client for user-driven reconciliation of data
> 
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
> 
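
Two items in the quoted feature list, "get or create" and a caller-controlled confidence level, can be sketched together in Python. This is a sketch under assumptions, not a proposed design: the `Collection` class, its id scheme and the `difflib` scoring are hypothetical, and it returns the best match at or above the threshold rather than literally the first one found.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


class Collection:
    """Hypothetical in-memory code list keyed by generated ids."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: Dict[str, str] = {}

    def _score(self, a: str, b: str) -> float:
        return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

    def get_or_create(self, label: str, min_confidence: float = 0.8) -> Tuple[str, bool]:
        """Return (entry id, created flag).

        Returns the best existing entry if it scores at or above
        min_confidence; otherwise adds the label as a new canonical entry.
        """
        best_id, best_score = None, 0.0
        for entry_id, canonical in self.entries.items():
            s = self._score(label, canonical)
            if s > best_score:
                best_id, best_score = entry_id, s
        if best_id is not None and best_score >= min_confidence:
            return best_id, False
        new_id = f"{self.name}-{len(self.entries) + 1}"
        self.entries[new_id] = label.strip()
        return new_id, True


companies = Collection("companies")
print(companies.get_or_create("Acme Holdings Ltd"))   # new entry
print(companies.get_or_create("ACME Holdings Ltd."))  # close variant of the first
print(companies.get_or_create("Globex Corporation"))  # unrelated label
```

The threshold is the curation trade-off in miniature: set it too low and distinct entities get merged, too high and the list fills with near-duplicates, which is why a user-driven review step sits alongside it in the feature list.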
