[okfn-labs] Entity reconciliation services

Thu Jan 14 17:29:01 UTC 2016

Is it safe to assume that you've already reviewed:

https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-Api
https://developers.google.com/freebase/v1/reconciliation-overview?hl=en

The feature list is too terse to decipher without the context of the
discussion that generated it.  Will the bullet points be expanded?  Is it a
tool, a service, or both?  Some items, such as brokered reconciliation
against existing reconciliation services (e.g. OpenCorporates), sound
needlessly complex.

There are plenty of open source implementations of reconciliation services,
but the problem with all of the ones that I'm familiar with is that they
have very primitive scoring mechanisms (prefix match, edit distance, etc.).
They also typically take only a single attribute (i.e. one column in your
spreadsheet), when you can often get much more powerful scoring by using
multiple columns (e.g. name, occupation, birth date, nationality, etc.).
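
To make the multi-column point concrete, here is a minimal sketch in
Python.  The column names, the weights, and the use of difflib for string
similarity are all assumptions chosen for illustration; they are not taken
from any existing reconciliation service.

```python
from difflib import SequenceMatcher

# Hypothetical per-column weights, chosen only for illustration.
WEIGHTS = {"name": 0.6, "occupation": 0.15, "birth_date": 0.15, "nationality": 0.1}


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(row, candidate, weights=WEIGHTS):
    """Weighted average of per-column similarities.

    Columns missing from either record are skipped, and the remaining
    weights are renormalised so the result stays in [0, 1].
    """
    total = 0.0
    weight_sum = 0.0
    for column, weight in weights.items():
        if row.get(column) and candidate.get(column):
            total += weight * similarity(row[column], candidate[column])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def rank(row, candidates):
    """Return candidates ordered from best to worst score."""
    return sorted(candidates, key=lambda c: score(row, c), reverse=True)
```

With two candidates that share the same name, a name-only scorer cannot
separate them, while the occupation and birth date columns can break the
tie.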

Another thing to consider is tabular vs textual entity identification.  In
the medical domain it's not uncommon to have textual notes in which you'd
like to identify drugs, procedures, etc.  In these cases the surrounding
text provides useful context that helps identify entities.
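
As a rough illustration of using surrounding text, the sketch below picks
between candidate entities for an ambiguous mention by counting how many of
each candidate's context terms appear near it.  The candidate ids, the
context terms, and the window size are all invented for this example; a
real system would draw them from a curated vocabulary.

```python
import re

# Hypothetical candidates for the ambiguous abbreviation "ms".
# The ids and context terms are made up for illustration.
CANDIDATES = {
    "ms": [
        {"id": "drug/morphine-sulfate", "context": {"mg", "pain", "dose", "iv"}},
        {"id": "condition/multiple-sclerosis",
         "context": {"relapse", "lesions", "mri", "neurology"}},
    ]
}


def tokens(text):
    """Lowercase alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def identify(text, mention, window=5):
    """Return the id of the candidate whose context terms overlap most
    with the words within `window` tokens of the first occurrence of
    `mention`, or None if the mention or its candidates are missing."""
    words = tokens(text)
    candidates = CANDIDATES.get(mention, [])
    if mention not in words or not candidates:
        return None
    i = words.index(mention)
    nearby = set(words[max(0, i - window): i + window + 1])
    best = max(candidates, key=lambda c: len(c["context"] & nearby))
    return best["id"]
```

Ties fall to the first candidate listed, which is an arbitrary choice in
this sketch.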

Data curation is a key component, so I'm a little dubious about the
Nomenklatura "dump your data here" approach.  I think it works much better
to have a dedicated, curated data source, whether domain-specific like
MusicBrainz, IMDb, or OpenCorporates (which is actually an aggregator of
individually curated data sets from various registration authorities), or
general like Wikidata or Freebase.

Tom

On Thu, Jan 14, 2016 at 10:21 AM, Paul Walsh <paulywalsh at gmail.com> wrote:

> There has been recent discussion around OpenSpending and OpenTrials (two
> projects at Open Knowledge International) on the need for a solid and
> well-featured entity reconciliation service.
>
> The service would help applications which depend on reference data, from
> country lists and company lists to budget classifications. Examples of
> inputs would be messy source data about party donations, procurement
> awards, or medicine names.
>
> The service would provide support for de-duplication and re-classification
> of source data dimensions against the canonical reference data; and it
> would allow the construction of canonical lists from messy source data.
>
> Such a service would be generally useful to the wider open data community,
> and in initial discussion between Friedrich Lindenberg, Mark Brough and
> Paul Walsh, we came to some shared understanding of what a service might
> look like at a high level.
>
>
> To learn more about how others have approached this problem, we're putting
> out a call: we are looking for existing work to build on, in particular
> open-source tools for reference data. Is there open source code out there
> that meets many or all of our criteria? If no existing solution can be
> found, we will hack on Nomenklatura (https://github.com/pudo/nomenklatura)
> to push it in this direction.
>
> Features:
>
>         • Reconciliation endpoints for particular "collections"
>           • Geographical
>           • Budget taxonomies
>           • Companies
>         • Namespacing of data
>           • "collections" is a type of namespacing
>           • but collections need (?) additional context, such as geographical context for company names
>         • Distinct reconciliation strategies (possibly exposed as distinct methods of the API)
>           • Fuzzy, cross-field matching
>           • Primary identifier matching
>           • Other?
>         • Read and write against "collections"
>         • Create the code list based on the data being reconciled ("get or create")
>         • Confidence level for matches
>         • Some control over confidence level ("give me the first match over 80% confidence")
>         • Hook into an array of data stores to match against, possibly mapped to "collections"
>           • Web services (example: OpenCorporates)
>           • CSV (hosted somewhere)
>           • Other databases (connection with credentials)?
>         • Make higher level abstractions out of multiple data sources
>           • Example: automate the creation of a geo lookup service by mapping OCD division IDs (https://github.com/opencivicdata/ocd-division-ids) onto data from GeoNames (??)
>         • Simple, modern web client for user-driven reconciliation of data
>
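
The "confidence level" and "get or create" bullets above could be sketched
roughly as follows.  This is only an illustration under stated assumptions:
a collection is modelled as a plain list of canonical strings, confidence
is a difflib similarity ratio, and the function names are hypothetical
rather than part of Nomenklatura or any other tool.

```python
from difflib import SequenceMatcher


def confidence(a, b):
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def first_match(value, collection, threshold=0.8):
    """Return (entry, confidence) for the first entry that scores at or
    above `threshold`, or None if no entry does."""
    for entry in collection:
        c = confidence(value, entry)
        if c >= threshold:
            return entry, c
    return None


def get_or_create(value, collection, threshold=0.8):
    """Return the matching canonical entry, or add `value` to the
    collection as a new entry when nothing is confident enough."""
    match = first_match(value, collection, threshold)
    if match is not None:
        return match[0]
    collection.append(value)
    return value
```

Returning the first entry over the threshold, rather than the best one,
mirrors the wording of the bullet; a real service would more likely rank
all candidates first.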