[okfn-labs] Entity reconciliation services

Wed Jan 20 16:06:52 UTC 2016

She has written a number of OpenRefine reconciliation services as well:

https://github.com/search?q=user%3Acmh2166+recon

Tom

On Wed, Jan 20, 2016 at 5:36 AM, Paul Walsh <paulywalsh at gmail.com> wrote:

> Here is an interesting code base someone just alerted us to at
> https://discuss.okfn.org/t/entity-reconciliation-services/1785/4
>
> https://github.com/cmh2166/GetUrRecon
>
>
> On 18 Jan 2016, at 6:30 PM, Tom Morris <tfmorris at gmail.com> wrote:
>
> Once you start talking about write support, you've opened a whole 'nother
> can of worms.  If you just offer reconciliation services to someone else's
> curated read-only data, you can avoid that whole mess.
>
> If you look at the (former) Freebase ecosystem, the reconciliation API was
> just a small piece.  There was also:
>
> - a query API to extend a data set by adding new columns based on
> properties of reconciled entities after reconciliation has taken place
> - a namespaced identifier reconciliation service (somewhat) separate from
> the main service which did strict ID lookup in almost 100 different
> identifier namespaces here https://www.freebase.com/authority
> - a backend data loading service called Refinery
> http://wiki.freebase.com/wiki/Refinery which allowed users to both add
> new entities and add property values to reconciled entities
> - a web interface to flag pairs of entities for merge and single entities
> to be split, along with a voting pipeline to allow people to vote on the
> flags, and an escalation process for admins to resolve conflicting votes.
>
> The data loading process was roughly:
> - reconcile spreadsheet against Freebase
> - map columns of spreadsheet to graph-based schema of Freebase
> - do a trial upload to debug the first two steps
> - when everything is good, upload to QA queue
> - QA process samples the data set (at a 95% confidence level) and creates a
> queue of items for multiple voters to assess the quality of the sample items
> - If the sample meets the quality threshold (>99% good), the whole dataset
> is loaded into Freebase
> - as part of the load, each triple is tagged with its provenance including
> data source, tool chain used, operator of tool, etc.
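
A rough sketch of the sampling and threshold steps in the list above. It is not the actual Freebase QA code: the sample-size formula (Cochran's, with a finite population correction), the margin of error, and all function names are assumptions chosen for illustration. Only the 95% confidence figure and the >99% threshold come from the description.

```python
import math
import random

Z_95 = 1.96  # z-score for a 95% confidence level


def sample_size(population, margin=0.01, p=0.5, z=Z_95):
    """Cochran's sample size with finite population correction (assumed method)."""
    if population <= 0:
        return 0
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def draw_sample(rows, margin=0.01, seed=None):
    """Pick the rows that would go into the voting queue."""
    rng = random.Random(seed)
    return rng.sample(rows, sample_size(len(rows), margin))


def passes_qa(votes, threshold=0.99):
    """votes: list of booleans, True when voters judged the item good."""
    if not votes:
        return False
    return sum(votes) / len(votes) > threshold
```

With these assumptions, a dataset is loaded only when `passes_qa` returns True for the voted sample; a stricter margin gives a larger sample and more voting work.
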
>
> After the bulk data load, any individual problems found are cleaned up
> using the web interface (typically merging unreconciled duplicates).  There
> was a *ton* of machinery, backed by a significant pool of paid microtask
> labor to make that whole system work.
>
> Wikidata has a much simpler mechanism which relies more on having a large
> quantity of human labor available, but I predict that it's going to result
> in lower quality until they implement a more robust infrastructure.
>
> Maintainers of other registries such as Companies House
> <https://www.gov.uk/government/organisations/companies-house> and the International
> Plant Names Index (IPNI) <http://www.ipni.org/> manage by having a small,
> focused domain which is curated by subject matter experts.
>
> I suppose you could model a use case for software which could be used by
> IPNI et al, but it seems like a huge piece of work to bite off.
>
> Tom
>
> On Sun, Jan 17, 2016 at 7:24 AM, Friedrich Lindenberg <
> friedrich.lindenberg at gmail.com> wrote:
>
>> Hey Tom,
>>
>> many thanks for chiming in, really appreciate your input on this.
>>
>> The discussion we had was very much around providing a stable and
>> re-useable implementation of the Recon API. Apps like OpenSpending would
>> then implement a recon API client against their internal data store (e.g. a
>> budget dataset), and then allow matching against various API endpoints
>> (e.g. for budget taxonomies, but also OC for companies). I'd imagine that
>> we’ll have to extend it to add some basic write support (i.e. “nothing
>> exists to reconcile against, so go create a new entry”).
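
For reference, a minimal sketch of the client side of such a Recon API exchange. It assumes the OpenRefine-style convention of a JSON-encoded `queries` parameter keyed by query id, with a `result` list returned per key. The function names, the type name, and the ids are hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_queries(names, type_id=None, limit=3):
    """Build one reconciliation query per name, keyed q0, q1, ..."""
    queries = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        queries[f"q{i}"] = query
    return queries


def encode_request(queries):
    """Form-encode the batch as a single 'queries' parameter."""
    return urlencode({"queries": json.dumps(queries)})


def best_matches(names, response):
    """Map each name to the id of its first confident match, or None."""
    matches = {}
    for i, name in enumerate(names):
        results = response.get(f"q{i}", {}).get("result", [])
        confident = [r for r in results if r.get("match")]
        matches[name] = confident[0]["id"] if confident else None
    return matches
```

A name that maps to None is the case where write support would come in: nothing exists to reconcile against, so the client would have to create a new entry.
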
>>
>> So the question is also: who’s done really great work implementing this,
>> or is it worth dusting off and polishing up nomenklatura a whole lot? Looking
>> at the (amazing!) list on the OpenRefine repo, I’d actually say that may be
>> worth putting some time into.
>>
>> On hosted vs. software, I’ve learned my lesson: unless there’s some
>> super-solid institutional commitment, I feel hosted stuff is just weekend
>> suicide. Thanks to Docker & co, running stuff has become easier, too.
>>
>> Finally, I’m trying to think through different domain models for this:
>>
>> https://github.com/pudo/nomenklatura/blob/master/DESIGN.md
>>
>> Would love any feedback from the list!
>>
>> - Friedrich
>>
>>
>>
>> On 15 Jan 2016, at 20:45, Tom Morris <tfmorris at gmail.com> wrote:
>>
>> On Fri, Jan 15, 2016 at 2:13 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>>
>>> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>>>
>>> Is it safe to assume that you've already reviewed:
>>> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
>>>
>>> Hasn’t this been replaced by the read-only Knowledge Graph API?
>>>
>>> https://developers.google.com/knowledge-graph/
>>>
>>> If not, please let me know. But it seems like it could be a
>>> good data source for the type of thing we are seeking, but not a candidate
>>> for the thing itself.
>>>
>>
>> As far as I know, Google hasn't announced any plans to produce a KG
>> version of the Reconciliation service.  The only two Freebase APIs being
>> carried forward (that I've heard of) are Search and Suggest.
>>
>> The pointer was more for functionality, form, and features of the service
>> than as a potential candidate to use. As an aside, there's also no good
>> Wikidata-based alternative available or planned, that I've heard of.
>>
>>> I’m not sure what distinguishes a tool and service here for you, but I’d
>>> say we are thinking in terms of Nomenklatura as a starting point, so:
>>>
>>> 1. An open source app
>>> 2. A hosted service of the same
>>>
>>
>> Using the same name for two different things leads to confusion and
>> fuzziness (viz. CKAN).  An app and a hosted service are two distinct things
>> with different characteristics and requirements.  For example, a piece of
>> software doesn't have an SLA or a list of available datasets in the way a
>> hosted service does.
>>
>>> Some things like brokered reconciliation to existing reconciliation
>>> services (e.g. OpenCorporates) sound needlessly complex.
>>>
>>> I think that is a matter of managing some complexity in a service like
>>> this, so that several apps consuming such a service do not have to write
>>> similar code to manage complexity.
>>>
>>
>> Without knowing what value the broker adds, it's hard to comment, so I'll
>> reserve judgement, but suffice it to say, I'm skeptical that this will turn
>> out to be a good idea.
>>
>>
>>> There are plenty of open source implementations of reconciliation
>>> services, but the problem with all of the ones that I'm familiar with is
>>> that they have very primitive/simple scoring mechanisms (prefix match, edit
>>> distance, etc).
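
To make "primitive/simple scoring" concrete, here is a small sketch of the two mechanisms named above, prefix match and edit distance. It is not taken from any of the existing services; the normalisation, the way the two scores are combined, and the function names are all assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_score(query, name):
    """Edit distance scaled to 0..1, where 1.0 means identical."""
    q, n = query.casefold().strip(), name.casefold().strip()
    longest = max(len(q), len(n))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(q, n) / longest


def prefix_score(query, name):
    """1.0 when the candidate name starts with the query, else 0.0."""
    q, n = query.casefold().strip(), name.casefold().strip()
    return 1.0 if q and n.startswith(q) else 0.0


def rank(query, candidates, limit=3):
    """Score each candidate by the better of the two measures."""
    scored = [(max(prefix_score(query, c), edit_score(query, c)), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

Scores like these look only at the name string, so they cannot use other columns or surrounding context to tell apart entities with similar names.
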
>>>
>>> If there are plenty that match a decent number of things we’ve listed
>>> here, it would be great to hear about them.
>>>
>>
>> There are close to twenty services listed here:
>> https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
>>
>>
>>> Another thing to consider is tabular vs textual entity identification.
>>> In the medical domain it's not uncommon to have textual notes that you'd
>>> like to identify drugs, procedures, etc in.  The surrounding textual
>>> context in these cases provides useful information to help identify
>>> entities.
>>>
>>> Yes, this is clearly a need we have in OpenTrials, and for part of the
>>> work there we’ll be using ContentMine ( http://contentmine.org ). In
>>> terms of how textual content can relate to a more general entity
>>> reconciliation service as we are describing here, I’m just not sure yet.
>>>
>>
>> I would encourage you to survey available options before settling on
>> something just because it's familiar and nearby.  The ContentMine folks are
>> keen, but there are a lot of other very knowledgeable folks working in this
>> space.  As just one example, the i2b2 community has hosted a half dozen NLP
>> challenges in the medical domain with twenty teams
>> <http://www.j-biomed-inform.com/article/S1532-0464(15)00140-9/fulltext>
>> competing in the 2014 incarnation
>> <https://www.i2b2.org/NLP/HeartDisease/>.
>>
>> Tom
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>>
>
>