[okfn-labs] Entity reconciliation services
tfmorris at gmail.com
Mon Jan 18 16:30:43 UTC 2016
Once you start talking about write support, you've opened a whole 'nother
can of worms. If you just offer reconciliation services to someone else's
curated read-only data, you can avoid that whole mess.
If you look at the (former) Freebase ecosystem, the reconciliation API was
just a small piece. There was also:
- a query API to extend a data set by adding new columns based on
properties of reconciled entities after reconciliation has taken place
- a namespaced identifier reconciliation service, (somewhat) separate from
the main service, which did strict ID lookup in almost 100 different
identifier namespaces (see https://www.freebase.com/authority)
- a backend data loading service called Refinery
http://wiki.freebase.com/wiki/Refinery which allowed users to both add new
entities and add property values to reconciled entities
- a web interface to flag pairs of entities for merge and single entities
to be split, along with a voting pipeline to allow people to vote on the
flags, and an escalation process for admins to resolve conflicting votes.
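To make the reconciliation piece of that ecosystem concrete, here is a rough
Python sketch of the batch query/response shape used by Freebase-style (and
now OpenRefine-style) reconciliation services. The type ID, entity IDs, and
the canned response below are invented for illustration, not a live service:

```python
# Sketch of the wire format of a Freebase/OpenRefine-style reconciliation
# API: the client sends a batch of keyed queries and receives scored
# candidate entities per key. All IDs and scores here are made up.

def build_queries(names, type_id):
    """Build a batch reconciliation request payload for a list of names."""
    return {
        "q%d" % i: {"query": name, "type": type_id, "limit": 3}
        for i, name in enumerate(names)
    }

def best_matches(response):
    """Pick the top candidate per query, honouring the 'match' flag."""
    out = {}
    for key, body in response.items():
        candidates = body.get("result", [])
        # 'match': True means the service is confident enough to
        # auto-match; otherwise a human should review the candidates.
        sure = [c for c in candidates if c.get("match")]
        out[key] = sure[0] if sure else (candidates[0] if candidates else None)
    return out

payload = build_queries(["Acme Ltd", "Widget Co"], "/business/company")
# A response shaped like the one a reconciliation service would return:
fake_response = {
    "q0": {"result": [{"id": "/m/0abc", "name": "Acme Ltd",
                       "score": 98.5, "match": True}]},
    "q1": {"result": [{"id": "/m/0xyz", "name": "Widgets Company",
                       "score": 61.0, "match": False}]},
}
matches = best_matches(fake_response)
print(matches["q0"]["id"], matches["q1"]["match"])
```

The `match` flag is the important design detail: it separates "safe to
auto-accept" results from candidates that need human review.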
The data loading process was roughly:
- reconcile spreadsheet against Freebase
- map columns of spreadsheet to graph-based schema of Freebase
- do a trial upload to debug the first two steps
- when everything is good, upload to QA queue
- QA process samples the data set (at a 95% confidence level) and creates a
queue of items for multiple voters to assess the quality of the sample items
- If the sample meets the quality threshold (>99% good), the whole dataset
is loaded into Freebase
- as part of the load, each triple is tagged with its provenance including
data source, tool chain used, operator of tool, etc.
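The QA sampling step in that pipeline can be sketched with back-of-the-envelope
math: draw a sample large enough to estimate the defect rate at 95% confidence,
then gate the bulk load on the >99%-good threshold. The 1% margin of error and
the simple vote handling below are my assumptions, not the actual Freebase
parameters:

```python
import math

# Sketch of the QA gate: sample size for estimating a proportion at a
# given confidence level, and a pass/fail check against the quality bar.

def sample_size(margin=0.01, z=1.96, p=0.5):
    """Items to sample to estimate a proportion within +/- margin
    at ~95% confidence (z=1.96); p=0.5 is the worst-case variance."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def passes_qa(votes, threshold=0.99):
    """votes: list of booleans, True = voter judged the sampled item good.
    The whole dataset loads only if the sample clears the threshold."""
    good = sum(votes) / len(votes)
    return good > threshold

n = sample_size()                      # items to put in the voting queue
votes = [True] * 996 + [False] * 4    # hypothetical vote outcomes
print(n, passes_qa(votes))
```

With a 1% margin that works out to roughly 9,600 sampled items, which gives a
feel for why a paid microtask pool was needed to keep the pipeline moving.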
After the bulk data load, any individual problems found are cleaned up
using the web interface (typically merging unreconciled duplicates). There
was a *ton* of machinery, backed by a significant pool of paid microtask
labor to make that whole system work.
Wikidata has a much simpler mechanism which relies more on having a large
quantity of human labor available, but I predict that it's going to result
in lower quality until they implement a more robust infrastructure.
Maintainers of other registries such as Companies House
<https://www.gov.uk/government/organisations/companies-house> and the
Plant Names Index (IPNI) <http://www.ipni.org/> manage by having a small,
focused domain which is curated by subject matter experts.
I suppose you could model a use case for software which could be used by
IPNI et al, but it seems like a huge piece of work to bite off.
On Sun, Jan 17, 2016 at 7:24 AM, Friedrich Lindenberg <
friedrich.lindenberg at gmail.com> wrote:
> Hey Tom,
> many thanks for chiming in, really appreciate your input on this.
> The discussion we had was very much around providing a stable and
> re-usable implementation of the Recon API. Apps like OpenSpending would
> then implement a recon API client against their internal data store (e.g. a
> budget dataset), and then allow matching against various API endpoints
> (e.g. for budget taxonomies, but also OC for companies). I'd imagine that
> we’ll have to extend it to add some basic write support (i.e. “nothing
> exists to reconcile against, so go create a new entry”).
> So the question is also: who’s done really great work implementing this,
> or is it worth dusting off and polishing up nomenklatura a whole lot. Looking
> at the (amazing!) list on the OpenRefine repo, I’d actually say that may be
> worth putting some time into.
> On hosted vs. software, I’ve learned my lesson: unless there’s some
> super-solid institutional commitment, I feel hosted stuff is just weekend
> suicide. Thanks to Docker & co, running stuff has become easier, too.
> Finally, I’m trying to think through different domain models for this:
> Would love any feedback from the list!
> - Friedrich
> On 15 Jan 2016, at 20:45, Tom Morris <tfmorris at gmail.com> wrote:
> On Fri, Jan 15, 2016 at 2:13 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>> Is it safe to assume that you've already reviewed:
>> Hasn’t this been replaced by the read-only Knowledge Graph API?
>> If not, please let me know. But it seems like it could be a
>> good data source for the type of thing we are seeking, but not a candidate
>> for the thing itself.
> As far as I know, Google hasn't announced any plans to produce a KG
> version of the Reconciliation service. The only two Freebase APIs being
> carried forward (that I've heard of) are Search and Suggest.
> The pointer was more for functionality, form, and features of the service
> than as a potential candidate to use. As an aside, there's also no good
> Wikidata-based alternative available or planned, that I've heard of.
> I’m not sure what distinguishes a tool and service here for you, but I’d
>> say we are thinking in terms of Nomenklatura as a starting point, so:
>> 1. An open source app
>> 2. A hosted service of the same
> Using the same name for two different things leads to confusion and
> fuzziness (viz. CKAN). An app and a hosted service are two distinct things
> with different characteristics and requirements. For example, a piece of
> software doesn't have an SLA or a list of available datasets in the way a
> hosted service does.
>> Some things like brokered reconciliation to existing reconciliation
>> services (e.g. OpenCorporates) sound needlessly complex.
>> I think that is a matter of managing some complexity in a service like
>> this, so that several apps consuming such a service do not have to write
>> similar code to manage complexity.
> Without knowing what value the broker adds, it's hard to comment, so I'll
> reserve judgement, but suffice it to say, I'm skeptical that this will turn
> out to be a good idea.
>> There are plenty of open source implementations of reconciliation
>> services, but the problem with all of the ones that I'm familiar with is
>> that they have very primitive/simple scoring mechanisms (prefix match, edit
>> distance, etc).
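To make that critique concrete, here is a sketch of the two "primitive"
scorers mentioned (prefix match and edit distance) and the kind of match they
miss, such as an acronym for the same entity. The function names and example
strings are made up:

```python
import difflib

# The two naive scorers most open reconciliation implementations use.

def prefix_score(query, candidate):
    """Fraction of the query consumed by a shared prefix."""
    n = 0
    for a, b in zip(query.lower(), candidate.lower()):
        if a != b:
            break
        n += 1
    return n / max(len(query), 1)

def edit_score(query, candidate):
    """Similarity via an edit-distance-style ratio (difflib)."""
    return difflib.SequenceMatcher(None, query.lower(),
                                   candidate.lower()).ratio()

# A trivial spelling variant outscores an acronym of the same entity,
# even though a human would match both:
print(edit_score("Acme Ltd", "Acme Limited")
      > edit_score("IBM", "International Business Machines"))
```

Neither scorer uses type information, aliases, popularity, or context, which
is what made the Freebase matcher (and makes any good matcher) non-trivial.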
>> If there are plenty that match a decent number of things we’ve listed
>> here, it would be great to hear about them.
> There are close to twenty services listed here:
>> Another thing to consider is tabular vs textual entity identification.
>> In the medical domain it's not uncommon to have textual notes in which
>> you'd like to identify drugs, procedures, etc. The surrounding textual
>> context in these cases provides useful information to help identify them.
>> Yes, this is clearly a need we have in OpenTrials, and for part of the
>> work there we’ll be using ContentMine ( http://contentmine.org ). In
>> terms of how textual content can relate to a more general entity
>> reconciliation service as we are describing here, I’m just not sure yet.
> I would encourage you to survey available options before settling on
> something just because it's familiar and nearby. The ContentMine folks are
> keen, but there are a lot of other very knowledgeable folks working in this
> space. As just one example, the i2b2 community has hosted a half dozen NLP
> challenges in the medical domain with twenty teams
> competing in the 2014 incarnation <https://www.i2b2.org/NLP/HeartDisease/>