[okfn-labs] okfn-labs Digest, Vol 57, Issue 1
Om Goeckermann
om at standbytaskforce.com
Tue Feb 23 14:57:46 UTC 2016
It’s not Open, but Watson may be useful. There may be some free resources to test out.
> On Jan 15, 2016, at 2:13 AM, okfn-labs-request at lists.okfn.org wrote:
>
> Send okfn-labs mailing list submissions to
> okfn-labs at lists.okfn.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> or, via email, send a message with subject or body 'help' to
> okfn-labs-request at lists.okfn.org
>
> You can reach the person managing the list at
> okfn-labs-owner at lists.okfn.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of okfn-labs digest..."
>
>
> Today's Topics:
>
> 1. Entity reconciliation services (Paul Walsh)
> 2. Re: Entity reconciliation services (Tom Morris)
> 3. Re: Entity reconciliation services (Paul Walsh)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 14 Jan 2016 17:21:29 +0200
> From: Paul Walsh <paulywalsh at gmail.com>
> To: okfn-labs <okfn-labs at lists.okfn.org>
> Subject: [okfn-labs] Entity reconciliation services
> Message-ID: <215E787C-4D0B-4880-A898-564B76FBC1BA at gmail.com>
> Content-Type: text/plain; charset=utf-8
>
> There has been recent discussion around OpenSpending and OpenTrials (two projects at Open Knowledge International) on the need for a solid and well-featured entity reconciliation service.
>
> The service would help applications which depend on reference data, from country lists and company lists to budget classifications. Examples of inputs would be messy source data about party donations, procurement awards, or medicine names.
>
> The service would provide support for de-duplication and re-classification of source data dimensions against the canonical reference data; and it would allow the construction of canonical lists from messy source data.
>
> Such a service would be generally useful to the wider open data community, and in initial discussion between Friedrich Lindenberg, Mark Brough and Paul Walsh, we came to some shared understanding of what a service might look like at a high level.
>
>
> To learn more about how others have approached this problem, we're putting out a call: we are looking for existing work to build on, in particular open-source tools for reference data. Is there open source code out there that meets many or all of our criteria? If no existing solution can be found, we will hack on Nomenklatura (https://github.com/pudo/nomenklatura) to push it in this direction.
>
> Features:
>
> * Reconciliation endpoints for particular "collections"
>   * Geographical
>   * Budget taxonomies
>   * Companies
> * Namespacing of data
>   * "collections" is a type of namespacing
>   * but collections need (?) additional context, such as geographical context for company names
> * Distinct reconciliation strategies (possibly exposed as distinct methods of the API)
>   * Fuzzy, cross-field matching
>   * Primary identifier matching
>   * Other?
> * Read and write against "collections"
>   * Create the code list based on the data being reconciled ("get or create")
> * Confidence level for matches
>   * Some control over confidence level ("give me the first match over 80% confidence")
> * Hook into an array of data stores to match against, possibly mapped to "collections"
>   * Web services (example: OpenCorporates)
>   * CSV (hosted somewhere)
>   * Other databases (connection with credentials)?
> * Make higher-level abstractions out of multiple data sources
>   * Example: automate the creation of a geo lookup service by mapping ocd division ids (https://github.com/opencivicdata/ocd-division-ids) onto data from genomes (??)
> * Simple, modern web client for user-driven reconciliation of data
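To make two of the items above concrete ("give me the first match over 80% confidence" and "get or create"), here is a minimal sketch in Python. It is not Nomenklatura's API or any existing service's. The function names, the in-memory list standing in for a "collection", and the use of the standard library's difflib ratio as the confidence score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two trimmed, case-folded strings."""
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


def first_match(value, collection, threshold=0.8):
    """Return (canonical_name, score) for the first entry at or above threshold, else None."""
    for name in collection:
        score = similarity(value, name)
        if score >= threshold:
            return name, score
    return None


def get_or_create(value, collection, threshold=0.8):
    """Return the matching canonical name, or add the cleaned value as a new entry."""
    match = first_match(value, collection, threshold)
    if match is not None:
        return match[0]
    cleaned = value.strip()
    collection.append(cleaned)  # hypothetical "create" step: grow the code list
    return cleaned


# Hypothetical collection of canonical country names.
countries = ["Germany", "France", "United Kingdom"]
print(get_or_create("germany ", countries))  # reconciles against an existing entry
print(get_or_create("Israel", countries))    # nothing close enough, so it is added
```

A real service would presumably keep a review step before "create", since a threshold this simple will sometimes add near-duplicates to the code list.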
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 14 Jan 2016 12:29:01 -0500
> From: Tom Morris <tfmorris at gmail.com>
> To: Paul Walsh <paulywalsh at gmail.com>
> Cc: okfn-labs <okfn-labs at lists.okfn.org>
> Subject: Re: [okfn-labs] Entity reconciliation services
> Message-ID:
> <CAE9vqEEEkMcrtz5RKrO29A5=zizqBgkiy6LjpPO+B8x8UWOFKw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Is it safe to assume that you've already reviewed:
>
> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-Api
> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
>
> The feature list is a bit terse to decipher without the context of the
> discussion that generated it. Will the bullet points be expanded? Is it a
> tool, a service, or both? Some things like brokered reconciliation to
> existing reconciliation services (e.g. OpenCorporates) sound needlessly
> complex.
>
> There are plenty of open source implementations of reconciliation services,
> but the problem with all of the ones that I'm familiar with is that they
> have very primitive/simple scoring mechanisms (prefix match, edit distance,
> etc). They also typically only take a single attribute (i.e. a column in your
> spreadsheet) when you can often get much more powerful scoring using
> multiple columns (e.g. name, occupation, birth date, nationality, etc).
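The multi-column point can be sketched as a weighted average of per-field similarities. This is an illustration only, not how any of the existing reconciliation services score candidates. The field names, the weights, and the use of plain string similarity for every field (including dates, which a real scorer would compare more carefully) are assumptions.

```python
from difflib import SequenceMatcher


def field_score(a, b):
    """Similarity of two field values in 0..1; a missing value on either side scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, str(a).casefold(), str(b).casefold()).ratio()


def record_score(query, candidate, weights):
    """Weighted average of per-field similarities over the fields named in weights."""
    total = sum(weights.values())
    weighted = sum(
        w * field_score(query.get(field), candidate.get(field))
        for field, w in weights.items()
    )
    return weighted / total


def rank(query, candidates, weights):
    """Return (score, candidate) pairs sorted from best to worst."""
    scored = [(record_score(query, c, weights), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical records: the name alone favours the wrong person.
query = {"name": "Jon Smith", "birth_date": "1950-03-02", "nationality": "British"}
candidates = [
    {"name": "Jon Smith", "birth_date": "1981-11-20", "nationality": "American"},
    {"name": "John Smith", "birth_date": "1950-03-02", "nationality": "British"},
]

name_only = rank(query, candidates, {"name": 1.0})
multi = rank(query, candidates, {"name": 0.5, "birth_date": 0.3, "nationality": 0.2})
print(name_only[0][1]["birth_date"])
print(multi[0][1]["birth_date"])
```

With the name as the only attribute, the exact-name candidate wins. Once birth date and nationality contribute, the candidate agreeing on all three fields ranks first despite the spelling difference.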
>
> Another thing to consider is tabular vs textual entity identification. In
> the medical domain it's not uncommon to have textual notes that you'd like
> to identify drugs, procedures, etc in. The surrounding textual context in
> these cases provides useful information to help identify entities.
>
> Data curation is a key component, so I'm a little dubious about the
> Nomenklatura "dump your data here" approach. I think it's much more
> successful to have a dedicated curated data source whether it be
> domain-specific like MusicBrainz, IMDB, OpenCorporates (which is actually
> an aggregator of individually curated data sets from various registration
> authorities), etc or general like WikiData, Freebase, etc.
>
> Tom
>
> ------------------------------
>
> Message: 3
> Date: Fri, 15 Jan 2016 09:13:35 +0200
> From: Paul Walsh <paulywalsh at gmail.com>
> To: Tom Morris <tfmorris at gmail.com>
> Cc: okfn-labs <okfn-labs at lists.okfn.org>
> Subject: Re: [okfn-labs] Entity reconciliation services
> Message-ID: <1F109FAC-F949-4F8C-A58D-5E99781921B6 at gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Tom,
>
>> On 14 Jan 2016, at 7:29 PM, Tom Morris <tfmorris at gmail.com> wrote:
>>
>> Is it safe to assume that you've already reviewed:
>>
>> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-Api
>
> Yes.
>
>> https://developers.google.com/freebase/v1/reconciliation-overview?hl=en
>
> Hasn't this been replaced by the read-only Knowledge Graph API?
>
> https://developers.google.com/knowledge-graph/
>
> If that is not the case, please let me know. Either way, it seems like it could be a good data source for the type of thing we are seeking, but not a candidate for the thing itself.
>
>>
>> The feature list is a bit terse to decipher without the context of the discussion that generated it. Will the bullet points be expanded? Is it a tool, a service, or both?
>
> We can expand on it together, in this thread. The idea here is to get input from others who have more expertise in this area than some or all of us who started the discussion.
>
> I'm not sure what distinguishes a tool from a service here for you, but I'd say we are thinking in terms of Nomenklatura as a starting point, so:
>
> 1. An open source app
> 2. A hosted service of the same
>
>> Some things like brokered reconciliation to existing reconciliation services (e.g. OpenCorporates) sound needlessly complex.
>
> I think that is a matter of managing some complexity in a service like this, so that several apps consuming such a service do not each have to write similar code to manage that complexity.
>
>>
>> There are plenty of open source implementations of reconciliation services, but the problem with all of the ones that I'm familiar with is that they have very primitive/simple scoring mechanisms (prefix match, edit distance, etc).
>
> If there are plenty that match a decent number of things we've listed here, it would be great to hear about them.
>
>> They also typically only take a single attribute (i.e. a column in your spreadsheet) when you can often get much more powerful scoring using multiple columns (e.g. name, occupation, birth date, nationality, etc).
>
> Yes, this is pretty crucial.
>
>>
>> Another thing to consider is tabular vs textual entity identification. In the medical domain it's not uncommon to have textual notes that you'd like to identify drugs, procedures, etc in. The surrounding textual context in these cases provides useful information to help identify entities.
>
> Yes, this is clearly a need we have in OpenTrials, and for part of the work there we'll be using ContentMine ( http://contentmine.org ). In terms of how textual content can relate to a more general entity reconciliation service as we are describing here, I'm just not sure yet.
>
> Best,
>
> Paul
>
>> Data curation is a key component, so I'm a little dubious about the Nomenklatura "dump your data here" approach.
>> I think it's much more successful to have a dedicated curated data source whether it be domain-specific like MusicBrainz, IMDB, OpenCorporates (which is actually an aggregator of individually curated data sets from various registration authorities), etc or general like WikiData, Freebase, etc.
>>
>> Tom
>>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>
>
> ------------------------------
>
> End of okfn-labs Digest, Vol 57, Issue 1
> ****************************************