[Open-contentmining] Introduction and some links

Geoffrey Bilder gbilder at crossref.org
Fri Dec 13 10:58:07 UTC 2013

Hi, thought I'd kick off by pointing at a few initiatives that we've started at CrossRef and that we hope will help with text and data mining (TDM).

The first is Prospect, an effort of ours to enable the DOI to be used as the basis for a standard, cross-publisher API for retrieving the full text of content for the purposes of TDM. Prospect is designed to address three key technical and logistical issues that are hindering TDM:

	• All parties would benefit from support of standard APIs and data representations in order to enable TDM across both open access and subscription-based publishers.
	• Researchers find it impractical to negotiate multiple bilateral agreements with subscription-based publishers in order to get authorisation to text and data mine subscribed content.
	• Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with researchers and institutions in order to authorise TDM of subscribed content.

We are trying to address these issues by providing:

	• A “Common API” that can be used by researchers to access the full text of content identified by CrossRef DOIs across publisher sites, regardless of their business model.
	• An optional “License Click-Through Service” that can be used by researchers and publishers as an efficient mechanism to provide “click-through” agreement to proprietary TDM licenses.

Both components are free to use by researchers and the public.

Obviously, beyond the technical and logistical issues, there are policy issues that are making TDM difficult. CrossRef does not involve itself in policy issues, but we note that the technical, logistical and policy issues are often conflated. We hope that, by resolving (or at least reducing) the technical and logistical issues, we can allow others to focus on the more substantive policy issues.

You can find more detailed info about Prospect here. 


Note these are working documents and we are looking for feedback, comments, etc.

Prospect was a pilot and is now moving into production. Obviously, it will only really work once publishers start populating the relevant metadata (resource links and license info). We expect that publishers will start to populate this data in earnest over the next two quarters, as they will need to do so in order to show they are meeting the requirements of the OSTP memo. We are therefore encouraging publishers to start populating this metadata ASAP.
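To illustrate why the deposited resource links and license info matter to a text miner, here is a minimal Python sketch. It assumes a metadata record has already been retrieved for a DOI and parsed into a dictionary. The record shape (the "license" and "link" keys and their "URL" fields), the function name minable_links, the DOI and the example.org URLs are all hypothetical placeholders chosen for illustration; they are not taken from the Prospect documentation.

```python
from typing import Any


def minable_links(record: dict[str, Any], accepted_licenses: set[str]) -> list[str]:
    """Return the record's full-text URLs if it carries an accepted license.

    The record layout is an assumption made for this sketch: a "license"
    list and a "link" list, each holding dictionaries with a "URL" key.
    """
    # Collect every license URL the publisher deposited for this item.
    license_urls = {lic["URL"] for lic in record.get("license", [])}

    # No deposited license that the researcher has accepted: nothing to mine.
    if not license_urls & accepted_licenses:
        return []

    # Otherwise hand back the deposited full-text resource links, in order.
    return [link["URL"] for link in record.get("link", [])]


# Placeholder record; the DOI and URLs are invented for the example.
example_record = {
    "DOI": "10.5555/12345678",
    "license": [{"URL": "http://creativecommons.org/licenses/by/3.0/"}],
    "link": [
        {"URL": "http://example.org/fulltext/12345678.xml"},
        {"URL": "http://example.org/fulltext/12345678.pdf"},
    ],
}

if __name__ == "__main__":
    accepted = {"http://creativecommons.org/licenses/by/3.0/"}
    for url in minable_links(example_record, accepted):
        print(url)
```

A record with no deposited license or link metadata yields an empty list here, which is the practical reason the service depends on publishers populating both fields.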


It would be great, in particular, if open access publishers followed this best practice quickly. 

We also have a tranche of API work that we are doing in order to support funding agencies, but which we suspect will also be critical to text and data miners. This API is in the early stages of development and also depends on relevant metadata being deposited. We would appreciate your feedback as we develop it, since TDM users will be an important use case. Again, please note that although this documentation is geared toward funding agencies, the API itself is a general API and will be free to use, just like most of our other APIs. As is the case with all of our free APIs, you can basically do what you like with the data you retrieve from them.


These APIs are all based on the engine that drives CrossRef Metadata Search here:


And FundRef search here:


Finally, we have a host of experiments that you might be interested in on our CrossRef Labs site. Some of the experiments are dormant, but if we see renewed interest in them, they can always be revived and re-prioritized.


