[Open-access] [GOAL] Re: Re: Fight Publishing Lobby's Latest "FIRST" Act to Delay OA - Nth Successor to PRISM, RWA etc.

Fri Nov 29 15:27:38 UTC 2013

On Fri, Nov 29, 2013 at 2:32 PM, Bjoern Brembs <b.brembs at gmail.com> wrote:
> On Friday, November 22, 2013, 5:28:12 PM, Stevan Harnad wrote:
>> What's OA is OA, and harvesters, including PMC, will find them.
>
> That's what I want to do - on a global scale.
That would be nice indeed, but the problems are that
(a) finding a publication on a site other than the publisher's does
not necessarily mean that the file is there legally, and it is not
even easy to determine (let alone algorithmically) whether that is
the case;
(b) mining those files may be prohibited by the copyright holder's
terms and conditions;
(c) even in cases where licensing information is available, it is
often not correct.

Problem (b) exists even for most of PubMed Central, and problem (c)
even for its open subset [1]. Problem (a) does not affect PMC (they
have been very careful there), but it does affect many other places
(e.g. authors' websites).

Instead of crawling for the publications themselves, it may be less
problematic from a legal point of view to just have a platform that
aggregates metadata of publications, along with a link to a legal copy
(green or gold). The problems here are
(d) the metadata is often missing from the file(s) of the
publication, or only incompletely present;
(e) in some cases, even the metadata may not be licensed for such
programmatic access (I do not actually know of such a case right now,
but I think I saw one somewhere - please chime in if anyone knows
details);
(f) the official repositories are far from interoperable;
(g) other repositories (e.g. authors' websites) are far too
short-lived to make harvesting them a useful endeavor for anyone
smaller than Google Scholar (and even repositories fail at that on
occasion[2]).

For these reasons, I think it is best to develop crawling
infrastructure around the clearly licensed literature first (which is
a rather small subset at present), and to use that as a basis for
exploring options to broaden access more widely. That's the approach
we have taken with the Open Access Media Importer, which only looks at
articles under CC BY and CC0 and puts their multimedia files onto
Wikimedia Commons[3]. The obvious next step is to think about ways to
aggregate the openly licensed literature (full text plus supplements)
in some way, ideally beyond the scope of biomedicine. We will do some
tests in this direction with Wikisource.
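
To make the "clearly licensed literature first" idea concrete, here is
a minimal sketch in Python of such a license filter. It is not the
Open Access Media Importer's actual code; the record layout and the
"license" field name are assumptions made up for illustration, and it
takes the stated license URL at face value (so it does nothing about
problem (c)).

```python
from urllib.parse import urlparse

# Path prefixes on creativecommons.org that this sketch accepts:
# CC BY (any version) and the CC0 public domain dedication.
ALLOWED_PREFIXES = ("/licenses/by/", "/publicdomain/zero/")


def is_openly_licensed(license_url):
    """Return True only if the URL points at a CC BY or CC0 license."""
    if not license_url:
        return False
    parsed = urlparse(license_url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "creativecommons.org":
        return False
    path = parsed.path.lower()
    if not path.endswith("/"):
        path += "/"
    # "/licenses/by-nc/..." does not match "/licenses/by/", so
    # restricted variants (NC, ND, SA) are rejected here.
    return path.startswith(ALLOWED_PREFIXES)


def filter_records(records):
    """Keep only records whose (hypothetical) 'license' field qualifies."""
    return [r for r in records if is_openly_licensed(r.get("license"))]
```

Anything without a recognisable license URL is dropped rather than
guessed at, which keeps the crawler on the safe side of problems (a)
and (b).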

One way to soften problems (a)-(c) is to signal, for any cited
reference, whether it can be read, mined, copied etc. The NISO
Workgroup on Open Access Metadata and Indicators[4] is working on
that, and a similar system is currently being explored for references
cited on Wikipedia[5]; the latter would benefit from wider community
input.
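
As a rough illustration of what such a per-reference signal could look
like, here is a small Python sketch. It is not the NISO or Wikipedia
scheme; the class name, the short license labels and the lookup table
are all hypothetical. The one point it tries to make is that "unknown"
should be kept distinct from "not allowed".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReuseSignal:
    # None means "unknown", which is not the same as "not allowed".
    read: Optional[bool]
    mine: Optional[bool]
    copy: Optional[bool]


UNKNOWN = ReuseSignal(None, None, None)

# Hypothetical lookup table; the labels are illustrative only.
SIGNALS = {
    "cc-by": ReuseSignal(True, True, True),
    "cc0": ReuseSignal(True, True, True),
}


def signal_for(license_label):
    """Map a license label to a signal, falling back to UNKNOWN."""
    if license_label is None:
        return UNKNOWN
    return SIGNALS.get(license_label.strip().lower(), UNKNOWN)


def render(signal):
    """Format a signal as a short text badge for a reference list."""
    def mark(value):
        return "?" if value is None else ("yes" if value else "no")
    return "read: {}, mine: {}, copy: {}".format(
        mark(signal.read), mark(signal.mine), mark(signal.copy))
```

For example, render(signal_for("CC-BY")) gives a badge with three
"yes" entries, while an unrecognised label yields three question
marks instead of a guess.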

Cheers,

Daniel

[1] http://www.ncbi.nlm.nih.gov/books/NBK159964/
[2] http://scholar.google.de/scholar?hl=en&q=10141%2F64034&btnG=&as_sdt=1%2C5&as_sdtp=
[3] http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot
[4] http://www.niso.org/workrooms/oami/
[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness