[open-science] feedback wanted on text-mining initiatives

Peter Murray-Rust pm286 at cam.ac.uk
Fri Apr 27 17:15:28 UTC 2012


On Fri, Apr 27, 2012 at 5:51 PM, Peter Suber <peters at earlham.edu> wrote:

> I strongly support the kind of open text-mining declaration Peter MR
> describes.
>
> Peter is exactly right in his understanding of the purpose of the BBB
> statements in the OA domain. They defined OA and set out goals. Sometimes
> they described in general terms some strategies for achieving those goals.
> (For example, the Budapest statement described what we now call green and
> gold OA as methods for delivering OA.)  But they left implementation
> details to those inspired to reach the goals, didn't dictate just one path
> to reach the goals, and tried not to say anything that might be made
> obsolete by advances in technology.
>
> And they have brilliantly stood the test of at least ten years.


> I'm far from a specialist in text mining, but I'd be happy to help the
> cause in any way that I could.
>
Thanks - I will mail separately.


> One question:  should the text-mining declaration cover data mining as
> well? Or are the issues so different that they deserve separate treatment?
>
I see them as complementary. "Data" has a degree of syntactic
accessibility - "text" does not. For example, a CSV file of numbers is
"data", while a narrative of a chemical reaction is "text". Supplemental
info is sometimes data (e.g. CIF files or Excel spreadsheets), sometimes
semi-structured text.

A key problem is the word "text". There must not be a distinction based
on how the information is held. It is inappropriate to argue that
"two grams" "twenty degrees" is text and so a "creative work" while
"2 g" "20 deg C" is factual data.

For that reason our declaration must cover:
* text
* diagrams (line drawings, graphs, spectra, networks, etc.)
* images (mainly photographic)
* audio (e.g. animal recordings)
* video (e.g. behavioural studies)

All of these can be the primary means of reporting scientific fact, yet
"images" are treated as creative works and so often rented for large
amounts of money by commercial publishers. I can remember when one
publisher REDREW my scatterplots to avoid copyright as they had been
published elsewhere. This was factual data, produced by computer. Not
surprisingly, they introduced significant distortion - and that was 30+
years ago.

P.


>      Peter S.
>
> Peter Suber
> gplus.to/petersuber
>
>
> On Fri, Apr 27, 2012 at 10:25 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>>
>>
>> On Fri, Apr 27, 2012 at 2:28 PM, Richard Kidd <KiddR at rsc.org> wrote:
>>
>>> On Fri, Apr 27, 2012 at 1:40 PM, Richard Kidd <KiddR at rsc.org> wrote:
>>>
>>> > > Among the things which we probably should not address are:
>>> > > * what can and cannot be mined and reproduced
>>>
>>>
>>>
>>> Apols for the misunderstanding, my fault for using ‘open’ in a diff
>>> context - my request is that “what can and cannot be mined and reproduced”
>>> *should* be discussed and addressed – as it’s the key issue. Am trying
>>> to get comments up on yr blog post but it’s not behaving…
>>>
>>>
>>>
>>
>> Thanks and understood. It is critical that it *is* discussed, and very
>> possibly on the OKF site.
>>
>> The point here is to create a declaration about text-mining similar to
>> Budapest/Berlin/Bethesda [for Open Access]. They deliberately do not go
>> into details, but state a goal that can later be reified in law and
>> practice. They state what "Open Access" means in general terms. Phrases
>> like "for whatever purpose", "everybody", "without further permission".
>> They do NOT state that there should be a licence - licences are simply one
>> way of implementing them.
>>
>> Let us call the Declaration of textmining the "Open Text Mining
>> Declaration". (It's slightly but not very contaminated by NPG's "Open
>> Text-mining Initiative" which most people have forgotten.) It should be
>> brief - perhaps 2-3 lines at most. It would define Open Text-mining...
>>
>> "By Open textmining we mean ... everyone ... without further permission
>> ... available to all ...".
>>
>> That does not mean that everyone must agree to do it. It is a goal. The
>> BBB declarations are not yet implemented universally. But they are the
>> yardstick that most of us use. They are particularly useful because so many
>> people and organisations create their own usage for "Open Access" without
>> defining it - thus causing confusion. We wish to avoid this for this new
>> field.
>>
>> The details come second, and change as the world and technology change.
>> It is generally agreed that CC-BY licences permit text- and other mining
>> without further permission. Contrast that with almost everything else where
>> nothing is clear.
>>
>> If 20 scientists per university wish to text-mine that means 1000
>> universities * 20 scientists * 100 publishers == potentially 2 million
>> requests. The system can't cope. So the only ways forward are:
>> * refuse everything. There seem to be publishers who take this view. It
>> has the virtue of clarity
>> * permit everything. There are certainly publishers (BMC/PLoS) who take
>> this view
>> * leave everything unclear. "consult your librarian" "we'll discuss this
>> with our marketing people". That's the position for most publishers.
>>
>> Fuzziness is destroying scientific progress and creating tensions. The
>> OTMD is an attempt to bring some clarity. Whether any given publisher
>> accepts, rejects or ignores it is irrelevant to the wording of the
>> declaration.
>>
>> P.
>>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069