[open-science] query regarding language in permission to mine - with clause

Peter Murray-Rust pm286 at cam.ac.uk
Wed Jul 25 17:23:21 UTC 2012


On Wed, Jul 25, 2012 at 5:54 PM, Diane Cabell <dc at icommons.org> wrote:

> Thanks Peter and Cameron.  Can you bear to educate me even further?
>
> Sec 4.2 of the JISC report on Value and benefits of text mining <http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx> gave
> an example of 2930 full-text articles in which the word "malaria" appears.
> Am I wrong to think that a miner would have to cite all 2930?  Or, if the
> mining finds a pattern that appears in only 900 of them, must they cite
> all 900?  What is the 'proximate' source in such cases: an Elsevier
> database of journal articles?
> dc
>
>
A similar problem occurs in data analysis. Let's say someone analyses the
sizes of molecules from the Protein Data Bank and uses about 25,000 entries.
For the purposes of SCIENCE they should report the identifiers for the
proteins that fall into a particular category (e.g. "we found 23 proteins
with holes in them: ABC1, DEF2, ECH1, ..."). But if you started with 24,758
entries no-one would cite all of them. And they report the identifiers
primarily for fidelity of the science.

With papers, more "value" is currently given to citations (many of us want
to change the balance). But imagine you are listing the spellings of (say)
haemoglobine in the literature - a valuable thing to do for text-mining.
You wouldn't say "in 24,123 papers it was spelt hemoglobin" and then cite
all of them. But you might say "in 10 papers it was spelt haemoglobine" and
then cite them, as that might be the interesting bit. That's a poor example -
could do better.

So if there are more than X papers then I would only cite the details if it
was required for scientific reproducibility. I don't know what X is - it
will depend. In any case, citation of such bulk resources will almost
always be through identifiers.

>
> On Jul 19, 2012, at 4:18 PM, cameronneylon.net wrote:
>
> Yes, this bothers me for the reason you state. It doesn't feel like a good
> fit for what I feel is "the right thing to do".
>
> It seems to me that the expected community norm would be to cite the
> proximate source (i.e. it is reasonable to cite the immediate source of
> data in some form - but there isn't an expectation to cite "deeper"
> sources). Where this is a large set that is more challenging but it would
> still be good practice anyway to maintain a list of sources, even if not
> with the primary distribution. This is just good provenance information for
> the derivative work.
>
> Could something work along the lines of:
>
> "Where a licenced work is used as input for the purposes of text, data or
> for other information mining processes, attribution may be complex or
> challenging. The licensee shall make all reasonable efforts to clearly
> attribute the immediate (proximate? - is there a term for this?) source or
> sources of data. It is reasonable for the purposes of this licence for such
> attribution to be made available as a separate document or work from the
> distributed product of said mining. Where systems exist to track
> attribution and citation licensee shall make all reasonable efforts to
> provide notification to such systems when their means of satisfying the
> attribution requirement differs from that for single derivative works."
>
> IANAL!
>
>
>
> On 19 Jul 2012, at 19:44, Diane Cabell wrote:
>
> Apologies!  The relevant clause is:
>
>
> "The [requirement of attribution] shall not apply to the products of text,
> data and other information mining processes where the Work or portions
> thereof that appear in the mined product are not identifiable by their
> original source."
>
>
>
>
> On Jul 19, 2012, at 10:37 AM, Diane Cabell wrote:
>
>
> This is the draft of a clause that would waive attribution when mining.
>
>
> Do any of you have any thoughts on whether this language is reasonable for
> the purpose?  Is attribution stacking a problem that truly needs to be
> addressed?
>
>
> I am a bit concerned that it leaves a possible loophole allowing the miner
> to intentionally omit any identifiable characteristics simply to avoid
> attribution.  Is that an inescapable problem?
>
>
> Any advice would be appreciated.
>
>
> Diane Cabell
>
> iCommons Ltd
>
> Creative Commons
>
> OeRC
>
> _______________________________________________
>
> open-science mailing list
>
> open-science at lists.okfn.org
>
> http://lists.okfn.org/mailman/listinfo/open-science
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069