[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Sat Jul 9 15:33:49 UTC 2011

What Peter said goes for me as well.

On Sat, Jul 9, 2011 at 5:55 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>
> On Sat, Jul 9, 2011 at 1:46 PM, Richard Littauer
> <richard.littauer at gmail.com> wrote:
>>
>> Hey Peter, Jack, and Andy,
>> I think you all make really good points. I'm not a developer or a coder,
>> so I don't have much to offer in the way of personal comments on them -
>> shall I circle your emails back into the Ecolog discussion and see if anyone
>> responds with interesting comments? It's good to keep Open discussions
>> flowing, I feel, as that's one of the main ways that advancement can happen.
>> Let me know.
>
> You can repost anything you like - it's on the public website. I was writing
> for the benefit of the OKF list.
>
>>
>> Richard
>>
>> On Sat, Jul 9, 2011 at 12:56 AM, Peter Murray-Rust <pm286 at cam.ac.uk>
>> wrote:
>>>
>>>
>>> On Fri, Jul 8, 2011 at 2:19 PM, Richard Littauer
>>> <richard.littauer at gmail.com> wrote:
>>>>
>>>> Hey Open Scientists,
>>>> I don't know if any of you are subscribed to the ECOLOG-L, but this
>>>> thread came up over the past few weeks and it was a pretty interesting read
>>>> about publishing code and data together with research. I'm not quite sure
>>>> where to go with this - I haven't responded - but it might be worth plugging
>>>> the work of the OKF and the Open Science list as a response? Let me know
>>>> what you think and I'll do it.
>>>>
>>>> That's all,
>>>> Richard Littauer
>>>>
>>> Thanks Richard
>>>
>>> I disagree with the poster below. It is possible to build systems which
>>> are reproducible - not always but
>>>>
>>>> Hi Ruvan,
>>>>
>>>>
>>>>
>>>> Generally, publishing academic code is a good idea, but publishing real
>>>> code isn’t feasible.  If the code is small enough to publish, it is merely
>>>> an aspect of the full requirements for linguistic interaction.  But even
>>>> programs of less than a hundred thousand lines simply don’t have that much
>>>> functionality.
>>>
>>> If you try to build everything into a monolith this may be true, but if
>>> the functionality is distributed into libraries then the problem gets much
>>> easier.  Build and continuous-integration tools such as Maven (mainly Java)
>>> and Hudson/Jenkins can manage builds of a million lines.
>>>>
>>>>
>>>>
>>>> However, publishing code snippets, like the Link Grammar developers did,
>>>> would be very useful.  In the LG case, the publications included a very
>>>> clear exposition of how constraint propagation can be applied to simple
>>>> context free grammars to cover, at most, ten percent of the kind of
>>>> linguistic conversations that are required.  But the abstraction is limited
>>>> to parsing, not to linguistic interaction.  That made the complexity of the
>>>> ideas match the originality of the published procedures.
>>>
>>> I don't know the background, but I do work in computational linguistics
>>> (for chemistry). There are some reasonable libraries (GATE, OpenNLP, NLTK,
>>> etc.) but they don't interoperate well.
>>>
>>> A major problem is corpora and here the major publishers are extremely
>>> uncooperative. They have consistently refused to let me and others use
>>> academic material, and are therefore guilty of holding this part of science
>>> back.
>>>>
>>>>
>>>>
>>>> I agree that publishing such abstraction snippets is a good idea, but
>>>> only for appropriate levels of detail.  Beyond that, it gets unreasonably
>>>> complicated for others to learn from effectively.
>>>>
>>>>
>>>>
>>>> Abstractions teach.  But full code publication merely confuses.
>>>
>>> This is defeatism. A well-defined architecture should be able to scale to
>>> significant problems. There is a cost - it takes many person-years to
>>> develop the libraries and support tools. So I could accept that "we do not
>>> have the money to create good code", but not that it can't be done.
>>>
>>>>
>>>> So it should be a question of which papers should contain published
>>>> algorithms to demonstrate simple slices of a full system.
>>>
>>> I've heard this argument repeatedly. It often comes from those whose
>>> motivation is to publish academic papers rather than distribute their code
>>> for others. Algorithms are only one part of code - the rest is data and
>>> integration. There are massive code bases which aren't distributable, but a
>>> great deal of scientific computation is.
>>>
>>>> -Rich
>>>>
>>>>
>>>
>>>
>>> --
>>> Peter Murray-Rust
>>> Reader in Molecular Informatics
>>> Unilever Centre, Dep. Of Chemistry
>>> University of Cambridge
>>> CB2 1EW, UK
>>> +44-1223-763069
>>
>