[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Peter Murray-Rust pm286 at cam.ac.uk
Sat Jul 9 12:55:05 UTC 2011


On Sat, Jul 9, 2011 at 1:46 PM, Richard Littauer <richard.littauer at gmail.com> wrote:

> Hey Peter, Jack, and Andy,
>
> I think you all make really good points. I'm not a developer or a coder, so
> I don't have much to offer in the way of personal comments on them - shall I
> circle your emails back into the Ecolog discussion and see if anyone
> responds with interesting comments? It's good to keep Open discussions
> flowing, I feel, as that's one of the main ways that advancement can happen.
> Let me know.
>

You can repost anything you like; it's on the public website. I was writing
for the benefit of the OKF list.


>
> Richard
>
>
> On Sat, Jul 9, 2011 at 12:56 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>>
>>
>> On Fri, Jul 8, 2011 at 2:19 PM, Richard Littauer <
>> richard.littauer at gmail.com> wrote:
>>
>>> Hey Open Scientists,
>>>
>>> I don't know if any of you are subscribed to the ECOLOG-L, but this
>>> thread came up over the past few weeks and it was a pretty interesting read
>>> about publishing code and data together with research. I'm not quite sure
>>> where to go with this - I haven't responded - but it might be worth plugging
>>> the work of the OKF and the Open Science list as a response? Let me know
>>> what you think and I'll do it.
>>>
>>> That's all,
>>> Richard Littauer
>>>
>> Thanks Richard
>>
>> I disagree with the poster below. It is possible to build systems which
>> are reproducible - not always but
>>
>>>
>>> Hi Ruvan,
>>>
>>> Generally, publishing academic code is a good idea, but publishing real
>>> code isn’t feasible.  If the code is small enough to publish, it is merely
>>> an aspect of the full requirements for linguistic interaction.  But even
>>> programs of less than a hundred thousand lines simply don’t have that much
>>> functionality.
>>>
>>
>> If you try to build everything into a monolith, this may be true, but if
>> the functionality is distributed into libraries then the problem gets much
>> easier.  Systems such as Maven (mainly Java) and Hudson/Jenkins allow builds
>> of a million lines.
>>
>>>
>>> However, publishing code snippets, like the Link Grammar developers did,
>>> would be very useful.  In the LG case, the publications included a very
>>> clear exposition of how constraint propagation can be applied to simple
>>> context free grammars to cover, at most, ten percent of the kind of
>>> linguistic conversations that are required.  But the abstraction is limited
>>> to parsing, not to linguistic interaction.  That made the complexity of the
>>> ideas match the originality of the published procedures.
>>>
>>
>> I don't know the background, but I do work in computational linguistics
>> (for chemistry). There are some reasonable libraries (GATE, OpenNLP, NLTK,
>> etc.) but they don't interoperate well.
>>
>> A major problem is corpora and here the major publishers are extremely
>> uncooperative. They have consistently refused to let me and others use
>> academic material, and are therefore guilty of holding this part of science
>> back.
>>
>>>
>>> I agree that publishing such abstraction snippets is a good idea, but
>>> only for appropriate levels of detail.  Beyond that, it gets unreasonably
>>> complicated for others to learn from effectively.
>>>
>>> Abstractions teach.  But full code publication merely confuses.
>>>
>>
>> This is defeatism. A well-defined architecture should be able to scale to
>> significant problems. There is a cost: it takes many person-years to
>> develop the libraries and support tools. So I could accept that "we do not
>> have the money to create good code", but not that it can't be done.
>>
>>
>>> So it should be a question of which papers should contain published
>>> algorithms to demonstrate simple slices of a full system.
>>>
>>
>> I've heard this argument repeatedly. It often comes from those whose
>> motivation is to publish academic papers rather than distribute their code
>> for others. Algorithms are only one part of code - the rest is data and
>> integration. There are massive code bases which aren't distributable but a
>> great deal of scientific computation is.
>>
>>> -Rich
>>>
>>>
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>>
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069