[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Richard Littauer richard.littauer at gmail.com
Sat Jul 9 12:46:17 UTC 2011


Hey Peter, Jack, and Andy,

I think you all make really good points. I'm not a developer or a coder, so
I don't have much to offer in the way of personal comments on them - shall I
circle your emails back into the Ecolog discussion and see if anyone
responds with interesting comments? It's good to keep Open discussions
flowing, I feel, as that's one of the main ways that advancement can happen.
Let me know.

Richard

On Sat, Jul 9, 2011 at 12:56 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

>
>
> On Fri, Jul 8, 2011 at 2:19 PM, Richard Littauer <
> richard.littauer at gmail.com> wrote:
>
>> Hey Open Scientists,
>>
>> I don't know if any of you are subscribed to the ECOLOG-L, but this thread
>> came up over the past few weeks and it was a pretty interesting read about
>> publishing code and data together with research. I'm not quite sure where to
>> go with this - I haven't responded - but it might be worth plugging the work
>> of the OKF and the Open Science list as a response? Let me know what you
>> think and I'll do it.
>>
>> That's all,
>> Richard Littauer
>>
> Thanks Richard
>
> I disagree with the poster below. It is possible to build systems which are
> reproducible - not always but
>
>>
>> Hi Ruvan,
>>
>> Generally, publishing academic code is a good idea, but publishing real
>> code isn’t feasible.  If the code is small enough to publish, it is merely
>> an aspect of the full requirements for linguistic interaction.  But even
>> programs of less than a hundred thousand lines simply don’t have that much
>> functionality.
>>
>
> If you try to build everything into a monolith this may be true, but if the
> functionality is distributed into libraries then the problem gets much
> easier.  Systems such as Maven (mainly Java) and Hudson/Jenkins allow builds
> of a million lines.
>
>>
>> However, publishing code snippets, like the Link Grammar developers did,
>> would be very useful.  In the LG case, the publications included a very
>> clear exposition of how constraint propagation can be applied to simple
>> context free grammars to cover, at most, ten percent of the kind of
>> linguistic conversations that are required.  But the abstraction is limited
>> to parsing, not to linguistic interaction.  That made the complexity of the
>> ideas match the originality of the published procedures.
>>
>
> I don't know the background, but I do work in computational linguistics
> (for chemistry). There are some reasonable libraries (GATE, OpenNLP, NLTK,
> etc.) but they don't interoperate well.
>
> A major problem is corpora and here the major publishers are extremely
> uncooperative. They have consistently refused to let me and others use
> academic material, and are therefore guilty of holding this part of science
> back.
>
>>
>> I agree that publishing such abstraction snippets is a good idea, but only
>> for appropriate levels of detail.  Beyond that, it gets unreasonably
>> complicated for others to learn from effectively.
>>
>> Abstractions teach.  But full code publication merely confuses.
>>
>
> This is defeatism. A well-defined architecture should be able to scale to
> significant problems. There is a cost: it takes many person-years to
> develop the libraries and support tools. So I could accept that "we do not
> have the money to create good code", but not that it can't be done.
>
>
>> So it should be a question of which papers should contain published
>> algorithms to demonstrate simple slices of a full system.
>>
>
> I've heard this argument repeatedly. It often comes from those whose
> motivation is to publish academic papers rather than distribute their code
> for others. Algorithms are only one part of code - the rest is data and
> integration. There are massive code bases which aren't distributable but a
> great deal of scientific computation is.
>
>> -Rich
>>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>