[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Fri Jul 8 23:56:40 UTC 2011

On Fri, Jul 8, 2011 at 2:19 PM, Richard Littauer <richard.littauer at gmail.com> wrote:

> Hey Open Scientists,
>
> I don't know if any of you are subscribed to the ECOLOG-L, but this thread
> came up over the past few weeks and it was a pretty interesting read about
> publishing code and data together with research. I'm not quite sure where to
> go with this - I haven't responded - but it might be worth plugging the work
> of the OKF and the Open Science list as a response? Let me know what you
> think and I'll do it.
>
> That's all,
> Richard Littauer
>
Thanks, Richard.

I disagree with the poster below. It is possible to build systems which are
reproducible - not always but

>
> Hi Ruvan,
>
> Generally, publishing academic code is a good idea, but publishing real
> code isn’t feasible.  If the code is small enough to publish, it is merely
> an aspect of the full requirements for linguistic interaction.  But even
> programs of less than a hundred thousand lines simply don’t have that much
> functionality.
>

If you try to build everything into a monolith this may be true, but if the
functionality is distributed into libraries then the problem gets much
easier.  Build and continuous-integration systems such as Maven (mainly
Java) and Hudson/Jenkins make builds of a million lines manageable.

>
> However, publishing code snippets, like the Link Grammar developers did,
> would be very useful.  In the LG case, the publications included a very
> clear exposition of how constraint propagation can be applied to simple
> context free grammars to cover, at most, ten percent of the kind of
> linguistic conversations that are required.  But the abstraction is limited
> to parsing, not to linguistic interaction.  That made the complexity of the
> ideas match the originality of the published procedures.
>

I don't know the background, but I do work in computational linguistics (for
chemistry). There are some reasonable libraries (GATE, OpenNLP, NLTK, etc.)
but they don't interoperate well.

A major problem is corpora and here the major publishers are extremely
uncooperative. They have consistently refused to let me and others use
academic material, and are therefore guilty of holding this part of science
back.

>
> I agree that publishing such abstraction snippets is a good idea, but only
> for appropriate levels of detail.  Beyond that, it gets unreasonably
> complicated for others to learn from effectively.
>
> Abstractions teach.  But full code publication merely confuses.
>

This is defeatism. A well-defined architecture should be able to scale to
significant problems. There is a cost: it takes many person-years to
develop the libraries and support tools. So I could accept that "we do not
have the money to create good code", but not that it can't be done.

> So it should be a question of which papers should contain published
> algorithms to demonstrate simple slices of a full system.
>

I've heard this argument repeatedly. It often comes from those whose
motivation is to publish academic papers rather than distribute their code
for others. Algorithms are only one part of code - the rest is data and
integration. There are massive code bases which aren't distributable but a
great deal of scientific computation is.

> -Rich
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069