[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Fri Jul 8 23:26:20 UTC 2011

Hi,

I think reproducibility of results is important for science. However, with physical substances, biology, and living and thinking beings, reproducing exactly the same result can be difficult if not impossible, depending on the accuracy, precision and detail of the result. Digital models that take digital data as input can, at least in theory, be developed so that identical results are generated provided the same input data is available and the computational requirements are met. However, the software programs and workflows need to be developed to ensure that results can be replicated identically, if that is wanted. Key to replicating digital results is packaging up the input data (or configuring a data feed such that the same input data is pulled) and, likewise, packaging up the software programs and workflows (or providing links to these). Developing and providing metadata about the processing, the computational resources used and the system requirements is also key for replicability. The more open everything is, the easier it is to provide all the details and allow replication of a result effectively at the press of a button.
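
As a rough illustration of the kind of metadata I mean, here is a minimal Python sketch that records checksums of the input data, the workflow steps, and basic system details in one manifest. The function names (sha256_of_bytes, build_manifest, inputs_match) and the manifest layout are hypothetical, invented for this example; they are not taken from any particular tool.

```python
import hashlib
import json
import platform
import sys


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of some input data."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(inputs: dict, workflow_steps: list) -> dict:
    """Describe a run: input checksums, workflow steps, system details.

    `inputs` maps a (hypothetical) dataset name to its raw bytes.
    """
    return {
        "inputs": {name: sha256_of_bytes(content)
                   for name, content in sorted(inputs.items())},
        "workflow": list(workflow_steps),
        "system": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


def inputs_match(manifest: dict, inputs: dict) -> bool:
    """Check that the data in hand is the data the manifest was built from."""
    current = {name: sha256_of_bytes(content)
               for name, content in inputs.items()}
    return current == manifest["inputs"]


if __name__ == "__main__":
    data = {"observations.csv": b"id,value\n1,0.5\n"}
    manifest = build_manifest(data, ["load", "fit model", "summarise"])
    # The manifest is plain JSON, so it can be published next to the results.
    print(json.dumps(manifest, indent=2, sort_keys=True))
    print(inputs_match(manifest, data))
```

Someone trying to replicate the result would first check inputs_match against the published manifest before rerunning the workflow; a mismatch shows that the input data has changed.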

In stochastic modelling, and when the processing is done in parallel and on distributed resources, replicating results can be somewhat difficult. Good simulation models should allow for results replication if required. However, often when only a trend result is wanted, the ability to replicate a result exactly can be sacrificed for computational speed. This can lead not only to faster processing, but also to the need for less metadata.
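
To make the stochastic case concrete, here is a minimal Python sketch using only the standard library. It assumes a toy random-walk model, and the names simulate_walk and run_replicates, as well as the seed-derivation rule, are made up for illustration. Each replicate gets its own generator seeded from a base seed and its index, so the result does not depend on the order in which replicates happen to run.

```python
import random


def simulate_walk(seed: int, steps: int = 1000) -> int:
    """Toy stochastic model: final position of a +/-1 random walk."""
    rng = random.Random(seed)  # private generator, no shared global state
    position = 0
    for _ in range(steps):
        position += 1 if rng.random() < 0.5 else -1
    return position


def replicate_seed(base_seed: int, index: int) -> int:
    """Derive a per-replicate seed (an arbitrary rule chosen for this sketch)."""
    return base_seed * 1_000_003 + index


def run_replicates(base_seed: int, n_replicates: int) -> list:
    """Run replicates; each depends only on (base_seed, index)."""
    return [simulate_walk(replicate_seed(base_seed, i))
            for i in range(n_replicates)]


if __name__ == "__main__":
    first = run_replicates(base_seed=2011, n_replicates=4)
    second = run_replicates(base_seed=2011, n_replicates=4)
    print(first == second)  # identical, because every seed is recorded
```

The only extra metadata needed for exact replication here is the base seed and the number of replicates. Dropping the seed (for example, seeding from the system clock) would still give the same trend over many replicates, but not the same individual numbers.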

Bye for now,

Andy
http://www.geog.leeds.ac.uk/people/a.turner/

________________________________________
From: open-science-bounces at lists.okfn.org [open-science-bounces at lists.okfn.org] On Behalf Of Jack Park [jackpark at gmail.com]
Sent: 08 July 2011 16:18
To: Richard Littauer
Cc: open-science at lists.okfn.org
Subject: Re: [open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Hi Richard,

It seems to me that there are two threads going on below: one is about
submitting code and data at submission time, which I compare to
publication time; the other appears to be a pushback against
publishing full code at all.  Putting snippets or pseudocode in full
papers is done all the time; full listings in papers seems unlikely.
Arguing against making full code available, e.g. at github, etc,
doesn't make sense.  Many people will never look at it, but making it
available is always a benefit.

I wonder whether the real argument is against any requirement that
full code be made available as a condition for publication.

Jack

On Fri, Jul 8, 2011 at 6:19 AM, Richard Littauer
<richard.littauer at gmail.com> wrote:
> Hey Open Scientists,
> I don't know if any of you are subscribed to the ECOLOG-L, but this thread
> came up over the past few weeks and it was a pretty interesting read about
> publishing code and data together with research. I'm not quite sure where to
> go with this - I haven't responded - but it might be worth plugging the work
> of the OKF and the Open Science list as a response? Let me know what you
> think and I'll do it.
>
> That's all,
> Richard Littauer
>
> ---------- Forwarded message ----------
> From: Rich Cooper <rich at englishlogickernel.com>
> Date: Mon, Jul 4, 2011 at 8:53 PM
> Subject: Re: [Corpora-List] discussion on reproducibility at ACL 2011
> business meeting
> To: Ruvan Weerasinghe <arw at ucsc.cmb.ac.lk>, tpederse at d.umn.edu
> Cc: nlpatumd at yahoogroups.com, corpora at uib.no
>
>
> Hi Ruvan,
>
>
>
> Generally, publishing academic code is a good idea, but publishing real code
> isn’t feasible.  If the code is small enough to publish, it is merely an
> aspect of the full requirements for linguistic interaction.  But even
> programs of less than a hundred thousand lines simply don’t have that much
> functionality.
>
>
>
> However, publishing code snippets, like the Link Grammar developers did,
> would be very useful.  In the LG case, the publications included a very
> clear exposition of how constraint propagation can be applied to simple
> context free grammars to cover, at most, ten percent of the kind of
> linguistic conversations that are required.  But the abstraction is limited
> to parsing, not to linguistic interaction.  That made the complexity of the
> ideas match the originality of the published procedures.
>
>
>
> I agree that publishing such abstraction snippets is a good idea, but only
> for appropriate levels of detail.  Beyond that, it gets unreasonably
> complicated for others to learn from effectively.
>
>
>
> Abstractions teach.  But full code publication merely confuses.  So it
> should be a question of which papers should contain published algorithms to
> demonstrate simple slices of a full system.
>
>
>
> -Rich
>
>
>
> Sincerely,
>
> Rich Cooper
>
> EnglishLogicKernel.com
>
> Rich AT EnglishLogicKernel DOT com
>
> 9 4 9 \ 5 2 5 - 5 7 1 2
>
> ________________________________
>
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Ruvan Weerasinghe
> Sent: Sunday, July 03, 2011 6:47 PM
> To: tpederse at d.umn.edu
> Cc: nlpatumd at yahoogroups.com; corpora at uib.no
> Subject: Re: [Corpora-List] discussion on reproducibility at ACL 2011
> business meeting
>
>
>
> Maybe we can address some of the issues raised by talking to the Biology
> (Bioinformatics) people, who seem to make publishing data and code a
> precondition for publication?
>
> Regards.
>
>
> Ruvan Weerasinghe
> University of Colombo School of Computing
> Colombo 00700,
> Sri Lanka.
>
> Web:    http://www.ucsc.lk
> Phone:  +94112158953; Fax:    +94112587239
>
> ________________________________
>
> From: "Ted Pedersen" <tpederse at d.umn.edu>
> To: corpora at uib.no
> Cc: nlpatumd at yahoogroups.com
> Sent: Sunday, July 3, 2011 11:40:05 PM
> Subject: [Corpora-List] discussion on reproducibility at ACL 2011
> business meeting
>
> Greetings all,
>
>
>
> I made a few remarks during the ACL 2011 business meeting in favor of the
> innovation this year on allowing submissions of data and code along with
> paper submissions. I suggested this is something we want to continue and
> encourage, particularly for papers submitted to the empirical track at ACL
> (which is the majority of papers these days) so that we might be able to
> reproduce results more easily. I had some slides prepared that I didn't use,
> but I've put those here that summarize part of what I said at least (I
> forgot a few points, but the gist is fairly consistent I guess...):
>
>
>
> http://www.slideshare.net/duluthted/pedersen-acl2011businessmeeting
>
>
>
> There were quite a few comments thereafter and I took a few notes, and I
> guess I thought it would be possibly useful to preserve these "for the
> record" at least, since I think that discussion raised many of the common
> concerns about this issue. It might also be an opportunity for folks to
> follow up or at least continue thinking.
>
>
>
> Below are the comments, approximately in the order made....note that I'm
> trying here to simply reproduce the gist of comments, and not offer any
> opinion on them. I think it was great there was such an extensive
> discussion, and I guess I just wanted to note that and preserve it as best I
> could. If anyone feels like they have been misquoted, forgotten, or
> misunderstood, please feel free to jump in and elaborate.
>
>
>
> 0) Speaker was in support of encouraging more submissions of code and
> data, and noted that he was happy to see quite a few presentations at ACL
> where code and data were being made available.
>
>
>
> 1) Data is sometimes expensive to create (especially speech data) and
> releasing it after one publication may not be in the best interests of the
> creators.
>
>
>
> 2) Reviewing code is time consuming (and another concern raised during the
> business meeting was reviewer overload, so this certainly fit into that
> theme).
>
>
>
> 3) It is often hard or impossible for people in industrial settings to
> release code - the licensing issues are sometimes very complex and would
> need to be resolved before any code was submitted.
>
>
>
> 4) There could be a prize offered for the best code / best data submitted.
>
>
>
> 5) It is hard to know how to review software.
>
>
>
> 6) Maybe software could be made available on an ACL cloud, in order to solve
> some licensing concerns (especially of industry)
>
>
>
> 7) Code at submission time is very hard to anonymize - maybe we
> need separate reviewers for code and data (from paper).
>
>
>
> 8) Simply releasing or submitting code isn't necessarily useful (if it is
> bad code). How do we make sure the code is of high quality and/or useful?
>
>
>
> 9) There is a tension between having new and exciting ideas and producing
> well engineered code. Put another way, there's a tension between pushing the
> envelope and playing it safe. The speaker was concerned we might be moving
> too far away from encouraging new ideas.
>
>
>
> 10) Releasing code will in the end help the impact of work. If you look at
> high impact work in our field, it often centers around a resource (e.g. Penn
> Treebank). Releasing code can also help people in industry, because
> sometimes publishing code is the only way that it will ever get out (e.g.
> sentence alignment code from CL in 1993 by Gale and Church).
>
>
>
> 11) Have a retroactive prize after a few years for software systems that are
> released and are proven to have some impact.
>
>
>
> 12) During the discussion of the new journal, it was mentioned that maybe
> that could be a vehicle for releasing code and data.
>
>
>
> I'm grateful that the ACL opened up the business meeting to these kinds of
> remarks, and really appreciate both the opportunity to say a few words, and
> also hear all these different views. It's given me a lot to think about, and
> I just wanted to pass along my notes in the hopes of encouraging others to
> do the same. Keep talking. :)
>
>
>
> Enjoy,
>
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
>
>
