[open-science] Fwd: [Corpora-List] discussion on reproducibility at ACL 2011 business meeting

Fri Jul 8 13:19:15 UTC 2011

Hey Open Scientists,

I don't know if any of you are subscribed to the ECOLOG-L, but this thread
came up over the past few weeks and it was a pretty interesting read about
publishing code and data together with research. I'm not quite sure where to
go with this - I haven't responded - but it might be worth plugging the work
of the OKF and the Open Science list as a response? Let me know what you
think and I'll do it.

That's all,
Richard Littauer

---------- Forwarded message ----------
From: Rich Cooper <rich at englishlogickernel.com>
Date: Mon, Jul 4, 2011 at 8:53 PM
Subject: Re: [Corpora-List] discussion on reproducibility at ACL 2011
business meeting
To: Ruvan Weerasinghe <arw at ucsc.cmb.ac.lk>, tpederse at d.umn.edu
Cc: nlpatumd at yahoogroups.com, corpora at uib.no

Hi Ruvan,

Generally, publishing academic code is a good idea, but publishing real code
isn’t feasible.  If the code is small enough to publish, it is merely an
aspect of the full requirements for linguistic interaction.  But even
programs of less than a hundred thousand lines simply don’t have that much
functionality.

However, publishing code snippets, like the Link Grammar developers did,
would be very useful.  In the LG case, the publications included a very
clear exposition of how constraint propagation can be applied to simple
context free grammars to cover, at most, ten percent of the kind of
linguistic conversations that are required.  But the abstraction is limited
to parsing, not to linguistic interaction.  That made the complexity of the
ideas match the originality of the published procedures.

I agree that publishing such abstraction snippets is a good idea, but only
for appropriate levels of detail.  Beyond that, it gets unreasonably
complicated for others to learn from effectively.

Abstractions teach.  But full code publication merely confuses.  So it
should be a question of which papers should contain published algorithms to
demonstrate simple slices of a full system.

-Rich

Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
  ------------------------------

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Ruvan Weerasinghe
Sent: Sunday, July 03, 2011 6:47 PM
To: tpederse at d.umn.edu
Cc: nlpatumd at yahoogroups.com; corpora at uib.no
Subject: Re: [Corpora-List] discussion on reproducibility at ACL 2011
business meeting

Maybe we can address some of the issues raised by talking to the Biology
(Bioinformatics) people, who seem to make publishing data and code a
precondition for publication?

Regards.

Ruvan Weerasinghe
University of Colombo School of Computing
Colombo 00700,
Sri Lanka.

Web:    http://www.ucsc.lk
Phone:  +94112158953; Fax:    +94112587239
 ------------------------------

From: "Ted Pedersen" <tpederse at d.umn.edu>
To: corpora at uib.no
Cc: nlpatumd at yahoogroups.com
Sent: Sunday, July 3, 2011 11:40:05 PM
Subject: [Corpora-List] discussion on reproducibility at ACL 2011
business meeting

Greetings all,

I made a few remarks during the ACL 2011 business meeting in favor of this
year's innovation of allowing submissions of data and code along with
paper submissions. I suggested this is something we want to continue and
encourage, particularly for papers submitted to the empirical track at ACL
(which is the majority of papers these days) so that we might be able to
reproduce results more easily. I had some slides prepared that I didn't use,
but I've put them here; they summarize at least part of what I said (I
forgot a few points, but the gist is fairly consistent, I guess...):

http://www.slideshare.net/duluthted/pedersen-acl2011businessmeeting

There were quite a few comments thereafter and I took a few notes, and I
guess I thought it would be possibly useful to preserve these "for the
record" at least, since I think that discussion raised many of the common
concerns about this issue. It might also be an opportunity for folks to
follow up or at least continue thinking.

Below are the comments, approximately in the order made....note that I'm
trying here to simply reproduce the gist of comments, and not offer any
opinion on them. I think it was great there was such an extensive
discussion, and I guess I just wanted to note that and preserve it as best I
could. If anyone feels like they have been misquoted, forgotten, or
misunderstood, please feel free to jump in and elaborate.

0) The speaker was in support of encouraging more submissions of code and
data, and noted that he was happy to see quite a few presentations at ACL
where code and data were being made available.

1) Data is sometimes expensive to create (especially speech data) and
releasing it after one publication may not be in the best interests of the
creators.

2) Reviewing code is time-consuming (and another concern raised during the
business meeting was reviewer overload, so this certainly fit into that
theme).

3) It is often hard or impossible for people in industrial settings to
release code - the licensing issues are sometimes very complex and would
need to be resolved before any code was submitted.

4) There could be a prize offered for the best code / best data submitted.

5) It is hard to know how to review software.

6) Maybe software could be made available on an ACL cloud, in order to solve
some licensing concerns (especially of industry).

7) Code at submission time is very hard to anonymize - maybe we
need separate reviewers for code and data (from paper).

8) Simply releasing or submitting code isn't necessarily useful (if it is
bad code). How do we make sure the code is of high quality and/or useful?

9) There is a tension between having new and exciting ideas and producing
well-engineered code. Put another way, there's a tension between pushing the
envelope and playing it safe. The speaker was concerned we might be moving
too far away from encouraging new ideas.

10) Releasing code will in the end help the impact of work. If you look at
high-impact work in our field, it often centers around a resource (e.g., Penn
Treebank). Releasing code can also help people in industry, because
sometimes publishing code is the only way that it will ever get out (e.g.,
sentence alignment code from CL in 1993 by Gale and Church).

11) Have a retroactive prize after a few years for software systems that are
released and are proven to have some impact.

12) During the discussion of the new journal, it was mentioned that maybe
that could be a vehicle for releasing code and data.

I'm grateful that the ACL opened up the business meeting to these kinds of
remarks, and really appreciate both the opportunity to say a few words, and
also hear all these different views. It's given me a lot to think about, and
I just wanted to pass along my notes in the hopes of encouraging others to
do the same. Keep talking. :)

Enjoy,

Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
