[open-science] OKF tools: ckan.org, thedatahub.org

Wed Apr 4 18:32:24 UTC 2012

Peter's right about metadata. One of the hardest nuts to crack re:
data repositories and making data more discoverable seems to be
metadata: how do we get scientists to supply metadata without creating
more work for them, and how do we easily convert subject-specific
metadata formats into more general formats (or vice versa)?
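
To make the conversion question concrete, here is a minimal sketch of a
field "crosswalk" in Python. The subject-specific field names on the
left are invented for illustration, and the general names on the right
are borrowed from the Dublin Core element set. It is an assumption about
how such a mapping might look, not a description of any existing tool.

```python
# Hypothetical crosswalk: the keys are invented examples of
# discipline-specific fields; the values are Dublin Core element names.
CROSSWALK = {
    "compound_name": "title",
    "investigator": "creator",
    "measured_on": "date",
    "instrument": "description",
}


def to_general(record, crosswalk=CROSSWALK):
    """Map a subject-specific record onto general fields.

    Fields without a mapping are kept under 'unmapped' rather than
    dropped, so the conversion loses no information.
    """
    general, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            general[crosswalk[key]] = value
        else:
            unmapped[key] = value
    if unmapped:
        general["unmapped"] = unmapped
    return general


def to_specific(general, crosswalk=CROSSWALK):
    """Invert the mapping (the 'vice versa' direction)."""
    reverse = {v: k for k, v in crosswalk.items()}
    specific = {reverse[k]: v for k, v in general.items() if k in reverse}
    specific.update(general.get("unmapped", {}))
    return specific
```

Keeping unmapped fields around is what makes the round trip possible:
converting to the general form and back returns the original record.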

Creating a tool (or building the capability into a larger tool) that
would automate any part of that process for researchers in a specific
subject area(s) would be quite valuable. Better crowdsourcing tools,
as Peter points out, could also be a good lead. Or maybe building that
mythical add-on to scientific instruments that would automatically
extract relevant metadata (which William Gunn talked about in his Open
Science Summit 2012 talk) could be a good direction to move?

Or perhaps a discipline-specific virtual archive (a la SEAD
http://sead-data.net/) that would pull datasets from a variety of
repositories (national, institutional, subject repos, etc) in an
automated fashion, to centralize research and build upon existing data
(and allow other researchers to add their own metadata that's relevant
to their discipline)? If we were to build upon existing repositories
in a novel way, and do it in a manner that was less hands-on than
SEAD, it would address the NSF call for proposals quoted below in a
few ways. Also, if we could build an open tool that others could easily
adapt for other disciplines, it would be a pretty Big Deal.
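
As a rough illustration of the virtual-archive idea, here is a toy
in-memory sketch in Python. The class name, method names, and record
layout are all hypothetical; a real harvester would wrap each
repository's own API where the `fetch` callable appears, and nothing
here reflects how SEAD actually works.

```python
from collections import defaultdict


class VirtualArchive:
    """Toy in-memory index over records pulled from several repositories."""

    def __init__(self):
        self.records = {}                     # dataset id -> merged record
        self.annotations = defaultdict(list)  # dataset id -> added metadata

    def harvest(self, repo_name, fetch):
        """Pull records from one repository and merge them by id.

        `fetch` is any zero-argument callable returning an iterable of
        dicts that each contain at least an 'id' key (an assumption of
        this sketch). Returns the number of records seen.
        """
        count = 0
        for rec in fetch():
            merged = self.records.setdefault(rec["id"], {"sources": []})
            merged.update({k: v for k, v in rec.items() if k != "sources"})
            merged["sources"].append(repo_name)
            count += 1
        return count

    def annotate(self, dataset_id, discipline, metadata):
        """Let a researcher attach discipline-specific metadata."""
        if dataset_id not in self.records:
            raise KeyError(dataset_id)
        self.annotations[dataset_id].append(
            {"discipline": discipline, **metadata}
        )

    def view(self, dataset_id):
        """Return the merged record together with its annotations."""
        return {
            **self.records[dataset_id],
            "annotations": list(self.annotations.get(dataset_id, [])),
        }
```

The point of the design is that harvested records and researcher-added
annotations live in separate layers, so re-harvesting a repository
never overwrites what a discipline has added.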

On Wed, Apr 4, 2012 at 3:10 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>
> On Wed, Apr 4, 2012 at 2:56 AM, Carl Boettiger <cboettig at gmail.com> wrote:
>>
>> Peter is certainly right about the huge & growing demand for data
>> repositories.  Forgive the cut & paste below, but for the US-based among us at
>> least, the NSF has just put out a call for small ($250K) and medium ($1M)
>> grants to do something like this.  In my opinion it's not a bad checklist.
>> Of course as this thread has already highlighted, half of the battle is just
>> knowing what's already out there and getting it adopted...
>>
> Great. Presumably these have to be held by a US PI but there is no reason
> why the OKF should not contract to the PI. Carl - are you planning to apply?
>
> I think of companies like Kitware - an Open Source company for whom I have
> great respect. They are the sort of org that I think the NSF might be
> thinking about. But I think we should also try to identify other US
> contacts. My comments on the material below
>
>>
>> E-science collaboration environments (ESCE). A comprehensive "big data"
>> cyberinfrastructure is necessary to allow for broad communities of
>> scientists and engineers to have access to diverse data and to the best and
>> most usable inferential and visualization tools. Potential research areas
>> include, but are not limited to:
>>
>> Novel collaboration environments for diverse and distant groups of
>> researchers and students to coordinate their work (e.g., through data and
>> model sharing and software reuse, tele-presence capability, crowdsourcing,
>> social networking capabilities) with greatly enhanced efficiency and
>> effectiveness for the scientific collaboration;
>
> OKF leads the world here. BTW for those of you not in "grid-based" computing
> there is a huge need for social networking. I went to India for EU-India
> grid and wrote a position paper about how it should be a social as well as a
> technical grid
>>
>> Automation of the discovery process (e.g., through machine learning, data
>> mining, and automated inference);
>
>  Metadata => CKAN.
>>
>> Automated modeling tools to provide multiple views of massive data sets
>> that are useful to diverse disciplines;
>
> I am more concerned with the views on massive numbers of medium size data
> sets
>>
>> New data curation techniques for managing the complex and large flow of
>> scientific output in a multitude of disciplines;
>
> Possibly. IMO the problems are social first and technical second
>>
>> Development of systems and processes that efficiently incorporate
>> autonomous anomaly and trend detection with human interaction, response, and
>> reaction;
>
> Yes. I did something on a very small scale with high-throughput
> computational chemistry
>>
>> End-to-end systems that facilitate the development and use of scientific
>> workflows and new applications;
>
> This has been a rathole for many years. Tom Oinn and I have worked on
> Taverna - the Manchester eScience workflow engine. I've developed my own for
> chemistry. But there is no one-size-fits-all
>>
>> New approaches to development of research questions that might be pursued
>> in light of access to heterogeneous, diverse, big data;
>
> When we get the Open data there will be no shortage of people doing this!
>>
>> New models for cross-disciplinary information fusion and knowledge
>> sharing;
>
> This again comes down to metadata and social aspects
>>
>> New approaches for effective data, knowledge, and model sharing and
>> collaboration across multiple domains and disciplines;
>
> The first new approach is to recognize it needs funding for sustainability!
>>
>> Securing access to data using innovative techniques to prevent excessive
>> replication of data to external entities;
>
> Yes. This is 95% social
>>
>> Providing secure and controlled role-based access to centrally managed
>> data environments;
>
> Expensive and normally only possible for the people with the data. I pass on
> these issues
>>
>> Remote operation, scheduling, and real-time access to distant instruments
>> and data resources;
>
> This is quite well advanced. It's big science
>>
>> Protection of privacy and maintenance of security in aggregated personal
>> and proprietary data (e.g., de-identification);
>
> Pass
>>
>> Generation of aggregated or summarized data sets for sharing and analyses
>> across jurisdictional and other end user boundaries; and
>
> Again the OKF could tackle this
>>
>> E-publishing tools that provide unique access, learning, and development
>> opportunities.
>>
>>
>
>
> And this is a major area. There is a desperate need for Open tools. But it
> needs a lot of work
>
> In Summary:
> My own emphasis would be to stress the "long-tail" science. Big science
> always gets the money. This should be a long-tail charter.
>
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069