[openbiblio-dev] github

Peter Murray-Rust pm286 at cam.ac.uk
Thu Nov 17 09:52:00 UTC 2011


I have widened this out to openbiblio-dev.

Yesterday Mark, Jim, Ed Chamberlain and I attended the openbiblio Skype call (
http://openbiblio.okfnpad.org/catchup? ) and one of the ideas that came up
was using GitHub as a way of managing BibSoups. The idea is, I think,
that each BibSoup has a BibServer (probably running in a cloud somewhere)
and that the raw content of the BibSoup is managed by GitHub. I include the
last mail below for more detail.

GitHub seems to me a wonderful idea - assuming the technology works. GitHub
is more than simply a distributed repository of content, as it also provides
a social framework (and people have proposed a "GitHub of science", for
example).

I haven't used Git much myself (we use Mercurial for source code) but one
of the papers in the PMR symposium was communally authored in LaTeX on
GitHub.

I am very enthusiastic about the idea of using a source code management
(SCM) system. Given that our material is JSON, which is plain text, GitHub
should handle it much as it handles source code. Bibliography refactoring is
going to be similar to code - the bulk of the material is unchanged ASCII and
there will be minor changes. (It's only when you sort or drastically change
every entry that diffing struggles). This is a real win for us! It may also be
that we can develop tools that run under systems such as Eclipse.
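
A minimal sketch of the point about minor changes, not a description of how
BibServer actually stores anything: if records are always written out in a
stable order with sorted keys, a one-field edit shows up as a one-line diff
even when the entries were shuffled in memory. The "id" field and both
function names are assumptions made up for this illustration.

```python
import difflib
import json


def canonical_dump(records):
    """Serialize records deterministically: records ordered by their
    (assumed) "id" field, keys sorted, one key per line."""
    ordered = sorted(records, key=lambda r: r.get("id", ""))
    return json.dumps(ordered, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def changed_lines(old_text, new_text):
    """Return only the added/removed lines of a unified diff,
    skipping the '---'/'+++' file headers and '@@' hunk markers."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical records, invented for the example.
before = canonical_dump([
    {"id": "b2", "title": "Second paper", "year": 2010},
    {"id": "a1", "title": "First paper", "year": 2009},
])
# Same collection, entries in a different order, one year corrected.
after = canonical_dump([
    {"year": 2011, "title": "First paper", "id": "a1"},
    {"id": "b2", "title": "Second paper", "year": 2010},
])

for line in changed_lines(before, after):
    print(line)
```

Without the fixed ordering, the same edit would rewrite most of the file and
the version history would be much harder to read.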

P.

On Wed, Nov 16, 2011 at 10:49 PM, Jim Pitman <pitman at stat.berkeley.edu> wrote:

> Mark, I reply with copy to Thomas Krichel, who is one of our primary
> suppliers of large datasets, for his reactions/suggestions.
>
> > >> Here are details about using a repo:
> > >>
> http://bibserver.okfn.org/howto/use-git-to-collaboratively-work-on-bibliographic-collections/
> > >
>
> Thomas, I'd be very interested in your reaction to Mark's suggestion to
> use GitHub as a data repo as well as a code repo for BKN efforts. It seems
> that it should not be hard to base RePEc-like aggregations on a collection
> of biblio datasets maintained on GitHub by distributed owners. The
> advantages of GitHub seem to be its version control, its capability to
> provide shared ownership of a dataset, and the fact that anyone can upload
> to GitHub from any computer; they don't need to manage or have access to a
> server to post their stuff.
>
> Mark, I am also interested in the issue that after you have expanded
> a biblio data collection to index it with elastic search, you have some
> large index file somewhere, over which the BibServer runs.
> Am I right that it is possible to clearly separate the state of the
> elastic search cache from the BibServer which runs over it?
> So we can think of the BibServer as a tool for accessing this cache, but
> that the same cache might also be accessed by other software systems?
> I'd like to understand this better.
> How big do you anticipate these search index caches getting? We should
> perhaps think about saving the whole index to preserve snapshots of the
> state of a BibServer index.  This is not an immediate concern:
> short term we can rely on individuals managing their own data, and make no
> promises about stability of BibServer aggregation nodes. But I think this
> is potentially an issue down the road.
>
> Thomas, do you do snapshots of RePEc? If so how large are they?
> Or do you rely on continuous updates from the distributed repos?
>
> > > Do you see any issues about size of files posted to github?
> > > It would be best to know about those in advance.
> >
> > I have not found anything official on it, but there are people
> > discussing problems when trying to upload 750MB - so we should have a
> > long way to go before it becomes a problem.
>
> Should keep us going for a while.
>
> --Jim
>
> ----------------------------------------------
> Jim Pitman
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
>
> ph: 510-642-9970  fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069