pm286 at cam.ac.uk
Thu Nov 17 09:52:00 UTC 2011
I have widened this out to openbiblio-dev.
Yesterday Mark, Jim, Ed Chamberlain and I attended the openbiblio Skype call (
http://openbiblio.okfnpad.org/catchup? ) and one of the ideas that came up
was of using GitHub as a way of managing BibSoups. The idea is, I think,
that each BibSoup has a Bibserver (probably running in a cloud somewhere)
and that the raw content of the BibSoup is managed by Github. I include the
last mail below for more detail.
Github seems to me a wonderful idea - assuming the technology works. Github
is more than simply a distributed repository of content as it also provides
a social framework (and people have proposed a Github of science, for
I haven't used Git much myself (we use mercurial for source code) but one
of the papers in the PMR symposium was communally authored in Latex on
I am very enthusiastic about the idea of using a source code management
(SCM) system. Given that our material is JSON there should be a lot of
native support for that in Github. Bibliography refactoring is going to be
similar to code - the bulk of the material is unchanged ASCII and there
will be minor changes. (It's only when you sort or drastically change every
entry that diffing struggles.) This is a real win for us! It may also be that we
can develop tools that run under systems such as eclipse.
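To illustrate the point about small diffs, here is a minimal sketch (not anything BibServer actually does) of writing a collection in a stable order before committing it, so that editing one entry changes one line. The field names "id" and "title" and both function names are made up for the example.

```python
import difflib
import json


def canonical_dump(records, key="id"):
    """Serialize records one per line, sorted by `key`, with sorted field names.

    The "id" key is an assumption for this sketch; any stable identifier works.
    """
    ordered = sorted(records, key=lambda r: str(r.get(key, "")))
    lines = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in ordered]
    return "\n".join(lines) + "\n"


def changed_lines(old_text, new_text):
    """Return only the added/removed lines of a unified diff (no headers, no context)."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

With this, two dumps of the same records in a different input order are byte-identical, and correcting a single title shows up as one removed and one added line. Without a fixed order, the same edit could touch every line, which is the sorting case mentioned above.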
On Wed, Nov 16, 2011 at 10:49 PM, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> Mark, I reply with copy to Thomas Krichel, who is one of our primary
> suppliers of large datasets, for his reactions/suggestions.
> > >> Here are details about using a repo:
> > >>
> > >
> Thomas, I'd be very interested in your reaction to Mark's suggestion to
> use github as a data repo as well as a code repo for BKN efforts. It seems
> that it should not be hard to base RePEc like aggregations on a collection
> of biblio datasets maintained on Github by distributed owners. The
> advantage of GitHub seems to be its version control, capability to provide shared
> ownership of a dataset, and the fact that anyone can upload to github from
> any computer; they don't need to manage or have access to a server to post
> their stuff.
> Mark, I am also interested in the issue that after you have expanded
> a biblio data collection to index it with elastic search, you have some
> index file somewhere, over which the BibServer runs.
> Am I right that it is possible to clearly separate the state of the
> elastic search cache from the BibServer which runs over it?
> So we can think of the BibServer as a tool for accessing this cache, but
> that the same cache might also be accessed by other software systems?
> I'd like to understand this better.
> How big do you anticipate these search index caches getting? We should
> perhaps think about saving the whole index to preserve snapshots of the
> state of a BibServer index. This is not an immediate concern:
> short term we can rely on individuals managing their own data, and make no
> promises about stability of BibServer aggregation nodes. But I think this
> is potentially an issue down the road.
> Thomas, do you do snapshots of RePEc? If so how large are they?
> Or do you rely on continuous updates from the distributed repos?
> > > Do you see any issues about size of files posted to github?
> > > It would be best to know about those in advance.
> > I have not found anything official on it, but there are people
> > discussing problems when trying to upload 750MB - so we should have a
> > long way to go before it becomes a problem.
> Should keep us going for a while.
> Jim Pitman
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
> ph: 510-642-9970 fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK