[openbiblio-dev] github

ianibbo at gmail.com ianibbo at gmail.com
Thu Nov 17 10:36:14 UTC 2011

(With apologies to PMR for the duplicate; I forgot to cc the list the first time.)

This is vaguely reminiscent of a discussion I was having with Owen
Stephens on Twitter (and in person with Tony H last weekend) about a
bib-hub style service. It's not clear to me from your text exactly
which features of GitHub you're talking about exploiting, but I think
there's a massive parallel between some of the big-scale issues we
have with bib data management and git-style branch/merge. I suspect
you would be missing a huge trick if you were to treat git as
essentially "just another bitstore" and not look carefully at its
workflow.

I also think this points back to a discussion on this list that
questioned the wisdom of creating yet another ultimate
copy/aggregation of bib data when what we really need to do is manage
the collections already out there better.
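
To make the branch/merge parallel above a bit more concrete, here is a
rough sketch of what a git-style three-way merge could look like at the
level of a single bib record rather than a text file. It is only an
illustration: the function name merge_record, the flat dict-of-fields
record shape and the sample data are all my own assumptions, not
anything git, GitHub or BibServer provides.

```python
def merge_record(base, ours, theirs):
    """Three-way merge of one bibliographic record, field by field.

    base   -- the common ancestor version of the record
    ours   -- our edited copy
    theirs -- someone else's edited copy
    Returns (merged, conflicts); conflicts lists the fields that both
    sides changed to different values.
    """
    merged, conflicts = {}, []
    for field in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(field), ours.get(field), theirs.get(field)
        if o == t:
            value = o              # both sides agree (or neither changed it)
        elif o == b:
            value = t              # only "theirs" changed this field
        elif t == b:
            value = o              # only "ours" changed this field
        else:
            conflicts.append(field)
            value = o              # keep "ours" provisionally; needs a human
        if value is not None:      # None means the field was deleted
            merged[field] = value
    return merged, conflicts


# Hypothetical sample data: two people improve the same record differently.
base = {"title": "On the origin of species", "author": "Darwin, C.", "year": "1859"}
ours = dict(base, author="Darwin, Charles")      # we expand the author name
theirs = dict(base, publisher="John Murray")     # they add a publisher

merged, conflicts = merge_record(base, ours, theirs)
```

Here the two edits touch different fields, so they combine without a
conflict. Only when both sides change the same field to different
values does a person have to step in, which is the workflow I think is
worth looking at carefully.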


On 17 November 2011 09:52, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
> I have widened this out to openbiblio-dev.
> Yesterday Mark, Jim, Ed Chamberlain and I attended the openbiblio skype
> (http://openbiblio.okfnpad.org/catchup? ) and one of the ideas that came up
> was of using GitHub as a way of managing BibSoups. The idea is, I think,
> that each BibSoup has a Bibserver (probably running in a cloud somewhere)
> and that the raw content of the BibSoup is managed by Github. I include the
> last mail below for more detail.
> Github seems to me a wonderful idea - assuming the technology works. Github
> is more than simply a distributed repository of content as it also provides
> a social framework (and people have proposed a Github of science, for
> example).
> I haven't used Git much myself (we use Mercurial for source code) but one of
> the papers in the PMR symposium was communally authored in LaTeX on Github.
> I am very enthusiastic about the idea of using a source code management
> (SCM) system. Given that our material is JSON there should be a lot of
> native support for that in Github. Bibliography refactoring is going to be
> similar to code - the bulk of the material is unchanged ASCII and there will
> be minor changes. (It's only when you sort or drastically change every entry
> that it struggles). This is a real win for us! It may also be that we can
> develop tools that run under systems such as Eclipse.
> P.
> On Wed, Nov 16, 2011 at 10:49 PM, Jim Pitman <pitman at stat.berkeley.edu>
> wrote:
>> Mark, I reply with copy to Thomas Krichel, who is one of our primary
>> suppliers of large datasets, for his reactions/suggestions.
>> > >> Here are details about using a repo:
>> > >>
>> > >> http://bibserver.okfn.org/howto/use-git-to-collaboratively-work-on-bibliographic-collections/
>> > >
>> Thomas, I'd be very interested in your reaction to Mark's suggestion to
>> use github as a data repo as well as a code repo for BKN efforts. It seems
>> that it should not be hard to base RePEc like aggregations on a collection
>> of biblio datasets maintained on GitHub by distributed owners. The
>> advantage of GitHub seems to be its version control, capability to
>> provide shared ownership of a dataset, and the fact that anyone can
>> upload to GitHub from any computer; they don't need to manage or have
>> access to a server to post their stuff.
>> Mark, I am also interested in the issue that after you have expanded
>> a biblio data collection to index it with elastic search, you have
>> some large index file somewhere, over which the BibServer runs.
>> Am I right that it is possible to clearly separate the state of the
>> elastic search cache from the BibServer which runs over it?
>> So we can think of the BibServer as a tool for accessing this cache, but
>> that the same cache might also be accessed by other software systems?
>> I'd like to understand this better.
>> How big do you anticipate these search index caches getting? We should
>> perhaps think about saving the whole index to preserve snapshots of the
>> state of a BibServer index.  This is not an immediate concern:
>> short term we can rely on individuals managing their own data, and
>> make no promises about stability of BibServer aggregation nodes. But I
>> think this is potentially an issue down the road.
>> Thomas, do you do snapshots of RePEc? If so how large are they?
>> Or do you rely on continuous updates from the distributed repos?
>> > > Do you see any issues about size of files posted to github?
>> > > It would be best to know about those in advance.
>> >
>> > I have not found anything official on it, but there are people
>> > discussing problems when trying to upload 750MB - so we should have a
>> > long way to go before it becomes a problem.
>> Should keep us going for a while.
>> --Jim
>> ----------------------------------------------
>> Jim Pitman
>> Professor of Statistics and Mathematics
>> University of California
>> 367 Evans Hall # 3860
>> Berkeley, CA 94720-3860
>> ph: 510-642-9970  fax: 510-642-7892
>> e-mail: pitman at stat.berkeley.edu
>> URL: http://www.stat.berkeley.edu/users/pitman
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
> _______________________________________________
> openbiblio-dev mailing list
> openbiblio-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/openbiblio-dev

Ian Ibbotson
W: http://ianibbo.me
E: ianibbo at gmail.com
skype: ianibbo
twitter: ianibbo
