[openbiblio-dev] github

Rufus Pollock rufus.pollock at okfn.org
Thu Nov 17 21:14:29 UTC 2011


On 17 November 2011 09:52, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
> I have widened this out to openbiblio-dev.
>
> Yesterday Mark, Jim, Ed Chamberlain and I attended the openbiblio skype
> (http://openbiblio.okfnpad.org/catchup? ) and one of the ideas that came up
> was of using GitHub as a way of managing BibSoups. The idea is, I think,
> that each BibSoup has a Bibserver (probably running in a cloud somewhere)
> and that the raw content of the BibSoup is managed by Github. I include the
> last mail below for more detail.

I think you could do a find-and-replace throughout your text, swapping
github for git :-) Mostly you seem to be talking about versioned storage
for material / data. (Or is there something specific to GitHub?)

> Github seems to me a wonderful idea - assuming the technology works. Github
> is more than simply a distributed repository of content as it also provides
> a social framework (and people have proposed a Github of science, for
> example).

The social framework is also what gives it lock-in power ...

Off-topic: what exactly would "a github for science" mean (other than:
distributed collaboration is cool and tools that enable it are cool :-))?
We all want this, but it's hard (thedatahub.org is in part trying to do
exactly this).

Also, again: do you mean GitHub (a hosted 'social' service with good UX)
or git (distributed revision control)?

> I haven't used Git much myself (we use Mercurial for source code) but one of
> the papers in the PMR symposium was communally authored in LaTeX on GitHub.

Right, but there is nothing specific to GitHub there (though it is nice).
Back in the day we wrote papers with svn, though the distributed revision
control of git / hg / ... makes this easier in a fundamental way.

> I am very enthusiastic about the idea of using a source code management
> (SCM) system. Given that our material is JSON there should be a lot of

I'm not very enthusiastic, at least not yet. What is it buying you?
Line-separated text (what git / hg / etc operate on) is fundamentally
different from structured data. You can fit a square peg into a round
hole, and if the round hole is sufficiently wonderful in some way it may
be worth it, but I'm dubious. Bibliographic data *is* data and has
structure different from line-oriented text. For more on this, see the
bottom part of this recent post:

<http://rufuspollock.org/2011/10/17/weekly-update-rufus-pollock-2/>

and some of its refs e.g.

<http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/>

Maybe git / hg (and text-based tools generally) are just so good that
this is worth doing (as I think Etienne suggests), but we'd need to
think hard about it IMO.
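For concreteness, here is the kind of serialization I have in mind when
I say "line-separated text" -- a minimal sketch only, and the BibJSON-ish
field names are my own invention rather than anything BibServer mandates:

    import json

    # Hypothetical BibJSON-ish records -- the field names are illustrative only.
    records = [
        {"id": "smith2001", "type": "article", "title": "An Example Paper",
         "author": [{"name": "A. Smith"}]},
        {"id": "jones2011", "type": "article", "title": "Another Example Paper",
         "author": [{"name": "B. Jones"}], "cites": ["smith2001"]},
    ]

    def to_lines(records):
        # Canonical, diff-friendly form: one record per line, keys sorted,
        # records ordered by id, so re-serialising the same data is stable.
        for rec in sorted(records, key=lambda r: r["id"]):
            yield json.dumps(rec, sort_keys=True, ensure_ascii=False)

    with open("collection.jsonl", "w") as f:
        for line in to_lines(records):
            f.write(line + "\n")

With a canonical form like that, editing one record touches one line,
which git diffs and merges well; sorting differently or re-keying every
record touches every line, which is exactly the "struggles" case Peter
mentions below.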

> native support for that in Github. Bibliography refactoring is going to be
> similar to code - the bulk of the material is unchanged ASCII and there will
> be minor changes. (It's only when you sort or drastically change every entry

OK, this is the crucial claim. The distilled question:

How well does bibliographic data serialize to line-separated text?

IMO not that well: biblio data has references whose integrity matters
(between records, between records and collections, between records and
entities (people)), it has structured fields, etc.

I may be wrong here, and there are substantial attractions in the tool
chain (as Etienne alludes to) if we can make this work in the form of
git etc.
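To make the integrity point concrete: git will happily merge two edits
that each look fine line by line but together leave a dangling
reference. A sketch of the sort of check we would have to bolt on
ourselves (the "cites" field is again hypothetical):

    import json

    def check_references(path):
        # Report records whose cross-references point at ids that are
        # missing from the collection -- a line-level git merge won't notice.
        with open(path) as f:
            records = [json.loads(line) for line in f if line.strip()]
        ids = {rec["id"] for rec in records}
        problems = []
        for rec in records:
            for ref in rec.get("cites", []):  # made-up cross-record field
                if ref not in ids:
                    problems.append((rec["id"], ref))
        return problems

    # e.g. one branch deletes the record "smith2001" while another adds a
    # record citing it; both merge cleanly in git, but afterwards:
    #   check_references("collection.jsonl")  ->  [("jones2011", "smith2001")]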

> that it struggles). This is a real win for us! It may also be that we can
> develop tools that run under systems such as Eclipse.

I'm not sure Eclipse is the benefit -- we can already serialize to text
and use whatever editor takes your fancy. The real benefit is version
control, and especially distributed version control. That depends
fundamentally on how well your structure serializes to line-broken text
(plus size and referentiality).
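One way to make "how well does it serialize" measurable rather than a
matter of taste: count what fraction of the serialized file a typical
edit actually touches. A quick sketch, using nothing BibServer-specific:

    import difflib

    def churn(old_lines, new_lines):
        # Fraction of the file touched by a change in a line-based diff --
        # a rough proxy for how well edits map onto git's model.
        sm = difflib.SequenceMatcher(None, old_lines, new_lines)
        unchanged = sum(block.size for block in sm.get_matching_blocks())
        return 1.0 - float(unchanged) / max(len(old_lines), len(new_lines), 1)

    # Editing one record in a 10,000-line collection gives churn of roughly
    # 0.0001; re-keying or re-ordering every record gives churn close to 1.0,
    # at which point the version history stops telling you anything useful.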

Rufus

> On Wed, Nov 16, 2011 at 10:49 PM, Jim Pitman <pitman at stat.berkeley.edu>
> wrote:
>>
>> Mark, I reply with copy to Thomas Krichel, who is one of our primary
>> suppliers of large datasets, for his reactions/suggestions.
>>
>> > >> Here are details about using a repo:
>> > >>
>> > >> http://bibserver.okfn.org/howto/use-git-to-collaboratively-work-on-bibliographic-collections/
>> > >
>>
>> Thomas, I'd be very interested in your reaction to Mark's suggestion to
>> use GitHub as a data repo as well as a code repo for BKN efforts. It seems
>> that it should not be hard to base RePEc-like aggregations on a collection
>> of biblio datasets maintained on GitHub by distributed owners. The
>> advantage of GitHub seems to be its version control, its capability to
>> provide shared ownership of a dataset, and the fact that anyone can upload
>> to GitHub from any computer; they don't need to manage or have access to a
>> server to post their stuff.
>>
>> Mark, I am also interested in the issue that after you have expanded a
>> biblio data collection to index it with Elasticsearch, you have some large
>> index file somewhere, over which the BibServer runs.
>> Am I right that it is possible to clearly separate the state of the
>> Elasticsearch cache from the BibServer which runs over it?
>> So we can think of the BibServer as a tool for accessing this cache, but
>> that the same cache might also be accessed by other software systems?
>> I'd like to understand this better.
>> How big do you anticipate these search index caches getting? We should
>> perhaps think about saving the whole index to preserve snapshots of the
>> state of a BibServer index. This is not an immediate concern: short term
>> we can rely on individuals managing their own data, and make no promises
>> about stability of BibServer aggregation nodes. But I think this is
>> potentially an issue down the road.
>>
>> Thomas, do you do snapshots of RePEc? If so how large are they?
>> Or do you rely on continuous updates from the distributed repos?
>>
>> > > Do you see any issues about size of files posted to github?
>> > > It would be best to know about those in advance.
>> >
>> > I have not found anything official on it, but there are people
>> > discussing problems when trying to upload 750MB - so we have a long
>> > way to go before it becomes a problem.
>>
>> Should keep us going for a while.
>>
>> --Jim
>>
>> ----------------------------------------------
>> Jim Pitman
>> Professor of Statistics and Mathematics
>> University of California
>> 367 Evans Hall # 3860
>> Berkeley, CA 94720-3860
>>
>> ph: 510-642-9970  fax: 510-642-7892
>> e-mail: pitman at stat.berkeley.edu
>> URL: http://www.stat.berkeley.edu/users/pitman
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
> _______________________________________________
> openbiblio-dev mailing list
> openbiblio-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/openbiblio-dev
>
>



-- 
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/



