[okfn-discuss] APIs vs bulk data, in the face of a government shutdown

William Waites ww at eris.okfn.org
Thu Oct 3 18:37:11 UTC 2013


On Thu, 3 Oct 2013 17:41:47 +0100, Rufus Pollock <rufus.pollock at okfn.org> said:

    > And http://data.okfn.org/ which went live in April runs purely
    > off data package datasets stored in git repos on github (at
    > http://github.com/datasets)

I think I saw this pass by at the time. I wonder how this would work
for datasets that are volatile. The obvious thing to do when there is a
new revision of the dataset would be to patch the repository with a
normal commit. If the data publishers do this it makes life very easy
for them, because they only have to transmit a diff and some metadata.
On the other hand, if the data changes frequently it might push the
limits of git's data structures.
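
For concreteness, a minimal sketch of what the publisher-side update
could look like, assuming one CSV file per site and plain command-line
git (the repository path and file name below are made up for
illustration, not taken from any real dataset):

    #!/usr/bin/env python
    # publish_update.py -- append the day's readings and commit (illustrative only)
    import csv
    import os
    import subprocess
    from datetime import date

    REPO = "/srv/data/tides"           # hypothetical local clone of the dataset repo
    SITE_FILE = "data/aberdeen.csv"    # hypothetical per-site file

    def append_readings(rows):
        """Append (timestamp, height_m) rows to the site's CSV."""
        with open(os.path.join(REPO, SITE_FILE), "a", newline="") as f:
            csv.writer(f).writerows(rows)

    def commit_and_push(message):
        """Record the new revision; git only transmits the delta on push."""
        subprocess.check_call(["git", "-C", REPO, "add", SITE_FILE])
        subprocess.check_call(["git", "-C", REPO, "commit", "-m", message])
        subprocess.check_call(["git", "-C", REPO, "push"])

    if __name__ == "__main__":
        append_readings([("2013-10-03T00:00Z", 1.42)])        # one example reading
        commit_and_push("Tide readings for %s" % date.today().isoformat())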

Imagine a repository consisting of tide measurements (something we
would like to have and which is difficult to get at now) for a few
hundred sites around the UK: one file per site containing that day's
data. Every day there would be a commit, and yesterday's data gets
archived in the change history. That adds up to a lot of commits, but
probably no more than a big free software project like the Linux
kernel accumulates. But to do time-series analysis on the data you now
have to walk the repository history, which is potentially expensive.
Is this a good idea?
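
A rough sketch of the consumer side, under the same hypothetical
layout and file names as above, shows where the cost comes from:
rebuilding the full series means checking the file out at every
revision that touched it, one git invocation per commit rather than
one sequential read.

    #!/usr/bin/env python
    # walk_history.py -- rebuild a site's time series from git history (illustrative only)
    import csv
    import io
    import subprocess

    REPO = "/srv/data/tides"           # hypothetical local clone
    SITE_FILE = "data/aberdeen.csv"    # hypothetical per-site file

    def commits_touching(path):
        """Commit hashes that changed the file, oldest first."""
        out = subprocess.check_output(
            ["git", "-C", REPO, "rev-list", "--reverse", "HEAD", "--", path],
            universal_newlines=True)
        return out.split()

    def file_at(commit, path):
        """The file's contents as they were at the given commit."""
        return subprocess.check_output(
            ["git", "-C", REPO, "show", "%s:%s" % (commit, path)],
            universal_newlines=True)

    def full_series():
        """Walk every revision and concatenate the daily readings."""
        rows = []
        for sha in commits_touching(SITE_FILE):
            rows.extend(csv.reader(io.StringIO(file_at(sha, SITE_FILE))))
        return rows

    if __name__ == "__main__":
        print(len(full_series()), "rows reconstructed from history")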

-w