[okfn-discuss] APIs vs bulk data, in the face of a government shutdown
William Waites
ww at eris.okfn.org
Thu Oct 3 18:37:11 UTC 2013
On Thu, 3 Oct 2013 17:41:47 +0100, Rufus Pollock <rufus.pollock at okfn.org> said:
> And http://data.okfn.org/ which went live in April runs purely
> off data package datasets stored in git repos on github (at
> http://github.com/datasets)
I think I saw this pass by at the time. I wonder how this would work
for datasets that are volatile. The obvious thing to do when there is a
new revision of the dataset would be to patch the repository with a
normal commit. If the data publishers do this, it makes life very easy
for them, because they only have to transmit a diff and some metadata.
On the other hand, if the data changes frequently, it might push
the limits of git's data structures.
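As a rough sketch of what the publisher side might look like (Python
driving git; the repository path and file names here are made up for
illustration, not part of any existing data package):

    import datetime
    import subprocess

    REPO = "dataset"   # assumed: a local clone of the data package repo

    def git(*args):
        # run a git command inside the repo, failing loudly on error
        subprocess.run(["git", "-C", REPO, *args], check=True)

    today = datetime.date.today().isoformat()
    # ... overwrite data/readings.csv with today's revision of the data ...
    git("add", "data/readings.csv")
    git("commit", "-m", f"update readings for {today}")
    git("push")   # what travels is essentially the diff plus commit metadata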
Imagine a repository consisting of tide measurements (something we
would like to have but which is difficult to get at now) for a few hundred
sites around the UK, with one file per site holding that day's data. Every
day there would be a commit, and yesterday's data would be archived in the
change history. That adds up to a lot of commits, but probably not more
than a big free software project like the Linux kernel. But to do
time-series analysis on the data you now have to walk the repository
history, which is potentially expensive. Is this a good idea?
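To make that cost concrete: reconstructing the series for one site means
visiting every commit that touched its file and reading the blob as it was
at that commit. A sketch in Python, assuming a hypothetical layout with one
file per site such as sites/aberdeen.csv:

    import subprocess

    REPO = "tides"                     # assumed local clone
    SITE_FILE = "sites/aberdeen.csv"   # hypothetical per-site file

    def run(*args):
        return subprocess.run(["git", "-C", REPO, *args],
                              capture_output=True, text=True,
                              check=True).stdout

    # one line per commit that changed the file: "<sha> <iso-date>"
    log = run("log", "--format=%H %cI", "--", SITE_FILE).splitlines()

    series = []
    for line in log:
        sha, date = line.split(" ", 1)
        # one 'git show' per commit: this is the potentially expensive walk
        blob = run("show", f"{sha}:{SITE_FILE}")
        series.append((date, blob))

    print(f"reconstructed {len(series)} daily snapshots of {SITE_FILE}")

So every point in the time series costs a separate lookup through the
object store, which is exactly the part I am unsure scales well.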
-w