[okfn-discuss] Public mirrors of Open Content archives

Tue Apr 8 10:48:17 UTC 2008

On 07/04/08 18:06, Evan Prodromou wrote:
> My name is Evan Prodromou, and I'm the founder of a wiki Web site called
> Vinismo.com. Vinismo is the Free wine guide -- we have more than 20,000
> articles on wines, wineries, and wine regions around the world.
> 
>     http://vinismo.com/

Really nice -- I took a look yesterday after seeing it entered on CKAN.

> Our site is available under the Creative Commons Attribution-ShareAlike
> Canada 2.5 license. We make full-site data dumps available in HTML and
> XML format, as well as tarballs for our image files:
> 
>        http://vinismo.com/download/

This is great. And makes you ultra-compliant with the OK/DD 
(http://opendefinition.org/) :)

> We also build some RDF data about wines (it's a red wine, it contains
> such-and-such a grape, it costs so many dollars in the US or Australia
> or Canada or wherever, etc.), which can be derived from the above dumps,
> but which we'll soon be providing in a separate data dump, too.
> 
> My plan is to have vinismo.com alive and well forever and ever. But
> plans don't always work out as we'd like. It would be great for my peace
> of mind, and for all our contributors, to know that our data was being
> archived somewhere safe and independent from the Vinismo site. Making Open
> Content, it's important to make sure it sticks around for someone else
> to use.

Absolutely. This is a major issue and something that's been thought 
about for a while -- though largely in relation to general questions of 
distributed data and data distribution.

> Let me be clear that my bandwidth and storage needs are fine, and that I
> of course have redundant off-site backups. This kind of mirroring is more
> of a social issue than a technical one. Mirrors like this are common in
> the Free Software world.

Yes. The traditional issue with 'knowledge', be it content, data or 
otherwise, is its size relative to software: most software projects run 
to a few MB of source and even the large ones rarely get above a GB. 
By contrast, content can easily get very large, and even textual content 
can exceed a GB pretty easily (Freebase's full WEX dump of Wikipedia 
is 8.4GB compressed, for example).

> I'd hoped that this was what CKAN was about, but apparently it's more of a
> directory than an archive network. Is there any service out there that
> mirrors Open Knowledge data sets?

CKAN is about that, but indirectly. If you think about what you need for 
data componentization (or federation and replication, which is more what 
you are interested in), you first need a registry of some kind 
that stores the basic metadata and acts as a lookup hub (note that this 
service itself can be decentralized and replicated). That is what CKAN 
is for.

CKAN, as you have correctly surmised, is not for holding the dumps 
themselves. Instead it provides the 'download_url' field, which could 
point to a page with links (perhaps to the various mirrors), a single 
downloadable file, a torrent file, etc. (What we need next is code to 
interact with CKAN in the way that easy_install/setuptools does with 
PyPI, or perl does with CPAN. This is currently in progress as 
datapkg [1]).

[1]: http://knowledgeforge.net/ckan/svn/datapkg/trunk/
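
To make the registry idea concrete, here is a minimal sketch of a lookup hub that stores only metadata and hands back a 'download_url'. It is not the CKAN or datapkg API: the PackageRecord and Registry classes, the classify_download_url helper and its suffix list are all hypothetical names invented for illustration. Only the 'download_url' field and the Vinismo download address come from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class PackageRecord:
    """Hypothetical registry entry; only 'download_url' is a field named above."""
    name: str
    download_url: str


class Registry:
    """In-memory stand-in for a metadata registry (not the real CKAN API)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # The registry stores metadata only, never the dump itself.
        self._records[record.name] = record

    def lookup(self, name):
        return self._records[name]


def classify_download_url(url):
    """Guess what a download_url points to, based only on its suffix."""
    lowered = url.lower()
    if lowered.endswith(".torrent"):
        return "torrent"
    if lowered.endswith((".tar.gz", ".tgz", ".zip", ".bz2", ".gz", ".xml")):
        return "file"
    # Anything else is assumed to be a page of links (e.g. a list of mirrors).
    return "page"


registry = Registry()
registry.register(PackageRecord("vinismo", "http://vinismo.com/download/"))
record = registry.lookup("vinismo")
print(record.name, classify_download_url(record.download_url))
```

A client in the style of easy_install would do the lookup first and then decide how to fetch based on what the URL points to; the suffix check here is only a placeholder for that decision.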

However, none of these points answers your question, which is: where can I 
dump backup files of the project? Or, to put it in a wider perspective: 
where are the mirror networks for open knowledge like those that exist 
for F/OSS? The answer at present is: not that many that I know of (inside 
academia there seems to be some degree of work on this in relation to 
Grid-type activities). The two immediate options I can think of are:

   * http://www.archive.org/ It appears that the Internet Archive are
     currently happy to host datasets and are pretty unlimited in capacity.
   * http://knowledgeforge.net/ This is a complementary project to CKAN,
     also run by us. It was primarily designed for actual project
     development, but you could use it simply to store dumps (i.e. by
     uploading via DAV or via SVN). The storage on this is not unlimited.

Finally, given the lack of anything that exactly fits what is needed, 
maybe we should think about putting something simple together that can 
aggregate storage (everybody gives 10GB ...) and thereby provide a 
simple mirror network. There seem to be plenty of basic off-the-shelf 
components that could be used for this (see [2] [3] for summary and links).
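
As a rough illustration of the pooled-storage idea, the sketch below places each dataset on a fixed number of volunteer hosts, always choosing the hosts with the most free space. Everything in it is an assumption made for the example: the assign_mirrors function, the host names, the choice of two copies and the 1.0GB size for a Vinismo dump are hypothetical. Only the 10GB pledge and the 8.4GB WEX dump size are taken from this thread.

```python
def assign_mirrors(pledges_gb, datasets_gb, copies=2):
    """Greedy placement: put each dataset on the `copies` hosts with most free space.

    pledges_gb:  {host: pledged capacity in GB}
    datasets_gb: {dataset: size in GB}
    Returns {dataset: [hosts]}; raises ValueError if a dataset cannot be placed.
    """
    free = dict(pledges_gb)
    placement = {}
    # Place the largest datasets first, since they are the hardest to fit.
    for name, size in sorted(datasets_gb.items(), key=lambda kv: -kv[1]):
        candidates = sorted(
            (host for host, space in free.items() if space >= size),
            key=lambda host: (-free[host], host),  # most free space, then name
        )
        if len(candidates) < copies:
            raise ValueError(f"not enough space for {copies} copies of {name}")
        chosen = candidates[:copies]
        for host in chosen:
            free[host] -= size
        placement[name] = chosen
    return placement


# Hypothetical volunteers, each giving 10GB.
pledges = {"alice": 10, "bob": 10, "carol": 10}
# 8.4GB is the WEX dump size quoted above; 1.0GB is an invented figure.
datasets = {"wex-dump": 8.4, "vinismo-xml": 1.0}
placement = assign_mirrors(pledges, datasets)
print(placement)
```

A real mirror network would also need to move the data and verify copies; this only covers deciding who holds what.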

~rufus

[2]: http://okfn.org/wiki/DataDistribution
[3]: http://okfn.org/wiki/ToolsWeNeed