[okfn-discuss] Public mirrors of Open Content archives
Rufus Pollock
rufus.pollock at okfn.org
Tue Apr 8 10:48:17 UTC 2008
On 07/04/08 18:06, Evan Prodromou wrote:
> My name is Evan Prodromou, and I'm the founder of a wiki Web site called
> Vinismo.com. Vinismo is the Free wine guide -- we have more than 20,000
> articles on wines, wineries, and wine regions around the world.
>
> http://vinismo.com/
Really nice -- I took a look yesterday after seeing it entered on CKAN.
> Our site is available under the Creative Commons Attribution-ShareAlike
> Canada 2.5 license. We make full-site data dumps available in HTML and
> XML format, as well as tarballs for our image files:
>
> http://vinismo.com/download/
This is great. And makes you ultra-compliant with the OK/DD
(http://opendefinition.org/) :)
> We also build some RDF data about wines (it's a red wine, it contains
> such-and-such a grape, it costs so many dollars in the US or Australia
> or Canada or wherever, etc.), which can be derived from the above dumps,
> but which we'll soon be providing in a separate data dump, too.
>
> My plan is to have vinismo.com alive and well forever and ever. But
> plans don't always work out as we'd like. It would be great for my peace
> of mind, and for all our contributors, to know that our data was being
> archived somewhere safe and independent from the Vinismo site. Making Open
> Content, it's important to make sure it sticks around for someone else
> to use.
Absolutely. This is a major issue and something that's been thought
about for a while -- though largely in relation to general questions of
distributed data and data distribution.
> Let me be clear that my bandwidth and storage needs are fine, and that I
> of course have redundant off-site backups. This kind of mirroring is
> more of a social issue than a technical one. Mirrors like this are
> common in the Free Software world.
Yes. The traditional issue with 'knowledge', be it content, data or
otherwise, is its size relative to software: most software projects run
to a few megabytes of source and even the large ones rarely get above a
GB. By contrast, content can easily get very large, and even textual
content can exceed a GB pretty easily (Freebase's full WEX dump of
Wikipedia is 8.4GB compressed, for example).
> I'd hoped that this was what CKAN was about, but apparently it's more of a
> directory than an archive network. Is there any service out there that
> mirrors Open Knowledge data sets?
CKAN is about that, but indirectly. If you think about what you need for
data componentization (or federation and replication, which is more what
you are interested in), you first need a registry of some kind that
stores the basic metadata and acts as a lookup hub (note that this
service itself can be decentralized and replicated). That is what CKAN
is for.
CKAN, as you have correctly surmised, is not for holding the dumps
themselves but instead provides the 'download_url' field, which could
point to a page with links (perhaps to the various mirrors), a single
downloadable file, a torrent file, etc. (What we need next is code to
interact with CKAN in the way that easy_install/setuptools does with
PyPI, or Perl does with CPAN. This is currently in progress as
datapkg [1]).
[1]: http://knowledgeforge.net/ckan/svn/datapkg/trunk/
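To make the 'download_url' idea a bit more concrete, here is a rough sketch
of how a client might decide what a registry entry points at. Everything in
it is an assumption for illustration: the record layout (only 'download_url'
is a real field name from above), the function name and the suffix rules are
made up, and this is not the CKAN or datapkg API.

```python
from urllib.parse import urlparse

# Hypothetical list of suffixes treated as directly downloadable dumps.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".xml", ".gz")


def classify_download_url(url):
    """Guess what a registry entry's download_url points at.

    Returns 'torrent', 'file' or 'page'. The rules are illustrative only.
    """
    path = urlparse(url).path.lower()
    if path.endswith(".torrent"):
        return "torrent"
    if path.endswith(ARCHIVE_SUFFIXES):
        return "file"
    # Anything else is assumed to be a page listing links or mirrors.
    return "page"


# Hypothetical registry record; the 'name' key is an assumption.
record = {"name": "vinismo", "download_url": "http://vinismo.com/download/"}
kind = classify_download_url(record["download_url"])
```

A client could then fetch a 'file' directly, hand a 'torrent' to a
BitTorrent client, and fall back to showing a 'page' to the user.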
However, none of these points answers your question, which is: where can
I dump backup files of the project? Or, to put it in a wider perspective:
where are the mirror networks for open knowledge like those that exist
for F/OSS?
The answer at present is not that many that I know of (inside academia
there seems to be some degree of work on this in relation to Grid-type
activities). The two immediate options I can think of are:
* http://www.archive.org/ It appears that the Internet Archive are
currently happy to host datasets and are pretty unlimited in capacity.
* http://knowledgeforge.net/ This is a complementary project to CKAN
also run by us. It was primarily designed for actual project development
but you could use it simply to store dumps (i.e. uploading via DAV or
via SVN). The storage on this is not unlimited.
Finally, given the lack of anything that exactly fits what is needed
maybe we should think about putting something simple together which can
aggregate storage (everybody gives 10GB ...) and thereby provide a
simple mirror network. There seem to be plenty of basic off-the-shelf
components that could be used for this (see [2] [3] for summary and links).
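As a rough illustration of the "everybody gives 10GB" idea, the sketch below
places each dataset on a fixed number of distinct mirrors, largest dataset
first, always choosing the mirrors with the most free space. The function
name, the greedy strategy, the mirror names and the 1.0GB Vinismo figure are
all assumptions made up for the example; only the 8.4GB WEX size comes from
above.

```python
def assign_mirrors(datasets, mirrors, copies=2):
    """Place each dataset on `copies` distinct mirrors.

    datasets: {name: size_gb}, mirrors: {name: donated_gb}.
    Returns {dataset_name: [mirror_name, ...]}.
    Greedy and illustrative only: largest datasets are placed first,
    each on the mirrors that currently have the most free space.
    """
    free = dict(mirrors)  # copy so the caller's dict is not modified
    placement = {}
    for name, size in sorted(datasets.items(), key=lambda kv: -kv[1]):
        # Mirrors that can still hold this dataset, most free space first
        # (ties broken by mirror name so the result is deterministic).
        candidates = sorted(
            (m for m in free if free[m] >= size),
            key=lambda m: (-free[m], m),
        )
        chosen = candidates[:copies]
        if len(chosen) < copies:
            raise ValueError("not enough mirror capacity for %r" % name)
        for m in chosen:
            free[m] -= size
        placement[name] = chosen
    return placement


# Three hypothetical mirrors donating 10GB each.
mirrors = {"a": 10, "b": 10, "c": 10}
# WEX size is from the text above; the Vinismo size is invented.
datasets = {"wex": 8.4, "vinismo": 1.0}
placement = assign_mirrors(datasets, mirrors, copies=2)
```

A real network would also need to handle mirrors joining and leaving, and
checking that copies are intact, which this sketch ignores.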
~rufus
[2]: http://okfn.org/wiki/DataDistribution
[3]: http://okfn.org/wiki/ToolsWeNeed