[okfn-discuss] Open Economics and Data Packaging: some thoughts and questions
Rufus Pollock
rufus.pollock at okfn.org
Fri Jan 11 14:35:13 UTC 2008
One part of our Open Economics project (http://www.openeconomics.net/)
involves maintaining a data store:
http://www.openeconomics.net/store/
Corresponding data is stored in subversion:
http://knowledgeforge.net/econ/svn/trunk/data/
Actually having to store some data (in a simple way) has been a great
exercise in terms of thinking about how one does this kind of thing. In
keeping with the KISS approach, the basic structure of the data sets (or
'bundles' as I have termed them) is:
* metadata.txt: follows ini style conventions
* data itself stored either:
  * plain csv: data.csv
  * and/or script file of some kind: data.py (which helps generate
    data.csv in some way via downloading, parsing etc etc).
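As a rough illustration of the layout above, a bundle can be read with nothing beyond the Python standard library. This is only a sketch: the function names `load_bundle` and `write_example_bundle`, the `[bundle]` section name and the metadata keys (`title`, `source`) are assumptions made for the example, not the actual Open Economics code or metadata schema, and the sample values are made up.

```python
import configparser
import csv
import os
import tempfile


def load_bundle(path):
    """Return (metadata, rows) for a bundle directory.

    Assumes metadata.txt is ini-style with at least one section and that
    data.csv starts with a header row (both are illustrative assumptions).
    """
    parser = configparser.ConfigParser()
    parser.read(os.path.join(path, "metadata.txt"), encoding="utf-8")
    # Flatten every ini section into a plain dict of dicts.
    metadata = {name: dict(parser[name]) for name in parser.sections()}
    data_path = os.path.join(path, "data.csv")
    with open(data_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return metadata, rows


def write_example_bundle(path):
    """Create a tiny made-up bundle so the sketch can be run end to end."""
    with open(os.path.join(path, "metadata.txt"), "w", encoding="utf-8") as f:
        f.write("[bundle]\ntitle = Example series\nsource = made up\n")
    data_path = os.path.join(path, "data.csv")
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "value"])
        writer.writerow(["2000", "1.5"])
        writer.writerow(["2001", "1.7"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_example_bundle(tmp)
        meta, rows = load_bundle(tmp)
        print(meta["bundle"]["title"], len(rows))
```

A data.py script in a bundle would then only need to produce data.csv in this shape; everything downstream can stay format-agnostic.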
To illustrate, one dataset we've recently been working to include is that
from the Millennium Development Goals project
<http://mdgs.un.org/unsd/mdg/>
The corresponding (still not complete) Open Economics dataset is at:
<http://knowledgeforge.net/econ/svn/trunk/data/mdg/>
and can be browsed from the store browser at (note data and plot links
won't work because they would 'barf' if trying to display a 9mb csv):
<http://www.openeconomics.net/store/d0b2917e-bfb8-40be-8819-5867f155c1a3>
Below I detail some thoughts and questions from dealing with this
particular item. I know this is a bit of thinking out loud but I'd
appreciate any comments/ideas -- after all people have been thinking
about data storage for as long as they have been using computers ...
Regards,
Rufus
Issues raised by the MDG data
=============================
1. MDG Data has at least 3 basic dimensions:
* Country
* Series
* Time
Can deal with this by taking one dimension as primary (e.g. series)
though this has the cost that it is harder to divide up in other ways
(e.g. show me these 3 series for this specific country).
2. Values are slightly complex in that they have a type (related to how
reliable the estimates are and how they were obtained) and (potentially) an
associated footnote.
* since footnote and type are not directly relevant most of the time
perhaps we can just ignore them (made easier once we have normalized).
3. The number of countries (~241) makes display a bit of an issue.
* in the web interface this should be dealt with simply by restricting
the number of rows we display to a sample (say 50).
4. There are also lots of blank values for many series for many countries.
* again not so much of a problem once we normalize
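To make the 'normalize' idea in the points above concrete, here is a minimal sketch. It assumes a 'wide' csv with one row per country and series and one column per year; that layout, the column names `Country` and `Series`, the function names and the sample values are all assumptions for illustration and are not taken from the actual MDG file. Each non-blank cell becomes one (country, series, year, value) record, so blanks simply disappear and slicing by any dimension is a plain filter.

```python
import csv
import io
from itertools import islice

# Made-up sample in the assumed wide layout; blanks mean 'no value reported'.
WIDE_CSV = """Country,Series,1990,2000,2005
A,s1,,2.0,
A,s2,,98.5,99.0
B,s2,82.0,,96.5
"""

ID_COLUMNS = ("Country", "Series")


def normalize(wide_file):
    """Yield one (country, series, year, value) record per non-blank cell."""
    for row in csv.DictReader(wide_file):
        for column, cell in row.items():
            if column in ID_COLUMNS or not cell or not cell.strip():
                continue  # skip identifier columns and blank values
            yield (row["Country"], row["Series"], int(column), float(cell))


def select(records, country=None, series=None):
    """Filter records by country and/or by a collection of series names."""
    for rec in records:
        if country is not None and rec[0] != country:
            continue
        if series is not None and rec[1] not in series:
            continue
        yield rec


def sample(records, limit=50):
    """Return at most `limit` records, e.g. for display in a web page."""
    return list(islice(records, limit))


if __name__ == "__main__":
    records = list(normalize(io.StringIO(WIDE_CSV)))
    for rec in sample(select(records, country="A", series={"s1", "s2"})):
        print(rec)
```

With the data in this form, "show me these 3 series for this specific country" is just a call to `select`, and no dimension has to be treated as primary.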
### How to Design (Sub-)Data Bundles ...
One way to deal with the massive amount of data in the single csv file
would be to create 'sub-bundles' corresponding to particular slices
through the data. E.g. could create bundles for each country or each
time series (or even each country by each time series).
But is this a good idea? Perhaps one should just have some script that
can generate such things 'on-demand' though one would still need to
register somewhere what particular 'bundles' could be built on demand ...
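One possible shape for the 'on-demand' option is a small registry of slice definitions plus a function that builds the csv for a given slice only when asked. The sketch below is hypothetical throughout: the registry `SLICES`, the slice names, the function names and the sample records are all invented for illustration.

```python
import csv
import io

# Normalized records as (country, series, year, value); values are made up.
RECORDS = [
    ("A", "s1", 2000, 2.0),
    ("A", "s2", 2000, 98.5),
    ("B", "s2", 1990, 82.0),
]

# Registry of slices that can be built on demand: slice name -> function
# that extracts the slicing key from a record.
SLICES = {
    "by-country": lambda rec: rec[0],
    "by-series": lambda rec: rec[1],
    "by-country-and-series": lambda rec: (rec[0], rec[1]),
}


def available_bundles(records):
    """List every (slice name, key) pair that could be generated."""
    pairs = {
        (name, key_of(rec))
        for name, key_of in SLICES.items()
        for rec in records
    }
    # Keys mix strings and tuples, so sort on their string form.
    return sorted(pairs, key=lambda pair: (pair[0], str(pair[1])))


def build_bundle(records, slice_name, key):
    """Return csv text for one sub-bundle, generated only when requested."""
    key_of = SLICES[slice_name]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["country", "series", "year", "value"])
    writer.writerows(rec for rec in records if key_of(rec) == key)
    return out.getvalue()


if __name__ == "__main__":
    for name, key in available_bundles(RECORDS):
        print(name, key)
    print(build_bundle(RECORDS, "by-country", "A"))
```

The registry answers the "what could be built" question without storing any of the sub-bundles themselves; whether generated results should then be cached is a separate decision.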
Another alternative would be to load said data into a db (after all that
is where it came from and that is the easiest way to deal with getting
multiple views on the same dataset). The question then is whether one can
do this in a way that preserves the ability for the data to be provided in
a simple (easily usable) form to other users (and how do we load the data?
On request or permanently into some db or ...).
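As a sketch of the db route that still hands users plain csv, the following loads normalized records into SQLite (in the Python standard library) and exports one view back out as csv text. The table name `observation`, its columns, the function names and the sample rows are assumptions for illustration; an in-memory database stands in for the 'on request' case, and a file path would give the permanent one.

```python
import csv
import io
import sqlite3


def load(conn, records):
    """Insert (country, series, year, value) records into a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation "
        "(country TEXT, series TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", records)
    conn.commit()


def export_country(conn, country):
    """Return all observations for one country as csv text with a header."""
    cursor = conn.execute(
        "SELECT country, series, year, value FROM observation "
        "WHERE country = ? ORDER BY series, year",
        (country,),
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to keep the db
    load(conn, [
        ("A", "s1", 2000, 2.0),   # made-up values
        ("B", "s2", 1990, 82.0),
        ("A", "s2", 2005, 99.0),
    ])
    print(export_country(conn, "A"))
```

Other views (by series, by country and series) would be further queries of the same kind, so the csv form remains the public interface and the db stays an implementation detail.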
Related to this is what one would want to do if one wanted to provide more
functionality than pure dataset browsing or visualization, e.g. a plugin
which would manipulate a given dataset for the web user -- though I am
still not sure how one would manipulate a 9mb csv file.
### What Was the Aim?
But all of this adds significant complexity. What is the aim of
openeconomics.net? At present I think its core aim is to provide a simple
repository for datasets. In this regard the web interface is there
simply to provide:
* A dataset browser.
* Persistent urls.
* [?] Simple way to upload and download data.
Providing complex data analysis tools etc is not central to this but is
definitely a possible extension.
### Left-over questions
* How can we integrate with OpeNDAP (and pydap)? It would seem an
obvious option.
* The PlotKit javascript graphing library no longer seems to be actively
developed. What are the other good js graphing libraries?