[okfn-discuss] Open Economics and Data Packaging: some thoughts and questions

Rufus Pollock rufus.pollock at okfn.org
Fri Jan 11 14:35:13 UTC 2008


One part of our Open Economics project (http://www.openeconomics.net/) 
involves maintaining a data store:

   http://www.openeconomics.net/store/

Corresponding data is stored in subversion:

   http://knowledgeforge.net/econ/svn/trunk/data/

Actually having to store some data (in a simple way) has been a great 
exercise in terms of thinking about how one does this kind of thing. In 
keeping with the KISS approach, the basic structure of the data sets (or 
'bundles' as I have termed them) is:

   * metadata.txt: follows ini style conventions
   * data itself stored either as:
     * plain csv: data.csv
     * and/or a script file of some kind: data.py (which helps generate 
       data.csv in some way via downloading, parsing, etc.).
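As a rough sketch of how such a bundle might be read, assuming metadata.txt keeps its keys in the ini DEFAULT section (the key names, the sample values and the helper name load_bundle below are illustrative assumptions, not the actual store format):

```python
import configparser
import csv
import io

# Hypothetical bundle contents; section and key names are assumptions.
METADATA_TXT = """\
[DEFAULT]
id = example-bundle
title = Example bundle
source = http://example.org/
"""

DATA_CSV = """\
year,value
2000,1.5
2001,1.7
"""


def load_bundle(metadata_text, csv_text):
    """Return (metadata dict, list of row dicts) for one bundle."""
    parser = configparser.ConfigParser()
    parser.read_string(metadata_text)
    metadata = dict(parser.defaults())  # keys of the DEFAULT section
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return metadata, rows


metadata, rows = load_bundle(METADATA_TXT, DATA_CSV)
print(metadata["title"], len(rows))
```

In a real bundle the two strings would simply be the contents of metadata.txt and data.csv read from the bundle directory.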

To illustrate: one dataset we've recently been working to include is that 
from the Millennium Development Goals project:

   <http://mdgs.un.org/unsd/mdg/>

The corresponding (still not complete) Open Economics dataset is at:

   <http://knowledgeforge.net/econ/svn/trunk/data/mdg/>

and can be browsed from the store browser at (note data and plot links 
won't work because they would 'barf' if trying to display a 9mb csv):

<http://www.openeconomics.net/store/d0b2917e-bfb8-40be-8819-5867f155c1a3>

Below I detail some thoughts and questions from dealing with this 
particular item. I know this is a bit of thinking out loud but I'd 
appreciate any comments/ideas -- after all people have been thinking 
about data storage for as long as they have been using computers ...

Regards,

Rufus


Issues raised by the MDG data
=============================

1. MDG Data has at least 3 basic dimensions:

   * Country
   * Series
   * Time

Can deal with this by taking one dimension as primary (e.g. series) 
though this has the cost that it is harder to divide up in other ways 
(e.g. show me these 3 series for this specific country).

2. Values are slightly complex in that they have a type (related to how 
reliable the estimates are and how they were obtained) and (potentially) 
an associated footnote.

    * since fn and type are not directly relevant most of the time 
perhaps we can just ignore them (made easier once we have normalized).

3. Number of countries (~241) makes displaying a bit of an issue.

   * in web interface this should be dealt with simply by restricting 
number of rows we display to a sample (say 50).

4. Also lots of blank values for many series for many countries.

   * again not so much of a problem once we normalize
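To make 'normalize' concrete, here is a minimal sketch. It assumes a wide layout with one row per country/series pair and one column per year; that layout, the column names and the values are made up for illustration and are not the real MDG file. The type and footnote fields are left out. Each non-blank cell becomes one (country, series, year, value) record, so blanks vanish and slicing by any dimension is just a filter.

```python
import csv
import io

# Hypothetical wide-format data (made-up values, not real MDG figures).
WIDE_CSV = """\
Country,Series,1990,2000,2005
Country A,Series X,,2.0,
Country A,Series Y,95.1,,94.0
Country B,Series X,14.0,,7.5
"""

KEY_COLUMNS = ("Country", "Series")


def normalize(csv_text):
    """Yield one (country, series, year, value) tuple per non-blank cell."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, cell in row.items():
            if column in KEY_COLUMNS or not (cell or "").strip():
                continue  # skip the key columns and blank values
            yield (row["Country"], row["Series"], int(column), float(cell))


def select(records, country, series_names):
    """Return the records for one country restricted to the given series."""
    wanted = set(series_names)
    return [r for r in records if r[0] == country and r[1] in wanted]


records = list(normalize(WIDE_CSV))
country_a = select(records, "Country A", ["Series X", "Series Y"])
print(len(records), len(country_a))
```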

### How to Design (Sub-)Data Bundles ...

One way to deal with the massive amount of data in the single csv file 
would be to create 'sub-bundles' corresponding to particular slices 
through the data. E.g. could create bundles for each country or each 
time series (or even each country by each time series).

But is this a good idea? Perhaps one should just have some script that 
can generate such things 'on-demand' though one would still need to 
register somewhere what particular 'bundles' could be built on demand ...
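One possible shape for the 'on-demand' option, as a sketch only: a small registry names the slices that can be built, and a function builds one slice as csv text when asked. The registry layout, the slicer names and the normalized column names are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical normalized data: one row per (country, series, year).
LONG_CSV = """\
country,series,year,value
Country A,Series X,2000,2.0
Country A,Series Y,1990,95.1
Country B,Series X,1990,14.0
"""

# Registry of slices that can be built on demand: slicer name -> column.
SLICERS = {"by-country": "country", "by-series": "series"}


def available_bundles(csv_text):
    """List the (slicer, key) pairs that could be built, without building them."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(
        {(name, row[column]) for row in rows for name, column in SLICERS.items()}
    )


def build_bundle(csv_text, slicer, key):
    """Return csv text for the sub-bundle selected by slicer and key."""
    column = SLICERS[slicer]
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] == key:
            writer.writerow(row)
    return out.getvalue()


listing = available_bundles(LONG_CSV)
bundle = build_bundle(LONG_CSV, "by-country", "Country A")
print(listing)
print(bundle)
```

The listing is what a store browser could show without materializing anything; the cost is one pass over the parent csv per request.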

Another alternative would be to load said data into a db (after all that 
is where it came from and that is the easiest way to deal with getting 
multiple views on same dataset). Question then is does one do this in a 
way that preserves the ability for data to be provided in a simple 
(easily usable) form to other users (and how do we load the data? On 
request or permanently into some db or ...).
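A sketch of the db route that keeps the 'simple form' property: load normalized records into SQLite (here an in-memory database from the Python standard library) and export any view straight back to plain csv. The table name, column names and sample records are assumptions for illustration; whether to load on request or keep a permanent db is left open.

```python
import csv
import io
import sqlite3

# Hypothetical normalized records (made-up values).
RECORDS = [
    ("Country A", "Series X", 2000, 2.0),
    ("Country A", "Series Y", 1990, 95.1),
    ("Country A", "Series Y", 2005, 94.0),
    ("Country B", "Series X", 1990, 14.0),
]


def load(records):
    """Create an in-memory database holding the records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation "
        "(country TEXT, series TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", records)
    return conn


def export_country(conn, country):
    """Return all series for one country as plain csv text."""
    cursor = conn.execute(
        "SELECT series, year, value FROM observation "
        "WHERE country = ? ORDER BY series, year",
        (country,),
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


conn = load(RECORDS)
country_csv = export_country(conn, "Country A")
print(country_csv)
```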

Related to this is what one would want to do if one wanted to provide 
more functionality than pure dataset browsing, e.g. a visualization 
plugin which would manipulate a given dataset for the web user -- though 
I am still not sure how one would manipulate a 9mb csv file.
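One partial answer for a large csv, offered as a sketch rather than a solution: read it as a stream so only the rows actually needed are held in memory. The helper names and the generated sample data are assumptions; the limit of 50 echoes the sample size suggested above for the web interface.

```python
import csv
import io
from itertools import islice


def preview(lines, limit=50):
    """Return the header and at most `limit` data rows from csv lines."""
    reader = csv.reader(lines)
    header = next(reader, [])
    return header, list(islice(reader, limit))


def count_rows(lines):
    """Count data rows without keeping them in memory."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


# Generated stand-in for a large file; an open file object works the same way.
text = "country,value\n" + "".join(f"Country {i},{i}\n" for i in range(200))

header, sample = preview(io.StringIO(text), limit=50)
total = count_rows(io.StringIO(text))
print(header, len(sample), total)
```

This covers browsing and simple aggregates; anything needing random access or joins probably points back to the db option.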

### What Was the Aim?

But all of this adds significant complexity. What is the aim of 
openeconomics.net? At present I think its core aim is to provide a simple 
repository for datasets. In this regard the web interface is there 
simply to provide:

   * A dataset browser.
   * Persistent urls.
   * [?] Simple way to upload and download data.

Providing complex data analysis tools etc is not central to this but is 
definitely a possible extension.

### Left-over questions

   * How can we integrate with OpeNDAP (and pydap)? It would seem an 
obvious option.
   * PlotKit javascript graphing library no longer seems actively 
developed. What are the other good js graphing libraries?



