[wdmmg-dev] Mark's AidData

Mark Brough mark.brough at publishwhatyoufund.org
Mon Jun 20 17:56:47 UTC 2011


Hi Friedrich

> Wow that's one impressive thing - but very nice, although it could of
> course be 7 functions ;)

Yeah, I think it should certainly be 7 functions! Also it shouldn't write to
the database several times for every activity (Rails is struggling with a new
write for each activity*~50-100 / transaction*~5 / sectors*~3 / policy
markers*~2, even without countries / organisations) but maybe do bulk
insertions instead. I think I'm guilty of consistent and massive violations
of DRY. I think this might help:
https://github.com/theAlmanac/crewait/blob/master/lib/crewait.rb
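
As a rough illustration of the bulk-insertion idea (in Python with the
standard sqlite3 module rather than Rails; the table, columns and sample
identifiers are made up for the example), rows could be collected while
parsing and written in one go:

```python
import sqlite3


def bulk_insert_transactions(conn, rows):
    """Insert all rows with one executemany call and a single commit."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO transactions (activity_id, value) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (activity_id TEXT, value REAL)")

# Hypothetical rows gathered during parsing, written once at the end.
rows = [("GB-1-0001", 100.0), ("GB-1-0001", 250.5), ("GB-1-0002", 75.0)]
bulk_insert_transactions(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

The point is only that there is one statement and one commit per batch
instead of one per row; the Rails equivalent would depend on the bulk-insert
library chosen.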

> see the attached script that runs against the German federal budget to add
> classification details and colors

Your normalization function looks good (to the extent I understand it!). I'm
trying to think about how normalisation would work with countries and
regions. The way I've thought about it so far is:

  - If the country field exists in the IATI data
  -- if that country is in the DB, get its ID as @countryregion
  -- otherwise, write a new row and get that row ID as @countryregion
  - Elsif the region exists in the IATI data
  -- if that region is in the DB, get its ID as @countryregion
  -- otherwise, write a new row and get that row ID as @countryregion

  --> Set the activity[:countryregion] field as @countryregion

But maybe there is a better way. Organisations would be similar.
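
The steps above could be sketched like this (Python and sqlite3 rather than
Rails; the single countryregions table with a "kind" column, and the function
names, are assumptions made for the example, not how the importer is
currently built):

```python
import sqlite3


def get_or_create(conn, kind, name):
    """Return the id of the (kind, name) row, inserting it if missing."""
    row = conn.execute(
        "SELECT id FROM countryregions WHERE kind = ? AND name = ?",
        (kind, name),
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO countryregions (kind, name) VALUES (?, ?)", (kind, name)
    )
    return cur.lastrowid


def resolve_countryregion(conn, activity):
    """Prefer the country, fall back to the region, else return None."""
    for kind in ("country", "region"):
        if activity.get(kind):
            return get_or_create(conn, kind, activity[kind])
    return None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countryregions (id INTEGER PRIMARY KEY, kind TEXT, name TEXT)"
)

first = resolve_countryregion(conn, {"country": "Uganda"})
second = resolve_countryregion(conn, {"region": "East Africa"})
# Country wins when both are present, and no duplicate row is written.
third = resolve_countryregion(conn, {"country": "Uganda", "region": "East Africa"})
```

Organisations could reuse get_or_create with their own table in the same
way.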

Re flattening transactions and activities - I'll ask around about this.
a) we can always include activity data in a column as activity_... and
transaction data as transaction_... if we needed to keep both (I think this
is what you're saying?)
b) thinking a little bit about whether the transaction is in any way a
problematic unit of analysis.
* I think it is likely that the data quality will improve and that data
about the same activities will become more granular over time. If we're
updating data, what are people then referring to if the transactions,
rather than being e.g. "Total disbursement for Q1 2011", have much more
detail (disbursement to contractor X on date Y)? It's the same activity, but
not the same transaction. Is that a problem?
* How do we deal with related activities? (I guess just have activity_id in
each transaction and then a related_activities table, and link transactions
together as if they were activities... if that makes sense??)
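
To make option (a) concrete, here is a small sketch of the flattening with
prefixed columns (Python; the dictionary layout and the sample values are
invented for illustration and are not the real import structure):

```python
def flatten(activity):
    """Yield one flat row per transaction, keeping both sets of fields."""
    shared = {
        f"activity_{key}": value
        for key, value in activity.items()
        if key != "transactions"
    }
    for transaction in activity["transactions"]:
        row = dict(shared)  # copy so each row is independent
        row.update({f"transaction_{k}": v for k, v in transaction.items()})
        yield row


# Hypothetical activity with two transactions.
activity = {
    "id": "GB-1-0001",
    "title": "Rural schools",
    "transactions": [
        {"type": "disbursement", "value": 100.0},
        {"type": "commitment", "value": 500.0},
    ],
}
rows = list(flatten(activity))
```

Every row carries activity_id, so a separate related_activities table could
still link rows back together at the activity level.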

That way of updating sounds sensible. Presumably it would be able to update
existing entities as well?

CSV -> IATI: I'll have a think about this then. Would be an amazing tool to
show people the value of opening up their data - here you go, we've put
together all the aid information we could find about you from your various
different websites and here's a nice map :). Also, I think we should try and
push as much as possible through IATI-XML: create a generic tool to convert
CSV->IATI-XML and then more customized tools to convert IATI-XML to whatever
type of setup each system needs. I think this should hopefully minimize the
amount of coding necessary?
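
A very rough sketch of what the generic CSV->IATI-XML step might look like
(Python standard library only; the CSV headers and the mapping are made up,
and only a few basic element names from the IATI standard are used, so a
real converter would need attributes, codelists and validation on top):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from a publisher's CSV headers to IATI element names.
MAPPING = {"Project ID": "iati-identifier", "Project name": "title"}


def csv_to_iati(csv_text, mapping):
    """Build an <iati-activities> document with one activity per CSV row."""
    root = ET.Element("iati-activities")
    for record in csv.DictReader(io.StringIO(csv_text)):
        activity = ET.SubElement(root, "iati-activity")
        for column, element in mapping.items():
            if record.get(column):  # skip empty cells
                ET.SubElement(activity, element).text = record[column]
    return ET.tostring(root, encoding="unicode")


sample = "Project ID,Project name\nEE-1,Rural schools\nEE-2,\n"
xml_out = csv_to_iati(sample, MAPPING)
```

Keeping the mapping as plain data is what would let users define it through
an upload form instead of code.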

Thanks for advice re OpenSpending - will have another look at it this
evening.

Cheers
Mark


-----Original Message-----
From: friedrich.lindenberg at gmail.com [mailto:friedrich.lindenberg at gmail.com]
On Behalf Of Friedrich Lindenberg
Sent: 20 June 2011 13:04
To: Mark Brough
Cc: wdmmg-dev at lists.okfn.org
Subject: Re: [wdmmg-dev] Mark's AidData

Hi Mark,

On Mon, Jun 20, 2011 at 12:26 PM, Mark Brough
<mark.brough at publishwhatyoufund.org> wrote:
> Great. Actually, this is good timing, as I got confused about what I
> was trying to do with normalising the countries and organisations and
> ended up with an almost infinite process. So I'm going to try and
> re-work most of the import controller. But I'm also starting to feel
> like me building a complicated relational database is maybe not really
> worth it to just show some nice pictures...
>
> Wednesday sounds good. I'm going to Budapest on Wednesday evening but
> most of the day would be fine.
>
> CSV Mapping
> I don't have the CSV mapping, I just parse the XML directly into the
> database in a massive function, you can see it here:
> https://github.com/markbrough/IATI-Data/blob/master/app/controllers/iatiregistry_controller.rb

Wow that's one impressive thing - but very nice, although it could of course
be 7 functions ;)

> I didn't try to import the IATI data into OpenSpending because I
> couldn't get existing packages to work, so I figured creating my own
> would be even less likely! But I'll have a think about:
> a) What else would need to be added to or changed in iati2csv (and your
> mapping) to make it complete and hopefully work for DFID/WB/any future
> IATI data

Basically we'd need to implement a kind of extension stage where regions and
sectors are normalized (I'm usually using Google Docs for this kind of
thing, see the attached script that runs against the German federal budget
to add classification details and colors).

> b) If there's any information in an activity which is not shared by
> all the transactions - I think there might be but not sure. And also,
> whether this matters.

I haven't seen that yet, but it would be very interesting if it did in fact
exist. It wouldn't be impossible to fix though (I'm currently overwriting
things like default-aid-type with aid-type when they are specified in the
transaction).
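
A minimal sketch of that override rule (not the actual iati2csv code; the
function name, the dictionary layout and the sample codes are made up for
illustration):

```python
def effective_value(activity, transaction, field):
    """Return the transaction-level value, else the activity-level default."""
    if transaction.get(field) is not None:
        return transaction[field]
    return activity.get(f"default-{field}")


# Hypothetical activity default and two transactions.
activity = {"default-aid-type": "C01"}
with_override = {"aid-type": "B01"}
without_override = {}

overridden = effective_value(activity, with_override, "aid-type")
inherited = effective_value(activity, without_override, "aid-type")
```

Here the first transaction keeps its own aid-type and the second falls back
to the activity's default-aid-type.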

> Import CSV/XML
> I take the point about the maintenance nightmare - although at the
> same time, it would be nice for there to be some way to update
> reasonably easily from the IATI Registry as:
> a) new donors publish (should be another 7 or so by November)
> b) existing donors update their data (DFID last updated about 2 weeks
> ago - I think they do so every month).
> c) I'm thinking about building an example CSV to IATI converter -
> where you upload your aid data (e.g. Estonia/Norway/PEPFAR) and map it
> to IATI fields and it gives it back to you in IATI XML. Is that a good
> idea?

That's a really cool idea - I have some data from EuropeAid and I think we
looked at Spain together. Given a reasonably simple CSV to IATI importer we
might be able to do some of their work for them and thus get a nicer
database and more chances to compare different countries' efforts.

As for updating: once we have the scripts to download, normalize and import,
I think it's very realistic to just combine them into a shell script or
Makefile so they become a consistent pipeline. We're also thinking about
using more sophisticated ETL things or even integrating some of this into
CKAN, but as far as I know none of this is ready yet.

> On the other hand, I guess more manual processing does sound like it
> could be better for tidying up data before import. And there are some
> cool possibilities, like pulling in geo-coded data for each WB project
> (which isn't in their IATI data but it is normally in the Mapping for
> Results data) via this: http://api.worldbank.org/api/projects -- example:
> http://search.worldbank.org/api/projects?qterm=*:*&fl=id,location&countrycode[]=IN&format=json

Amazing :-) Re manual work, I think we do want to reduce this to none as soon
as we know the precise steps that are required on any given dataset - do it
manually once and then automate.

> My errors with OS
> Re my installation of OpenSpending (on Ubuntu 11.04), looking at the
> Uganda dataset, this works fine:
> http://127.0.0.1:5000/dataset/uganda/dimension/from
> http://127.0.0.1:5000/dataset/uganda/dimension/to
>
> This gives error 500 (attached error from the paster and solr consoles):
> http://127.0.0.1:5000/dataset/uganda

This looks like solr was not available - did you set the solr url in your
.ini file?

> (I ran paster load uganda (with some new-ish but not the final data)
> and got no errors, these are the last few lines:
> 2011-06-20 10:32:46,910 INFO  [wdmmg.lib.loader] uganda loaded 11000 in 0.89s
> 2011-06-20 10:32:47,181 INFO  [wdmmg.lib.cubes] compute cube for dataset 'uganda', cube name: 'default', dimensions: 'to, from, swg, sector_objective, year'
> 2011-06-20 10:32:49,786 INFO  [wdmmg.lib.cubes] Done. Took: 2s
>
> I tried paster load cra with the CRA dataset as well and it looked like
> everything was going OK until I got this error:
> IOError: [Errno 2] No such file or directory:
> '/home/pwyf/env/wdmmg/pylons_data/getdata/ukgov-finances-cra/nuts1_population_2006.csv'
> )

Ah I think you may need to run the install_data script that will download
all relevant pieces of data for CRA (like the population statistics
mentioned here)

- Friedrich



