[OpenSpending] Extracting data from PDFs

Thu Dec 20 18:05:08 UTC 2012

 Hi,

> One question for David Cabo: How did you phrase / break down the task
> for the crowdsourcing of the extraction from PDFs (+ factchecking
> afterwards)? This is something we should possibly explore as a
> possible method if people are unable to code / the quality of the docs
> is too bad. Just very keen to know how you framed it and how you
> fact-checked it!
> 
> 

 We wrote a post explaining to people why it was important to have this data in an open format and linking to the spreadsheet [1], and used Twitter to spread it around with a hashtag. It went viral and we got the Senate done in a couple of days. The GDoc spreadsheet (the one for the Congress [2]) had some basic instructions at the top: all the MPs were listed in rows, each next to a link to their PDF, and we asked people to put their name/handle next to an orphan MP and fill in all the columns. There was a 1-to-1 match between form fields and spreadsheet columns. Once they were done with an MP they had to paint his/her name green.

 I had to do the fact-checking myself once the work was done. It took a few hours, but it was manageable (350 MPs). There was some missing stuff and some typos, but overall it was good enough. If we had had a custom-built app we could have asked for the data 2+ times and compared the answers, but we wanted to run a quick experiment and see if Google Docs worked as a poor man's crowdsourcing tool.
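
 To illustrate the idea of asking for the data 2+ times, here is a minimal Python sketch of how independent transcriptions of the same form could be compared. It is only an illustration under assumptions, not code used in this project: the function name `reconcile` and the field names are hypothetical, and it assumes each volunteer's entry arrives as a dict mapping field names to strings.

```python
from collections import Counter

def reconcile(entries):
    """Compare several transcriptions of one form (hypothetical helper).

    entries: list of dicts, one per volunteer, mapping field name -> text.
    Returns (agreed, disputed): fields where at least two volunteers gave
    the same single value, and fields that need a manual check.
    """
    agreed, disputed = {}, {}
    fields = sorted({field for entry in entries for field in entry})
    for field in fields:
        # Ignore leading/trailing whitespace; treat missing fields as empty.
        values = [entry.get(field, "").strip() for entry in entries]
        counts = Counter(v for v in values if v)
        if len(counts) == 1 and sum(counts.values()) >= 2:
            agreed[field] = next(iter(counts))
        else:
            disputed[field] = values
    return agreed, disputed

# Two volunteers transcribe the same (made-up) declaration.
first = {"name": "A. Example", "salary": "1000"}
second = {"name": "A. Example ", "salary": "1500"}
agreed, disputed = reconcile([first, second])
print(agreed)    # fields both volunteers agree on
print(disputed)  # fields to fact-check by hand
```

 With something like this, only the disputed fields would need a manual check instead of every row.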

 I wrote a lessons-learnt post [3]; you can get some of it through Translate. Overall, Google Docs scaled well, and initially we ran successfully without user authentication. But at some point stuff got deleted (accident or attack, I don't know) and then we got spam, so I had to ask people to register. That was more uncomfortable and slowed people down, but it was still OK.

 regards,

/david

[1]: http://derecho-internet.org/node/569
[2]: https://docs.google.com/spreadsheet/ccc?key=0AowzHU9kHzeudHlSemNzcVc2OTRqd05YbnkxdUlhMWc&hl=en_US#gid=0
[3]: http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fblog.probp.org%2Fpost%2F17031929224%2Ftransparencia-con-una-hoja-de-calculo-adoptaunsenador&act=url
