[OpenSpending] Extracting data from PDFs

David Cabo david.cabo at gmail.com
Fri Dec 21 12:47:45 UTC 2012


 Hi,

 We did it back in September 2011; I don't think PyBossa was available at the time. I'd certainly consider it now, as it looks interesting, but I don't know enough about it to have an opinion.

/david

On Friday, December 21, 2012 at 11:14 AM, Michael Bauer wrote:

> Hi,
> 
> This sounds great; I will look at it and recommend it. Did you think
> about using crowdsourcing tools like PyBossa (http://crowdcrafting.org)?
> 
> Michael
> 
> On Thu, Dec 20, 2012 at 07:05:08PM +0100, David Cabo wrote:
> > Hi,
> > 
> > > One question for David Cabo: How did you phrase / break down the task
> > > for the crowdsourcing of the extraction from PDFs (+ factchecking
> > > afterwards)? This is something we should possibly explore as a
> > > possible method if people are unable to code / the quality of the docs
> > > is too bad. Just very keen to know how you framed it and how you
> > > fact-checked it!
> > > 
> > 
> > 
> > We wrote a post explaining to people why it was important to have this data in an open format and linking to the spreadsheet [1], and used Twitter to spread it around with a hashtag. It went viral and we got the Senate done in a couple of days. The GDoc spreadsheet (the one for the Congress [2]) had some basic instructions at the top: basically, all the MPs were listed in rows, each next to a link to their PDF, and we asked people to put their name/handle next to an orphan MP and fill in all the columns. There was a 1-to-1 match between form fields and spreadsheet columns. Once they were done with an MP, they had to paint his/her name green.
> > 
> > I had to do the fact-checking myself once the work was done; it took a few hours, but it was manageable (350 MPs). There was some missing data and some typos, but overall it was good enough. If we had had a custom-built app, we could have asked for the data 2+ times, but we wanted to run a quick experiment to see whether Google Docs worked as a poor man's crowdsourcing tool.
> > 
> > I wrote a lessons-learnt post [3]; you can get some of it through Google Translate. Overall, Google Docs scaled well, and initially we ran successfully without user authentication. But at some point content got deleted (by accident or attack, I don't know) and then we got spam, so I had to ask people to register; that was less comfortable and slowed people down, but it was still OK.
> > 
> > regards,
> > 
> > /david
> > 
> > [1]: http://derecho-internet.org/node/569
> > [2]: https://docs.google.com/spreadsheet/ccc?key=0AowzHU9kHzeudHlSemNzcVc2OTRqd05YbnkxdUlhMWc&hl=en_US#gid=0
> > [3]: http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fblog.probp.org%2Fpost%2F17031929224%2Ftransparencia-con-una-hoja-de-calculo-adoptaunsenador&act=url
> > 
> 
> 
> -- 
> Data Wrangler with the Open Knowledge Foundation (http://OKFN.org)
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
> Twitter: @mihi_tr Skype: mihi_tr
> 
> 
> 

