[open-science] Data Digitizer

Peter Murray-Rust pm286 at cam.ac.uk
Wed Nov 23 08:08:57 UTC 2011


On Wed, Nov 23, 2011 at 5:19 AM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:

> Thanks Jonathan!
>
> Sadly, the data digitiser is currently aimed at aiding manual
> transcription of tabular data from PDFs/images rather than automating the
> process as the second blog describes, which would obviously be very awesome
> but we quickly decided impossible in a day (Dd was hacked together at the
> Open Science Workshop) if not impossible full stop. I get the impression
> with automated digitisation that maintains tabular structure that many have
> tried extremely hard and all have failed thus far, although if anyone knows
> of any open projects that are getting close then let us know!
>

There is no magic bullet. It depends very much on the source. If the
material was written as a table by (say) an Adobe product then it may be
possible to recover it with the same product. This costs money and I don't
know whether it can be run in batch mode. In any case I think it's rare.

Lee Giles and Prasenjit Mitra (CiteSeer, Penn State) have worked on this
and can get up to 80-90% precision/recall. That is with a year's work. If
you have a specific source then heuristics can be applied. If the PDF has
preserved line primitives then it's possible to make progress (I have done
this for chemistry). If it's a bitmap then you have a hamburger.
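
To make the line-primitive case concrete, here is a minimal sketch in
Python. It is an illustration only, not the code used by anyone mentioned
above. It assumes the ruling lines and the text fragments have already been
pulled out of the PDF by some other tool, that y increases down the page,
and that the table is a plain grid with no merged cells. The names
cluster and table_from_primitives are made up for this example.

```python
from bisect import bisect_right


def cluster(values, tol=2.0):
    """Collapse nearly equal coordinates into one representative each."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def table_from_primitives(lines, words, tol=2.0):
    """Rebuild a grid of cell strings from ruling lines and positioned text.

    lines: iterable of (x1, y1, x2, y2) straight rules (assumed input format)
    words: iterable of (x, y, text) fragments (assumed input format)
    """
    # Vertical rules give the column boundaries, horizontal rules the rows.
    xs = cluster([x1 for x1, y1, x2, y2 in lines if abs(x1 - x2) <= tol], tol)
    ys = cluster([y1 for x1, y1, x2, y2 in lines if abs(y1 - y2) <= tol], tol)
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit fragments top to bottom, left to right, so cell text stays ordered.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(ys, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col].append(text)  # fragments outside the grid are dropped
    return [[" ".join(cell) for cell in row] for row in grid]


# A 2x2 table drawn with three vertical and three horizontal rules.
rules = [(0, 0, 0, 40), (50, 0, 50, 40), (100, 0, 100, 40),
         (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40)]
text = [(5, 5, "Name"), (55, 5, "Mass"), (5, 25, "H2O"), (55, 25, "18.02")]
table = table_from_primitives(rules, text)
```

Real documents need source-specific heuristics on top of this (tables
without rules, spanning cells, rotated text), which is why a specific
source helps so much.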

Our effort is better spent trying to change culture, I think. But it's a
massive task in science. After all, scientific data in journals belongs to
the publishers, doesn't it :-(

>
> I've commented on
> http://www.aboutsocialdata.org/tag/open-knowledge-foundation/ with words
> to this effect.
>
> Jenny
>
>
> On Tue, Nov 22, 2011 at 6:03 PM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>
>> Thought this might be of interest!
>>
>> http://blog.okfn.org/2011/11/17/introducing-the-data-digitizer/
>> http://www.aboutsocialdata.org/tag/open-knowledge-foundation/
>>
>> --
>> Jonathan Gray
>>
>> Community Coordinator
>> The Open Knowledge Foundation
>> http://www.okfn.org
>>
>> http://twitter.com/jwyg
>>
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science
>>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069