[okfn-dev] New idea : tool to assist with digitisation

Jonathan Gray jonathan.gray at okfn.org
Fri Apr 1 10:37:17 UTC 2011

Amazing! :-)

I've previously put a few notes on something like this here:



On Fri, Apr 1, 2011 at 12:21 PM, Tim McNamara
<paperless at timmcnamara.co.nz> wrote:
> I've had a brief discussion with Rufus about an application that I have a
> mental sketch of. The tool aims to make it easier for organisations to
> digitise content. I would like to create the tool and then make it part of
> OKFN's suite if it proves useful.
> This is how I imagine the application would work:
> Consider the case where you need to enter data into a database from several
> hundred completed paper forms.
> Administrators create project templates. They scan and upload one of the
> paper forms. Then, the admin outlines bounding boxes (bbox) in the UI of
> where fields are and adds labels, such as "Last name", "Question 7", etc.
> Those coordinates are stored along with the labels to create a project
> template.
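
A minimal sketch of how such a project template might be represented, assuming
bounding boxes are stored as pixel coordinates on the scanned page. The names
FormField and ProjectTemplate, and the example coordinates, are hypothetical
and not taken from Tim's code:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FormField:
    """One labelled bounding box on the scanned form (pixel coordinates)."""
    label: str   # e.g. "Last name", "Question 7"
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class ProjectTemplate:
    """A named collection of labelled bounding boxes for one paper form."""
    name: str
    fields: list = field(default_factory=list)

    def add_field(self, label, left, top, right, bottom):
        # Reject boxes that have no area, which usually signals a UI slip.
        if right <= left or bottom <= top:
            raise ValueError("bounding box must have positive width and height")
        self.fields.append(FormField(label, left, top, right, bottom))


# Hypothetical template with two fields outlined by the administrator.
template = ProjectTemplate("example-form")
template.add_field("Last name", 120, 300, 620, 360)
template.add_field("Question 7", 120, 900, 1100, 1010)
```
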
> The application uses email as the bulk upload "API", as all modern printers
> can email PDF or TIFF files. The application then crops out sections that
> are slightly larger than the coordinate section. The database is then
> populated with many thousands of tasks that need data entry.
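
The "slightly larger" crop could be computed roughly like this. This is only a
sketch under my own assumptions: the function name padded_crop_box, the margin
in pixels, and the page size are all made up for illustration:

```python
def padded_crop_box(bbox, page_size, margin=10):
    """Grow a (left, top, right, bottom) box by `margin` pixels on each side,
    clamped so it never extends past the edges of the page."""
    left, top, right, bottom = bbox
    width, height = page_size
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


# Hypothetical field box on a 2480 x 3508 pixel scan.
box = padded_crop_box((120, 300, 620, 360), (2480, 3508), margin=15)
print(box)
```

The result is in (left, upper, right, lower) order, which is the order Pillow's
Image.crop() takes, so it could be passed straight to an imaging library.
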
> The application then shuffles the queue. This protects the anonymity of the
> people in the form. Shuffling by question makes it nearly impossible for
> any of the data entry clerks to gather a full picture of any of the
> individuals.
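
One way to read the shuffling step, as a sketch: treat every (form, field) pair
as a separate data entry task and shuffle the whole list, so consecutive tasks
rarely come from the same form. The function name build_shuffled_queue and the
identifiers below are hypothetical:

```python
import random
from itertools import product


def build_shuffled_queue(form_ids, labels, rng=None):
    """Return one task per (form, field) pair, in random order."""
    rng = rng or random.Random()
    tasks = list(product(form_ids, labels))
    # Shuffle in place so crops from one form are scattered through the queue.
    rng.shuffle(tasks)
    return tasks


# Hypothetical example: three scanned forms, two fields each.
queue = build_shuffled_queue(
    ["form-001", "form-002", "form-003"],
    ["Last name", "Question 7"],
    rng=random.Random(0),  # fixed seed only to make the example repeatable
)
for form_id, label in queue:
    print(form_id, label)
```

Shuffling alone does not stop one clerk from eventually seeing every field of a
form, so a real queue would probably also need to spread a form's tasks across
different clerks.
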
> Digitising unstructured content, such as literature, should be easier than
> structured content. The application would just split each page into a few
> lines, in a similar manner to Project Gutenberg's process (although they
> give every volunteer a whole page).
> The application has quite broad applications for liberating data and
> content. I have written about a hundred lines of code to experiment with
> extracting the bounding box coordinates and cropping images. Before I went
> further, I thought it would be useful to gather feedback, support and
> suggestions.
> I am thinking about creating a hosted version on Google App Engine with code
> under the AGPL. That way, it would be possible for individual organisations
> to run their own instance without lock-in or concerns about misuse of the
> data.
> I would like to run some more sophisticated image processing and an OCR
> engine over the content as well. However, it's impossible to run arbitrary
> binaries on App Engine.
> My best,
> Tim McNamara
> _______________________________________________
> okfn-dev mailing list
> okfn-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-dev

Jonathan Gray

Community Coordinator
The Open Knowledge Foundation

