Tim McNamara
Fri Apr 1 10:21:50 UTC 2011

I've had a brief discussion with Rufus about an application that I have a
mental sketch of. The tool aims to make it easier for organisations to
digitise content. I would like to create the tool and then make it part of
OKFN's suite if it proves useful.

This is how I imagine the application would work:

Consider the case where you need to enter data into a database from several
hundred completed paper forms.

Administrators create project templates. They scan and upload one of the
paper forms. Then, in the UI, the admin draws a bounding box (bbox) around
each field and adds a label, such as "Last name", "Question 7", etc.
Those coordinates are stored along with the labels to create a project
template.
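
As a rough illustration of the template described above, here is a minimal
Python sketch of one way the labels and coordinates could be stored. The
names (Field, template_to_json, template_from_json) and the JSON layout are
my own assumptions for the example, not part of any existing code.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Field:
    """One labelled bounding box on the scanned form, in pixels (hypothetical layout)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int


def template_to_json(name, fields):
    """Serialise a project template (a name plus its fields) to a JSON string."""
    return json.dumps({"name": name, "fields": [asdict(f) for f in fields]})


def template_from_json(text):
    """Rebuild the template name and Field list from the JSON string."""
    data = json.loads(text)
    return data["name"], [Field(**f) for f in data["fields"]]
```

Storing plain pixel coordinates keeps the template independent of any
particular imaging library.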

The application uses email as the bulk upload "API", as most modern
networked printers and scanners can email scans as PDF or TIFF files. For
each uploaded form, the application then crops out a region slightly larger
than each bounding box in the template. The database is then populated with
many thousands of tasks that need data entry.
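
To make the cropping and task-creation step concrete, here is a small Python
sketch under some assumptions of mine: boxes are (left, top, right, bottom)
pixel tuples, the margin is a fixed 10 pixels, and the function names
padded_box and make_tasks are hypothetical. A tuple in this order could, for
instance, be handed to an imaging library's crop call, but the sketch itself
only does the arithmetic.

```python
def padded_box(box, page_size, margin=10):
    """Grow a (left, top, right, bottom) box by `margin` pixels, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


def make_tasks(form_ids, fields, page_size, margin=10):
    """Create one data entry task per (form, field) pair.

    `fields` maps a label to its bounding box from the project template.
    """
    return [
        {"form_id": form_id, "label": label, "crop": padded_box(box, page_size, margin)}
        for form_id in form_ids
        for label, box in fields.items()
    ]
```

With 300 forms and 20 fields per form, this would yield 6,000 tasks.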

The application then shuffles the queue. This protects the anonymity of the
people in the form. Shuffling by question makes it nearly impossible for
any of the data entry clerks to gather a full picture of any of the
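
The shuffling step could look something like the following Python sketch.
The function name shuffled_queue is hypothetical, and the choice of
random.SystemRandom as the default source of randomness is my assumption; a
seeded random.Random can be passed in for repeatable tests.

```python
import random


def shuffled_queue(tasks, rng=None):
    """Return a shuffled copy of the task list, leaving the input untouched.

    By default the order comes from the operating system's randomness
    source, so it cannot be reproduced from a known seed.
    """
    rng = rng or random.SystemRandom()
    queue = list(tasks)
    rng.shuffle(queue)
    return queue
```

Because consecutive tasks in the queue are unlikely to come from the same
form, a clerk working through it sees isolated answers rather than complete
forms.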

Digitising unstructured content, such as literature, should be easier than
structured content. The application would just split each page into a few
lines, in a similar manner to Project Gutenberg's process (although they
give every volunteer a whole page).
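
For unstructured pages, the simplest possible split is into horizontal
strips of a fixed height, as in this Python sketch. The fixed strip height
and the function name split_into_strips are assumptions of mine; detecting
the actual text lines would need real image analysis.

```python
def split_into_strips(page_width, page_height, strip_height):
    """Return (left, top, right, bottom) boxes that tile the page top to bottom.

    The last strip is shorter when the page height is not an exact
    multiple of `strip_height`.
    """
    if strip_height <= 0:
        raise ValueError("strip_height must be positive")
    return [
        (0, top, page_width, min(top + strip_height, page_height))
        for top in range(0, page_height, strip_height)
    ]
```

In practice the strips would probably need to overlap a little, so that a
line of text cut in half by one strip is whole in the next.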

The tool could be used quite broadly for liberating data and
content. I have written about a hundred lines of code to experiment with
extracting the bounding box coordinates and cropping images. Before I went
further, I thought it would be useful to gather feedback, support and

I am thinking about creating a hosted version on Google App Engine with code
under the AGPL. That way, it would be possible for individual organisations
to run their own instance without lock-in or concerns about misuse of the

I would like to run some more sophisticated image processing and an OCR
engine over the content as well. However, it's impossible to run arbitrary
binaries on App Engine.




