[Open-Legislation] msg foo

Levin Alexander mail at levinalex.net
Wed Jan 12 18:10:26 UTC 2011


On Wed, Jan 12, 2011 at 09:25, Stefan Sels <stefan at sels.com> wrote:

> it would not give us the nice markup we would like to have, but we could do
> these nice keyword graphs.
>
> is there a wiki i should write tools for that kind of stuff into?
>
> there was somebody with gigs of PDFs.

Me.

tl;dr: Yes, a free (and good!) OCR service would be very cool, but I
think I'm not at the point where I need it.


Three things where I think OCR could help me in the short term:

- Produce very coarse word lists from the documents to enable
full-text search and build stats/graphics
- OCR the table of contents
- Extract bounding boxes and the general document structure (number of
columns, headlines, and so on). This could help with building work
units for Mechanical Turk (the error rate looks too high for actual
conversion [1])
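
To make the first point concrete, here is a rough sketch of how coarse
word lists could feed a full-text index and word counts. It is an
illustration only: the function names, the sample file names and the
"drop anything shorter than three letters or containing digits" rule are
my assumptions, not something that exists yet.

```python
import re
from collections import Counter, defaultdict

# Letters only (Unicode-aware, so umlauts survive), at least 3 characters.
# Tokens with digits mixed in are skipped entirely, on the assumption that
# they are mostly OCR noise.
WORD_RE = re.compile(r"\b[^\W\d_]{3,}\b")


def coarse_words(text):
    """Return a lowercased, very coarse word list for one document."""
    return [w.lower() for w in WORD_RE.findall(text)]


def build_index(docs):
    """Build an inverted index (word -> set of document names) and
    overall word counts from a dict of {document name: OCR text}."""
    index = defaultdict(set)
    counts = Counter()
    for name, text in docs.items():
        words = coarse_words(text)
        counts.update(words)
        for word in words:
            index[word].add(name)
    return index, counts


if __name__ == "__main__":
    # Made-up sample input standing in for OCR output.
    docs = {
        "a.pdf": "Gesetz über die Änderung des Gesetz3s",
        "b.pdf": "Das Gesetz tritt in Kraft",
    }
    index, counts = build_index(docs)
    print(sorted(index["gesetz"]))
    print(counts.most_common(3))
```

The index is enough for simple keyword lookups, and the counts are the
raw material for keyword graphs. Anything smarter (stemming, fuzzy
matching of OCR errors) is left out on purpose.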

But for my use case (revision history of German laws) this would be a
little too early. I need to decide on the representation and document
formats for all that stuff first.

The data I have has been "real" PDFs (as opposed to scans) since about
1998, so I think it would be best to focus on that part in the beginning.

--Levin

[1] I just tried ocrad, gocr, tesseract and Google Docs to see what
they would make of the files. The results were really bad (default
settings only, no training at all). I expect commercial software to be
much better, and training should also make a huge difference, as the
data is highly regular.



