[Open-Legislation] msg foo

Wed Jan 12 18:51:40 UTC 2011

Hi all,

On Wed, Jan 12, 2011 at 7:10 PM, Levin Alexander <mail at levinalex.net> wrote:
> Three things where I think OCR could help me in the short term:
>
> - Produce very coarse word lists from the documents to enable full
> text search and build stats/graphics
> - OCR the table of contents
> - Extract bounding boxes and the general document structure (number of
> columns, headlines, and so on) - this could be helpful to enable
> building of work units for Mechanical Turk (looks like the error rate
> is too high for actual conversion [1])

That's probably a realistic assessment, and I think with a bit of
effort one might be able to add "linearization" of columns on top of
this (which would be very useful).

> But for my use case (revision history of German laws) this would be a
> little too early. I need to decide on the representation and document
> formats for all that stuff first.

I would also like to focus on that first, both for the German case and
because we can get a load of data from the EU. Having a clever format
to represent legal texts and changesets would be very useful, both for
the day-to-day and historical analysis of legal development and for
the up-to-date agenda extraction that I'm most interested in
contributing to.

- Friedrich