[Open-Legislation] docfnord
stef
stefan.marsiske at gmail.com
Tue Jan 11 22:48:09 UTC 2011
Besides OCR, which is very important, I would also like to stress the need
for a converter from PDF to some sensible markup. Not much: bullets, headings
and footnotes should be recognized. Tables would be a big bonus. Lots of
documents are only available as PDFs, and to datamine them we need a good
representation of the text.
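
To make the idea concrete, here is a rough sketch of the kind of recognition
meant above. It assumes the text has already been pulled out of the PDF line
by line (that step is not shown), and the patterns and the name classify are
made up for illustration, not taken from any existing tool. Tables are not
handled, and real documents would likely need font and layout information too.

```python
import re

# Hypothetical patterns, tuned to nothing in particular.
FOOTNOTE = re.compile(r"^\[(\d+)\]\s+(.+)$")               # "[1] some note"
BULLET = re.compile(r"^(?:[-*\u2022]|\([a-z]\))\s+(.+)$")  # "- x", "* x", "(a) x"
HEADING = re.compile(r"^(?:Article|Chapter|Section)\s+\d+\b", re.IGNORECASE)


def classify(line):
    """Return (kind, text) for one line of already-extracted text."""
    stripped = line.strip()
    if not stripped:
        return ("blank", "")
    m = FOOTNOTE.match(stripped)
    if m:
        return ("footnote", m.group(2))
    m = BULLET.match(stripped)
    if m:
        return ("bullet", m.group(1))
    # Short all-caps lines are treated as headings; this is a guess.
    if HEADING.match(stripped) or (stripped.isupper() and len(stripped) < 60):
        return ("heading", stripped)
    return ("text", stripped)


if __name__ == "__main__":
    sample = [
        "Article 230",
        "  - tables would be a bonus",
        "[1] See the official translation.",
        "An ordinary sentence of body text.",
    ]
    for raw in sample:
        print(classify(raw))
```

Something this simple will misfire on plenty of real documents, but even a
crude split into headings, bullets, footnotes and body text is a much better
starting point for datamining than a flat stream of characters.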
On Tue, Jan 11, 2011 at 07:24:55AM +0100, JOSEFSSON Erik wrote:
> On 01/11/2011 12:38 AM, Stefan Sels wrote:
>
> > I hope we don't have to OCR anything :)
>
> There is definitely a need for a quick, reliable and anonymous OCR service (fax2html would be great).
>
> For example, it's easy to imagine a paper copy of the Hungarian media law could have been leaked a week or two before the English translation was officially released. With OCR, we would then have had a markup of the copypaste parts (the longstrings/frags/pippies) of the law in an instant.
>
> And with that markup, it would have been easier to analyse (and criticise) the Hungarian government's claim that the law complies with existing EU legislation (see http://euwiki.org/HUMED#Article_230 ).
>
> So let me add OCR to the wish list :-)
>
> //Erik
>
> --
> Erik Josefsson
> Adviser on internet policies
> Greens/EFA Group
> GSM: +32484082063
> BXL: PHS 04C075 TEL: +3222832667
> SBG: WIC M03005 TEL: +33388173776
>
---end quoted text---
--
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8 BB96 E7A4 C6CF A84A 7140