[Open-Legislation] docfnord

stef stefan.marsiske at gmail.com
Tue Jan 11 22:48:09 UTC 2011


besides ocr being very important, i would also like to stress the need for a
pdf to sensible markup converter. nothing fancy: bullets, headings and
footnotes should be recognized. tables would be a big bonus. lots of documents
are only available as pdfs, and to datamine them we need a good representation
of the text.

On Tue, Jan 11, 2011 at 07:24:55AM +0100, JOSEFSSON Erik wrote:
> On 01/11/2011 12:38 AM, Stefan Sels wrote:
> 
> 	I hope we don't have to OCR anything :)
> 
> There is definitely a need for a quick, reliable and anonymous OCR service (fax2html would be great).
> 
> For example, it's easy to imagine that a paper copy of the Hungarian media law could have been leaked a week or two before the English translation was officially released. With OCR, we would then have had a markup of the copy-pasted parts (the longstrings/frags/pippies) of the law in an instant.
> 
> And with that markup, it would have been easier to analyse (and criticise) the Hungarian government's claim that the law is compliant with existing EU legislation (see http://euwiki.org/HUMED#Article_230 ).
> 
> So let me add OCR to the wish list :-)
> 
> //Erik
> 
> 
> 
> -- 
> Erik Josefsson
> Adviser on internet policies
> Greens/EFA Group
> GSM: +32484082063
> BXL: PHS 04C075 TEL: +3222832667
> SBG: WIC M03005 TEL: +33388173776
> 

---end quoted text---

-- 
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8  BB96 E7A4 C6CF A84A 7140



