[okfn-hu] Hungarian company register?

stef stefan.marsiske at gmail.com
Mon Apr 4 09:17:41 UTC 2011


On Mon, Apr 04, 2011 at 09:00:16AM +0200, Peter Gervai wrote:
> On Sun, Apr 3, 2011 at 11:38, stef <stefan.marsiske at gmail.com> wrote:
> 
> > how are you parsing the pdfs? will it be manual labor?
> 
> Often the pdf simply contains the text, and pdftotext converts it just fine.

yes, for humans. as input for further processing by scripts, there's not much
worse. PDFs are the tools of the devil (and the printing industry).

> Depends on the individual case.

not really. i learned to avoid pdfs as far as possible, they're as good a
deterrent to useful scraping as captchas or limited search interfaces.
their positional markup removes all kinds of semantic information, so you can
only rely on positional heuristics, which always introduce a lot of false
positives. so manual cleaning is always a necessity too.
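
to illustrate what i mean by positional heuristics, here is a minimal sketch
in python. it is not taken from any real scraper: the function name
group_rows, the y_tol parameter, the (x, y, text) input format and the sample
company names are all made up for the example. it guesses table rows purely
from the y coordinate of each word, which is about all a pdf leaves you with.

```python
# hypothetical input: (x, y, text) tuples, roughly what a pdf text
# extractor with position output might give you. the data is invented.
words = [
    (10, 100.0, "Acme Kft."), (200, 100.0, "Budapest"),
    (10, 115.0, "Foo Bt."),   (200, 118.5, "Szeged"),  # baseline off by 3.5
]

def group_rows(words, y_tol=2.0):
    """Guess table rows: a word joins the current row if its y coordinate
    is within y_tol of the first word of that row. Purely positional."""
    rows = []
    # walk the words top to bottom, then left to right
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1]["y"]) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # return only the text of each row, ordered by x
    return [[text for _, text in sorted(row["cells"])] for row in rows]

strict = group_rows(words)             # splits "Foo Bt." and "Szeged"
loose = group_rows(words, y_tol=4.0)   # keeps them on one row
```

with the default tolerance the second record falls apart into two rows
because of a slightly shifted baseline. raising the tolerance fixes this
sample, but on a densely set page it starts merging rows that are really
separate. there is no value that is right everywhere, hence the manual
cleaning.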

-- 
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8  BB96 E7A4 C6CF A84A 7140



