[okfn-za] technical advice sought: scraping files

Fri Dec 7 11:35:14 UTC 2012

Hi Kevin,

I'm cc'ing the okfn-labs mailing list [1], where I'm sure people will have
more good suggestions (you may want to join!).

Depending on how geeky / techy you are, there are various options.

Less geeky:

* http://www.cometdocs.com/ (non-free)
* http://finereader.abbyy.com/ (non-free)

More geeky ... For PDFs:

* pdftotext / pdftoxml
* tesseract for OCR (you could even do PDF => image => tesseract to do
text extraction ...)
* scraperwiki (I think it uses pdftotext?)
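
For the email-address part, here is a rough Python sketch of the pdftotext
route. It assumes the pdftotext command-line tool (from Xpdf or Poppler) is
installed and on your PATH, and that the PDFs contain real text rather than
scanned images. The function names, the folder layout and the email pattern
are my own illustrative choices, not part of any of the tools above. It does
not attempt to parse names; it just returns the line each address was found
on, so the name can be read off by eye or with further processing.

```python
import re
import subprocess
from pathlib import Path

# Deliberately simple pattern: it catches common addresses, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pdf_to_text(pdf_path):
    """Run the pdftotext command-line tool and return the extracted text.

    Assumes pdftotext is installed and on the PATH. Passing "-" as the
    output file makes pdftotext write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_emails(text):
    """Return (email, line) pairs so the nearby name can be read off the line."""
    pairs = []
    for line in text.splitlines():
        for email in EMAIL_RE.findall(line):
            pairs.append((email, line.strip()))
    return pairs


def scan_folder(folder):
    """Yield (filename, email, line) for every PDF in a folder (hypothetical layout)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        for email, line in find_emails(pdf_to_text(pdf)):
            yield pdf.name, email, line
```

Something like `for row in scan_folder("pdfs"): print(row)` would then list
every hit, where "pdfs" is a placeholder folder name. For scanned PDFs the
text would have to come from the OCR route instead, and the DOC files would
need converting to text by some other means first.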

See also:

* <http://getthedata.org/questions/339/excel-table-from-a-pdf/>
* <http://getthedata.org/questions/122/what-tools-or-services-are-good-for-scraping-data-from-websites/>

Rufus

[1]: http://lists.okfn.org/mailman/listinfo/okfn-labs

On 7 December 2012 10:30, Kevin Govender <kg at astro4dev.org> wrote:
> Hi all
> Quick question:
> I have about 200 PDFs and DOCs that contain information I need to scrape
> (need to get email addresses and corresponding names from them). Can you
> point me in the rough direction of what tools I should use? I'm on a Windows
> 7 machine.
> Many thanks in advance
> Regards
> Kevin
>
> Kevin Govender
> IAU Office of Astronomy for Development
> www.astro4dev.org