[okfn-labs] [okfn-za] technical advice sought: scraping files
Rufus Pollock
rufus.pollock at okfn.org
Fri Dec 7 11:35:14 UTC 2012
Hi Kevin,
I'm cc'ing the okfn-labs mailing list [1], where I'm sure people will have
more good suggestions (you may want to join!).
Depending on how geeky / techy you are, there are various options.
Less geeky:
* http://www.cometdocs.com/ (non-free)
* http://finereader.abbyy.com/ (non-free)
More geeky ... For PDFs:
* pdftotext / pdftoxml
* tesseract for OCR (you could even do PDF => image => tesseract to do
text extraction ...)
* ScraperWiki (I think it uses pdftotext?)
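For the pdftotext route, here is a rough Python sketch of how you might pull
email addresses out of a folder of PDFs. It assumes the pdftotext command-line
tool (from Poppler or Xpdf) is installed and on your PATH. The function names
are just illustrative, and the regex is deliberately simple, so it will miss
unusual addresses. It does not try to match names to addresses; that depends
on how your documents are laid out.

```python
import re
import subprocess
from pathlib import Path

# Deliberately simple pattern: good enough for typical addresses,
# not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses in text, in order of first appearance."""
    found = []
    for address in EMAIL_RE.findall(text):
        if address not in found:
            found.append(address)
    return found


def pdf_to_text(pdf_path):
    """Run pdftotext on one file and return the extracted text.

    Assumes pdftotext is on the PATH. "-layout" keeps columns roughly
    aligned, and "-" as the output file sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def scrape_folder(folder):
    """Map each PDF file name in folder to the addresses found in it."""
    return {
        pdf.name: extract_emails(pdf_to_text(pdf))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }
```

You would call something like scrape_folder("docs"), where "docs" is a
placeholder for wherever the PDFs live. For scanned PDFs pdftotext will return
little or nothing, which is where the OCR route comes in.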
See also:
* <http://getthedata.org/questions/339/excel-table-from-a-pdf/>
* <http://getthedata.org/questions/122/what-tools-or-services-are-good-for-scraping-data-from-websites/>
Rufus
[1]: http://lists.okfn.org/mailman/listinfo/okfn-labs
On 7 December 2012 10:30, Kevin Govender <kg at astro4dev.org> wrote:
> Hi all
> Quick question:
> I have about 200 PDFs and DOCs that contain information I need to scrape
> (need to get email addresses and corresponding names from them). Can you
> point me in the rough direction of what tools I should use? I'm on a Windows
> 7 machine.
> Many thanks in advance
> Regards
> Kevin
>
> Kevin Govender
> IAU Office of Astronomy for Development
> www.astro4dev.org
>
> _______________________________________________
> okfn-za mailing list
> okfn-za at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-za
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-za