[okfn-labs] [okfn-za] technical advice sought: scraping files

Michael Bauer michael.bauer at okfn.org
Fri Dec 7 13:31:25 UTC 2012


Kevin,

I have used ScraperWiki and pdftoxml for this before. pdftoxml works nicely
together with libxml for extracting tables and such, especially when the
tables use very specific formatting. I've written some PDF scrapers based on
ScraperWiki; if that is the route you want to take, tell me.
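
To give an idea of the table extraction mentioned above, here is a minimal
sketch. It assumes the XML has the layout that `pdftohtml -xml` writes
(page elements containing text elements with top and left attributes), and
it uses Python's built-in ElementTree rather than libxml so that it runs
without extra installs. The sample data, the function name rows_from_pdfxml
and the tolerance value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the layout assumed above; real files are much longer.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="80" height="12" font="0">Name</text>
<text top="100" left="300" width="80" height="12" font="0">Email</text>
<text top="120" left="50" width="80" height="12" font="0"><b>Jane Doe</b></text>
<text top="121" left="300" width="120" height="12" font="0">jane@example.org</text>
</page>
</pdf2xml>"""


def rows_from_pdfxml(xml_string, tolerance=3):
    """Group text fragments into table rows by their vertical position."""
    rows = []
    for page in ET.fromstring(xml_string).iter("page"):
        # (top, left, text) for every fragment; itertext() also picks up
        # text nested inside <b> or <i> tags.
        cells = sorted(
            (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
            for t in page.iter("text")
        )
        row, row_top = [], None
        for top, left, text in cells:
            # Start a new row when the fragment sits clearly lower on the page.
            if row and top - row_top > tolerance:
                rows.append([cell for _, cell in sorted(row)])
                row = []
            if not row:
                row_top = top
            row.append((left, text))
        if row:
            rows.append([cell for _, cell in sorted(row)])
    return rows


if __name__ == "__main__":
    for r in rows_from_pdfxml(SAMPLE_XML):
        print(r)
```

Grouping by the top coordinate with a small tolerance is only one way to
rebuild rows; tables with wrapped cells or several columns of running text
will need rules specific to that document.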

Otherwise, since you are looking for email addresses and names: pdftotext
(for PDF) and antiword (for DOC) will give you plain text; then use a
specific regex to find the email addresses.
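
As a rough sketch of the regex step, assuming the files have already been
converted to plain text (for example with `pdftotext file.pdf file.txt` or
`antiword file.doc > file.txt`): the pattern below is a deliberately simple
approximation of an email address, not a full validator, and the names
EMAIL_RE and find_emails are made up for this example.

```python
import re

# Simplified pattern: local part, "@", domain with at least one dot.
# It will miss unusual but valid addresses and is only meant for harvesting.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses in text, in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()  # treat differently-cased copies as duplicates
        if address not in found:
            found.append(address)
    return found


if __name__ == "__main__":
    sample = (
        "Contact: Jane Doe <jane.doe@example.org>, "
        "John Smith (J.Smith@example.ac.za). "
        "Jane Doe <jane.doe@example.org>"
    )
    print(find_emails(sample))
```

Pairing each address with its name depends on how the documents lay them
out (same line, table cell, angle brackets and so on), so that part needs a
pattern written for the files at hand.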

Michael

On Fri, Dec 07, 2012 at 11:35:14AM +0000, Rufus Pollock wrote:
> Hi Kevin,
> 
> I'm cc'ing okfn-labs mailing list [1] where I'm sure people will have
> more good suggestions (you may want to join!).
> 
> Depending on how geeky / techy you are, there are various options.
> 
> Less geeky:
> 
> * http://www.cometdocs.com/ (non-free)
> * http://finereader.abbyy.com/ (non-free)
> 
> More geeky ... For PDFs:
> 
> * pdftotext / pdftoxml
> * tesseract for OCR (you could even do PDF => image => tesseract to do
> text extraction ...)
> * scraperwiki (I think it uses pdftotext?)
> 
> See also:
> 
> * <http://getthedata.org/questions/339/excel-table-from-a-pdf/>
> * <http://getthedata.org/questions/122/what-tools-or-services-are-good-for-scraping-data-from-websites/>
> 
> Rufus
> 
> [1]: http://lists.okfn.org/mailman/listinfo/okfn-labs
> 
> On 7 December 2012 10:30, Kevin Govender <kg at astro4dev.org> wrote:
> > Hi all
> > Quick question:
> > I have about 200 PDFs and DOCs that contain information I need to scrape
> > (need to get email addresses and corresponding names from them). Can you
> > point me in the rough direction of what tools I should use? I'm on a Windows
> > 7 machine.
> > Many thanks in advance
> > Regards
> > Kevin
> >
> > Kevin Govender
> > IAU Office of Astronomy for Development
> > www.astro4dev.org
> >
> > _______________________________________________
> > okfn-za mailing list
> > okfn-za at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/okfn-za
> > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-za
> 
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs

-- 
Data Wrangler with the Open Knowledge Foundation (OKFN.org)
GPG/PGP key: http://tentacleriot.eu/mihi.asc
Twitter: @mihi_tr Skype: mihi_tr



