[School-of-data] Scraping_Loop: Google spreadsheet template based on ImportXML function

Tue Nov 19 11:51:06 UTC 2013

Hi All,

Apologies for the long email. It is about a new open tool I have developed
to help non-programming journalists.

My name is Juan Elosua and I am a Spanish freelance developer. Over the
last two years I have been working on various Open Data and Data Journalism
projects.

I also provide training to journalists on tools they can use to improve
their data-driven stories. When I try to teach web scraping tools I always
end up saying that without coding knowledge the tools out there are not
really that powerful.

Let me explain this a bit. We can divide the scraping process into two parts:

   1. Navigation: Being able to move from one page of data to the next one
   in order to retrieve a complete dataset.
   2. Extraction: Once on a page, selecting the data we want to scrape.

The problem with the tools available today is the *navigation* part: we
have to go manually from one URL to the next. Once we are on the right
page, there are many good tools to extract the desired content.

For the extraction part of the scraping process I usually resort to the
Google spreadsheet import functions (ImportHTML, ImportXML, etc.). I think
they are handy tools that go straight to the point.
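
To make the extraction step concrete, here is a minimal Python sketch of
the same idea: given the source of one page and an XPath-style query,
return the matching rows. It is only an illustration, not how ImportXML
itself is implemented. The sample page, the table id "results" and the
function name extract_rows are all hypothetical, and the sketch assumes
well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
SAMPLE_PAGE = """
<html><body>
  <table id="results">
    <tr><td>Name A</td><td>2013</td></tr>
    <tr><td>Name B</td><td>2012</td></tr>
  </table>
</body></html>
"""


def extract_rows(page_source, row_path=".//table[@id='results']/tr"):
    """Return one list of cell texts for every table row matched by row_path."""
    # ElementTree supports only a limited XPath subset and needs valid XML.
    root = ET.fromstring(page_source.strip())
    rows = []
    for tr in root.findall(row_path):
        rows.append([(td.text or "").strip() for td in tr.findall("td")])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_PAGE):
        print(row)
```

In a spreadsheet the equivalent is a single ImportXML call with a URL and
an XPath query; the sketch just spells out what such a query selects.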

The Scraping_Loop template
<https://docs.google.com/spreadsheet/ccc?key=0Ar3KSfz0LI8kdFF6S3d0aUxLek9wblBpdVljMWhPdGc&newcopy>
that I have developed *allows using the ImportXML function inside a loop*,
in order to automate the scraping process for sites that have many pages of
results, or data from many years. It accumulates all the retrieved data in
a "Data" sheet.

   - http://medicalboard.co.ke/online-services/retention/?currpage=1
   - http://medicalboard.co.ke/online-services/retention/?currpage=2
   - http://medicalboard.co.ke/online-services/retention/?currpage=3
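
The navigation loop over URLs like the ones above can be sketched in a few
lines of Python. This is not the template's actual script, just an
illustration of the idea under stated assumptions: the function names
page_urls and scraping_loop are hypothetical, and the fetch and extract
steps are passed in as plain functions so the example runs offline with
stand-ins instead of real network calls.

```python
BASE_URL = "http://medicalboard.co.ke/online-services/retention/?currpage="


def page_urls(base_url, first_page, last_page):
    """Navigation: build one URL per page number, both ends included."""
    return [f"{base_url}{n}" for n in range(first_page, last_page + 1)]


def scraping_loop(urls, fetch, extract):
    """Visit every URL, extract its rows and accumulate them in one list.

    fetch(url) returns the page source; extract(source) returns a list of rows.
    """
    data = []
    for url in urls:
        data.extend(extract(fetch(url)))
    return data


# Offline stand-ins (assumptions for the demo, no network access involved).
def fake_fetch(url):
    page_number = url.rsplit("=", 1)[1]
    return f"row-from-page-{page_number}"


def fake_extract(page_source):
    return [[page_source]]


urls = page_urls(BASE_URL, 1, 3)
data_sheet = scraping_loop(urls, fake_fetch, fake_extract)

if __name__ == "__main__":
    for row in data_sheet:
        print(row)
```

Keeping navigation (page_urls) apart from extraction (the extract
argument) mirrors the two-part split described earlier: only the loop
needs to know about page numbers.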

The Scraping_Loop template
<https://docs.google.com/spreadsheet/ccc?key=0Ar3KSfz0LI8kdFF6S3d0aUxLek9wblBpdVljMWhPdGc&newcopy>
has a README sheet where I try to explain what it does and how to use it.
You can also find it among the Google Drive templates by searching for
"scraping". I have also written a public Google document
<https://docs.google.com/document/d/1RsWe8osVrPXU_zUWbwB-xa6Ya6WIlDIVeprATc51LBo/edit?usp=sharing>
that explains the functionality in a little more detail.

Be aware that the script asks for many permissions in order to run. In the
public Google document above I explain why the script needs each
authorization. It looks scary, I know, but the script needs them in order
to do its job.

The tool is in beta testing, so some bugs may appear, but I have tested it
for some days and I think it is ready to be made publicly available.

I hope some of you may find this tool useful.

Any feedback is more than welcome.

Cheers

Juan Elosua

PS: on a related note, import.io seems to be a promising site for
automating web scraping without technical knowledge. It may be worth
keeping an eye on them too if you fit that profile.