[humanities-dev] Extracting Texts from Wikisource
Etienne Posthumus
eposthumus at gmail.com
Mon Oct 1 19:07:20 UTC 2012
Aha, got it. You want _all_ the texts, not just the preface.
Here is a quick script with minimal exception handling to do it:
https://gist.github.com/3813746
If you run it like this:
python gimmesrc.py De_Cive > txt
It will first read the table of contents of the title specified on
the command line, then fetch each page listed in the TOC and print
it out.
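
For anyone reading without the gist to hand, the two steps described above
(read the TOC, then fetch every page it lists) can be sketched roughly as
follows. This is not the code from the gist, just a minimal Python 3
illustration under some assumptions: the host is English Wikisource, pages
are fetched as raw wikitext through MediaWiki's action=raw, and the TOC
links to its chapters as subpages of the title. The names fetch_raw and
toc_subpages are made up for this example.

```python
import re
import sys
import urllib.parse
import urllib.request

# Assumption: English Wikisource; the original script may use another host.
BASE = "https://en.wikisource.org/w/index.php"


def fetch_raw(title):
    """Return the raw wikitext of one page (MediaWiki action=raw)."""
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    req = urllib.request.Request(
        BASE + "?" + query,
        headers={"User-Agent": "gimmesrc-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def toc_subpages(title, wikitext):
    """List subpages of `title` linked from the TOC, in order, no duplicates."""
    title = title.replace("_", " ")
    pages = []
    # Capture the link target of each [[target|label]] or [[target]] link.
    for target in re.findall(r"\[\[([^\]|#]+)", wikitext):
        target = target.strip().replace("_", " ")
        if target.startswith("/"):
            target = title + target  # relative subpage link such as [[/Preface/]]
        target = target.rstrip("/")
        if target.startswith(title + "/") and target not in pages:
            pages.append(target)
    return pages


def main(argv):
    title = argv[1]
    toc = fetch_raw(title)
    for page in toc_subpages(title, toc):
        print(fetch_raw(page))


# Only run when a title is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Note that this prints wikitext rather than cleaned plain text, and it has
no error handling, so any markup stripping would be a separate step.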
cheers,
Etienne
On 1 October 2012 16:51, Sam Leon <sam.leon at okfn.org> wrote:
> That gives me plain text, but only of the contents page. The problem with
> that text is that it is divided over many web-pages it seems...
--8<-- snip --
> Any ideas for automated extraction of texts like De Cive welcome...