[humanities-dev] Extracting Texts from Wikisource

Etienne Posthumus eposthumus at gmail.com
Mon Oct 1 19:07:20 UTC 2012


Aha, got it. You want _all_ the texts, not just the preface.

Here is a quick script with minimal exception handling to do it:
https://gist.github.com/3813746

If you run it like this:

python gimmesrc.py De_Cive > txt

It will first read the table of contents of the title specified on
the command line, then fetch each page listed in the toc and print
it out.
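
For readers who cannot open the gist, here is a rough sketch of the same
toc-then-pages idea. It is not the code from the gist, only an illustration
under some assumptions: that the text lives on en.wikisource.org, that the
chapters are subpages of the title (e.g. "De Cive/Preface"), and that the
raw wikitext from MediaWiki's action=raw is good enough as output. The
names raw_url, fetch and subpages are made up for this example.

```python
import re
import sys
import urllib.parse
import urllib.request

# Assumption: the work is hosted on English Wikisource.
BASE = "https://en.wikisource.org/w/index.php"

# Captures the target of a wiki link: [[target|label]] or [[target]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def raw_url(title):
    """Build the URL that returns a page's unrendered wikitext."""
    return BASE + "?" + urllib.parse.urlencode({"title": title, "action": "raw"})


def fetch(title):
    """Download the wikitext of one page (no error handling)."""
    req = urllib.request.Request(
        raw_url(title), headers={"User-Agent": "toc-sketch/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def subpages(title, wikitext):
    """Return the subpages of `title` linked from its contents page,
    in order of first appearance and without duplicates."""
    prefix = title.replace("_", " ") + "/"
    found = []
    for target in LINK.findall(wikitext):
        target = target.strip().replace("_", " ")
        if target.startswith("/"):
            # Relative link such as [[/Preface/]]
            target = prefix + target[1:]
        target = target.rstrip("/")
        if target.startswith(prefix) and target not in found:
            found.append(target)
    return found


def main(title):
    # Read the contents page first, then print every page it lists.
    toc = fetch(title)
    for page in subpages(title, toc):
        print(fetch(page))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a file name of your choice, it would be run the same way as
above, with the title as the only argument. Note that it prints wikitext,
so templates and markup would still need stripping afterwards.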

cheers,

Etienne

On 1 October 2012 16:51, Sam Leon <sam.leon at okfn.org> wrote:
> That gives me plain text, but only of the contents page. The problem with
> that text is that it is divided over many web-pages it seems...
--8<-- snip --
> Any ideas for automated extraction of texts like De Cive welcome...



