[humanities-dev] Extracting Texts from Wikisource

Sam Leon sam.leon at okfn.org
Wed Oct 3 10:13:05 UTC 2012

Hi Etienne,

You're a champ. I managed to get hold of a .txt of De Cive, but this script is
an immensely useful start on a much-needed TEXTUS extension.


On Mon, Oct 1, 2012 at 8:07 PM, Etienne Posthumus <eposthumus at gmail.com> wrote:

> Aha, got it. You want _all_ the texts, not just the preface.
> Here is a quick script with minimal exception handling to do it:
> https://gist.github.com/3813746
> If you run it like this:
> python gimmesrc.py De_Cive > txt
> It will first read the table of contents of the title as specified on
> the command line, and then suck each page listed in the toc, and print
> it out.
> cheers,
> Etienne
> On 1 October 2012 16:51, Sam Leon <sam.leon at okfn.org> wrote:
> > That gives me plain text, but only of the contents page. The problem with
> > that text is that it is divided over many web-pages it seems...
> --8<-- snip --
> > Any ideas for automated extraction of texts like De Cive welcome...
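
For readers without access to the gist, the approach Etienne describes (read the table of contents for a title, then fetch and print each listed page) might look roughly like the sketch below. This is not the gimmesrc.py script itself. It assumes the English Wikisource endpoint, assumes chapters are subpages named "Title/Something", and prints raw wikitext rather than cleaned plain text. The names api_url, api_get, subpages and dump are made up for illustration. The API returns links in its own order, which may not match the order of the contents page.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikisource; the thread does not say which edition is used.
API = "https://en.wikisource.org/w/api.php"


def api_url(**params):
    """Build a MediaWiki Action API URL that asks for JSON output."""
    params.update(format="json", formatversion="2")
    return API + "?" + urllib.parse.urlencode(params)


def api_get(**params):
    """Call the API and decode the JSON response (minimal error handling)."""
    req = urllib.request.Request(
        api_url(**params),
        headers={"User-Agent": "toc-fetch-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def subpages(links, title):
    """Keep main-namespace links that are subpages of `title`, without duplicates."""
    prefix = title.replace("_", " ") + "/"
    seen, out = set(), []
    for link in links:
        name = link.get("title", "")
        if link.get("ns") == 0 and name.startswith(prefix) and name not in seen:
            seen.add(name)
            out.append(name)
    return out


def dump(title):
    """Print the wikitext of every subpage linked from the contents page."""
    toc = api_get(action="parse", page=title, prop="links")
    for name in subpages(toc["parse"]["links"], title):
        page = api_get(action="parse", page=name, prop="wikitext")
        print(page["parse"]["wikitext"])
```

Calling dump("De_Cive") and redirecting the output to a file would mirror the command line shown above, provided the subpage assumption holds for that title.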

Sam Leon
Community Coordinator
Open Knowledge Foundation
Twitter: @noeL_maS
Skype: samedleon