[humanities-dev] Extracting Texts from Wikisource

Sam Leon sam.leon at okfn.org
Wed Oct 3 10:13:05 UTC 2012


Hi Etienne,

You're a champ. I managed to get hold of a .txt of De Cive, but this script is
an immensely useful start on a much-needed TEXTUS extension.

Cheers,
Sam

On Mon, Oct 1, 2012 at 8:07 PM, Etienne Posthumus <eposthumus at gmail.com> wrote:

> Aha, got it. You want _all_ the texts, not just the preface.
>
> Here is a quick script with minimal exception handling to do it:
> https://gist.github.com/3813746
>
> If you run it like this:
>
> python gimmesrc.py De_Cive > txt
>
> It will first read the table of contents of the title given on
> the command line, then fetch each page listed in the TOC and print
> it out.
>
> cheers,
>
> Etienne
>
> On 1 October 2012 16:51, Sam Leon <sam.leon at okfn.org> wrote:
> > That gives me plain text, but only of the contents page. The problem is
> > that the text seems to be divided over many web pages...
> --8<-- snip --
> > Any ideas for automated extraction of texts like De Cive welcome...

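For anyone reading this in the archive without the gist to hand, the approach
Etienne describes (read the table of contents, then fetch each page it lists)
can be sketched roughly as below. This is not the gimmesrc.py script itself,
only a minimal illustration under some assumptions: it targets English
Wikisource, it uses MediaWiki's index.php?action=raw to get a page's wikitext,
and it assumes the TOC page links to its subpages with ordinary [[...]] links
(a TOC built by a template would need different handling). The function names
and the User-Agent string are made up for the example.

```python
import re
import urllib.parse
import urllib.request

# Assumption: English Wikisource; action=raw returns a page's wikitext.
BASE = "https://en.wikisource.org/w/index.php"

# Captures the target of a [[wikilink]], stopping at "|", "#" or "]".
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def fetch_raw(title):
    """Download the raw wikitext of one page (hypothetical helper)."""
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    request = urllib.request.Request(
        BASE + "?" + query, headers={"User-Agent": "toc-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def toc_subpages(title, toc_text):
    """Return the subpages of `title` linked from the TOC, in order, once each."""
    title = title.replace(" ", "_")
    pages = []
    for target in LINK_RE.findall(toc_text):
        target = target.strip().replace(" ", "_")
        if target.startswith("/"):
            # Relative link such as [[/Preface/]]
            target = title + target
        target = target.rstrip("/")
        if target.startswith(title + "/") and target not in pages:
            pages.append(target)
    return pages


def gather(title, fetch=fetch_raw):
    """Fetch the TOC page, then every subpage it lists, and return their texts."""
    toc_text = fetch(title)
    return [fetch(page) for page in toc_subpages(title, toc_text)]
```

Calling print("\n\n".join(gather("De_Cive"))) would then print the pages one
after another. Note that the result is wikitext, so markup would still have to
be stripped to get clean plain text. The fetch argument is there so the link
handling can be tried on a saved TOC without touching the network.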


-- 
Sam Leon
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Twitter: @noeL_maS
Skype: samedleon