[humanities-dev] Extracting Texts from Wikisource

todd.d.robbins at gmail.com
Mon Oct 1 20:31:20 UTC 2012


Etienne,

Rock on!

On Mon, Oct 1, 2012 at 1:07 PM, Etienne Posthumus <eposthumus at gmail.com> wrote:

> Aha, got it. You want _all_ the texts, not just the preface.
>
> Here is a quick script with minimal exception handling to do it:
> https://gist.github.com/3813746
>
> If you run it like this:
>
> python gimmesrc.py De_Cive > txt
>
> It will first read the table of contents of the title specified on
> the command line, then fetch each page listed in the TOC and print
> it out.
>
> cheers,
>
> Etienne
>
> On 1 October 2012 16:51, Sam Leon <sam.leon at okfn.org> wrote:
> > That gives me plain text, but only of the contents page. The problem with
> > that text is that it is divided over many web-pages it seems...
> --8<-- snip --
> > Any ideas for automated extraction of texts like De Cive welcome...
>
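For anyone reading the archive without the gist at hand, the approach Etienne describes (read the table of contents, then fetch each listed page and print it) might be sketched as below. This is not the contents of gimmesrc.py. It is a minimal Python 3 sketch that assumes the English Wikisource endpoint, the MediaWiki action=parse API with formatversion=2, and that the chapters of a work live on subpages named "Title/...". The helper names and the User-Agent string are made up for illustration.

```python
import json
import sys
import urllib.parse
import urllib.request

# Assumption: the work is hosted on English Wikisource.
API = "https://en.wikisource.org/w/api.php"


def api_url(title, prop):
    """Build an action=parse URL asking for one property of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": prop,
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urllib.parse.urlencode(params)


def fetch(title, prop):
    """Call the API and return the decoded 'parse' object."""
    req = urllib.request.Request(
        api_url(title, prop),
        headers={"User-Agent": "wikisource-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["parse"]


def subpages(links, title):
    """Keep only links that point below the work, e.g. 'De Cive/Preface'."""
    prefix = title.replace("_", " ") + "/"
    return [l["title"] for l in links if l.get("title", "").startswith(prefix)]


def main(title):
    # Step 1: read the links on the contents page.
    toc = fetch(title, "links")
    # Step 2: fetch each linked subpage and print its raw wikitext.
    for page in subpages(toc["links"], title):
        print(fetch(page, "wikitext")["wikitext"])


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
```

Saved under a name of your choosing, it would be run the same way as the command quoted above, with the page title as the only argument. Two caveats: the output is raw wikitext, so templates and markup are still present, and the pages come out in whatever order the API returns the links, which may not be reading order.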


-- 
Tod Robbins
MLIS 2012 | University of Washington
todrobbins.com | @todrobbins <http://www.twitter.com/#!/todrobbins>