[humanities-dev] Extracting Texts from Wikisource

Sam Leon sam.leon at okfn.org
Mon Oct 1 14:51:08 UTC 2012


Hi Etienne,

That gives me plain text, but only of the contents page. The problem is
that the text itself seems to be split across many separate web pages.

Someone has just sent me the text as a UTF-8 file, but this illustrates a
big problem for TEXTUS, given that one of its main sources will be Wikisource.

Any ideas for automated extraction of texts like De Cive welcome...
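
One possible approach, as a minimal Python sketch rather than a tested
solution: fetch the raw wikitext of the contents page using the action=raw
URL suggested below, collect the links that point to subpages of the work,
fetch each subpage the same way, and strip the markup with a rough regex
step. It assumes the work is laid out as subpages such as "De Cive/Chapter I",
which I have not checked for this text. The function names, the User-Agent
string and the use of https are my own choices, and the regexes will not
handle every template or nested construct.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same endpoint as the action=raw URL in this thread (https is an assumption).
BASE = "https://en.wikisource.org/w/index.php"

# Matches [[target]], [[target|label]] and [[target#section|label]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def raw_url(title):
    """Build the URL that returns a page's raw wikitext."""
    return BASE + "?" + urlencode({"action": "raw", "title": title})


def fetch_raw(title):
    """Download the raw wikitext of one page (needs network access)."""
    request = Request(raw_url(title),
                      headers={"User-Agent": "textus-extract-sketch/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def subpage_titles(parent, wikitext):
    """Return titles of pages linked from wikitext that are subpages of parent."""
    parent_spaced = parent.replace("_", " ")
    titles = []
    for target in LINK_RE.findall(wikitext):
        target = target.strip().replace("_", " ")
        if target.startswith("/"):
            # Relative link such as [[/Chapter I]] or [[/Chapter I/]].
            full = parent_spaced + "/" + target.strip("/")
        elif target.startswith(parent_spaced + "/"):
            full = target
        else:
            continue  # Link to an unrelated page, e.g. an author page.
        full = full.replace(" ", "_")
        if full not in titles:
            titles.append(full)
    return titles


def strip_markup(wikitext):
    """Very rough regex clean-up of wikitext; not a full parser."""
    text = wikitext
    previous = None
    while previous != text:
        # Remove innermost {{...}} templates until none are left.
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Keep the label of [[target|label]], or the target of [[target]].
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"<[^>]+>", "", text)   # Drop HTML-style tags.
    text = re.sub(r"'{2,}", "", text)     # Drop bold/italic quote marks.
    return text.strip()


def extract_work(title):
    """Fetch the contents page and every linked subpage, as plain text."""
    contents = fetch_raw(title)
    parts = [strip_markup(fetch_raw(sub))
             for sub in subpage_titles(title, contents)]
    return "\n\n".join(parts)
```

Calling extract_work("De_Cive") would then return the concatenated text of
whatever subpages the contents page links to, in the order they are listed.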

Best,
Sam

On Mon, Oct 1, 2012 at 10:56 AM, Etienne Posthumus <eposthumus at gmail.com> wrote:

> Hi Sam
>
> On 1 October 2012 11:37, Sam Leon <sam.leon at okfn.org> wrote:
> > Does anyone have any ideas on how to extract the plain text from
> > Wikisource other than manually copying and pasting each section? This
> > would obviously be something we should aim to automate via a script.
>
> You can request the 'raw' MediaWiki source of a page like so:
>
> http://en.wikisource.org/w/index.php?action=raw&title=De_Cive
>
> This gives you the page's wikitext as plain text, with some metadata
> embedded. From there, a small script with a regex step could strip out
> everything except the text you need.
> Possibly the text in this format is already usable?
>
> cheers,
>
> Etienne
>
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
>



-- 
Sam Leon
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Twitter: @noeL_maS
Skype: samedleon