[humanities-dev] Extracting Texts from Wikisource

Etienne Posthumus eposthumus at gmail.com
Mon Oct 1 09:56:51 UTC 2012


Hi Sam

On 1 October 2012 11:37, Sam Leon <sam.leon at okfn.org> wrote:
> Does anyone have any ideas on how to extract the plain text from Wikisource
> other than manually copying and pasting each section? This would obviously
> be something we should aim to automate via a script.

You can request the raw MediaWiki source of a page like so:

http://en.wikisource.org/w/index.php?action=raw&title=De_Cive

This returns the page's wikitext: the plain text, with some markup and
metadata embedded. From there, a small script with a regex step could
strip out everything except the text you need.
Possibly the text is already usable in this format?
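
As a rough illustration of that "small script", here is a minimal Python
sketch. The helper names (raw_url, fetch_raw, strip_markup) and the
User-Agent string are made up for this example, and the regexes only cover
the most common wikitext constructs (comments, templates, links, bold and
italic quotes, HTML-like tags). Treat it as a starting point under those
assumptions, not a complete wikitext parser.

```python
import re
import urllib.parse
import urllib.request

# Endpoint taken from the URL above; everything else here is illustrative.
RAW_ENDPOINT = "http://en.wikisource.org/w/index.php"


def raw_url(title):
    """Build the action=raw URL for a page title (hypothetical helper)."""
    query = urllib.parse.urlencode({"action": "raw", "title": title})
    return f"{RAW_ENDPOINT}?{query}"


def fetch_raw(title):
    """Download the raw wikitext of a page as a string."""
    request = urllib.request.Request(
        raw_url(title),
        # Assumed placeholder User-Agent; replace with your own contact info.
        headers={"User-Agent": "wikisource-text-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def strip_markup(wikitext):
    """Remove common wikitext markup with regexes (approximate)."""
    # Drop HTML comments.
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)
    # Drop {{templates}}; loop so nested templates are peeled inside-out.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    # Drop HTML-like tags such as <ref> or <br />, keeping their contents.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of blank lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Usage would be along the lines of print(strip_markup(fetch_raw("De_Cive"))).
Pages that transclude their text from other pages may need extra handling
that this sketch does not attempt.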

cheers,

Etienne



