[humanities-dev] Extracting Texts from Wikisource
Etienne Posthumus
eposthumus at gmail.com
Mon Oct 1 09:56:51 UTC 2012
Hi Sam
On 1 October 2012 11:37, Sam Leon <sam.leon at okfn.org> wrote:
> Does anyone have any ideas on how to extract the plain text from Wikisource
> other than manually copying and pasting each section? This would obviously
> be something we should aim to automate via a script.
You can request the raw MediaWiki markup for a page like so:
http://en.wikisource.org/w/index.php?action=raw&title=De_Cive
This gives you the plain text, with some markup and metadata embedded.
From here it would take only a small script with a few regex steps to
strip out everything except the text you need.
Possibly the text in this format is already usable?
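A rough sketch of what such a script could look like, in Python. The
function names (fetch_raw, strip_wikitext), the https form of the URL,
the User-Agent string and the particular regexes are all my own
assumptions for illustration, not anything Wikisource prescribes. Regexes
only approximate wikitext, so tables and unusual templates would need
extra handling.

```python
import re
import urllib.parse
import urllib.request

# Same index.php endpoint as the URL above (https is an assumption).
RAW_URL = "https://en.wikisource.org/w/index.php"


def fetch_raw(title):
    """Download the raw wikitext of one page via action=raw."""
    query = urllib.parse.urlencode({"action": "raw", "title": title})
    request = urllib.request.Request(
        RAW_URL + "?" + query,
        # Hypothetical client name; sent to identify the script to the server.
        headers={"User-Agent": "wikisource-text-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def strip_wikitext(text):
    """Approximate plain text by removing common wiki markup with regexes."""
    # Drop HTML comments.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Drop {{templates}}, innermost first, until none remain.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Drop footnotes: self-closing <ref ... /> first, then <ref>...</ref>.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop any remaining HTML-style tags but keep their contents.
    text = re.sub(r"<[^>]+>", "", text)
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove the quote runs used for bold and italic.
    text = re.sub(r"'{2,}", "", text)
    # == Heading == becomes Heading.
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text,
                  flags=re.MULTILINE)
    # Collapse runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Example use (needs network access):
# print(strip_wikitext(fetch_raw("De_Cive")))
```

The stripping step is kept separate from the download so it can be tried
on saved copies of pages without hitting the server each time.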
cheers,
Etienne