[humanities-dev] Extracting Texts from Wikisource

Mon Oct 1 09:37:12 UTC 2012

Hi All,

I'm preparing some texts on an instance of TEXTUS [1] for a series of
reading groups to start next week.

The current TEXTUS setup requires that these texts be saved as .txt files
and marked up with Wikitext.

For a couple of the texts I'm curating, Gutenberg does not hold a copy. This
means I have to use Wikisource, which doesn't give the option of downloading
a UTF-8 version.

Does anyone have any ideas on how to extract the plain text from Wikisource
other than manually copying and pasting each section? This would obviously
be something we should aim to automate via a script.

For an example of what I'm talking about, see the Wikisource entry for Hobbes's
"De Cive" <http://en.wikisource.org/wiki/De_Cive> [2]
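
One possible starting point, offered as an untested sketch rather than a
known-good solution: MediaWiki sites can return the raw wikitext of a page
when action=raw is passed to index.php. The /w/index.php path, the
User-Agent string and the function names below are assumptions for
illustration. A work that is split across subpages would need one call per
subpage title.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: Wikisource exposes the standard MediaWiki index.php entry point here.
INDEX_URL = "https://en.wikisource.org/w/index.php"


def raw_url(title: str) -> str:
    """Build the URL that asks MediaWiki for a page's raw wikitext."""
    return INDEX_URL + "?" + urlencode({"title": title, "action": "raw"})


def fetch_wikitext(title: str) -> str:
    """Download the wikitext of one page and decode it as UTF-8."""
    # Hypothetical User-Agent value; replace with something that identifies you.
    request = Request(raw_url(title), headers={"User-Agent": "textus-import-sketch/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def save_as_txt(title: str, path: str) -> None:
    """Write one page's wikitext to a UTF-8 encoded .txt file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(fetch_wikitext(title))
```

Calling save_as_txt("De_Cive", "de_cive.txt") would write whatever wikitext
the top-level page holds. The result would still need checking by hand,
since templates and transclusions are returned unexpanded.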

Ideas much appreciated!

Cheers,
Sam

[1] http://mytextus.herokuapp.com/
[2] http://en.wikisource.org/wiki/De_Cive

-- 
Sam Leon
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Twitter: @noeL_maS
Skype: samedleon