[humanities-dev] Extracting Texts from Wikisource
sam.leon at okfn.org
Mon Oct 1 09:37:12 UTC 2012
I'm preparing some texts on an instance of TEXTUS for a series of
reading groups to start next week.
The current TEXTUS setup requires that these texts be saved as .txt files
and marked up with Wikitext.
For a couple of the texts I'm curating, Gutenberg does not hold a copy. This
means I have to use Wikisource, which doesn't give the option of downloading
a UTF-8 version.
Does anyone have any ideas on how to extract the plain text from Wikisource
other than manually copying and pasting each section? This would obviously
be something we should aim to automate via a script.
For an example of what I'm talking about, see the Wikisource entry for
Hobbes's "De Cive" <http://en.wikisource.org/wiki/De_Cive>
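One possible starting point for such a script, offered as a rough sketch rather than a tested solution: Wikisource runs on MediaWiki, so the raw Wikitext of a page should be retrievable through the standard MediaWiki query API. The endpoint path, the User-Agent string and all function names below are my own assumptions for illustration. A work split across several sub-pages would presumably need one call per sub-page title.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the usual MediaWiki API location on English Wikisource.
API_URL = "https://en.wikisource.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build a query URL asking for the latest revision's Wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
        "titles": title,
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Pull the Wikitext out of a decoded formatversion=2 JSON response."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        raise KeyError("page not found: " + page.get("title", ""))
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title):
    """Download the Wikitext of one page (needs network access)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Hypothetical User-Agent; replace with something identifying you.
        headers={"User-Agent": "textus-import-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_wikitext(json.load(resp))


def save_as_txt(title, path):
    """Write the page's Wikitext to a UTF-8 encoded .txt file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(fetch_wikitext(title))
```

Calling save_as_txt("De_Cive", "de_cive.txt") would then write whatever Wikitext sits on that page to a UTF-8 file. Whether the result is clean enough for TEXTUS without further tidying is something I have not checked.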
Ideas much appreciated!
Open Knowledge Foundation