[humanities-dev] Letters data into Textus

Sat Feb 18 10:40:03 UTC 2012

Hi Tom,

The current code is on Bitbucket at
https://bitbucket.org/okfn/openletters and the XML is in /docs. The
current import format is pretty much:
<div>
 <volume>1</volume>
 <letter>Letter text goes here</letter>
 <correspondent>Henry Austin</correspondent>
 <salutation>HENRY</salutation>
 <place>Furnivals_Inn</place>
 <lettertext>1</lettertext>
 <date>0000-00-00</date>
</div>
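
As a rough illustration of how a record like this could be read, here is
a minimal sketch using Python's standard xml.etree.ElementTree. The
function name parse_letter is made up for this example and is not part
of the openletters code; it also assumes each <div> holds exactly one
letter with flat child elements, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the import format shown above.
SAMPLE = """<div>
 <volume>1</volume>
 <letter>Letter text goes here</letter>
 <correspondent>Henry Austin</correspondent>
 <salutation>HENRY</salutation>
 <place>Furnivals_Inn</place>
 <lettertext>1</lettertext>
 <date>0000-00-00</date>
</div>"""


def parse_letter(xml_text):
    """Return the child elements of one <div> record as a dict.

    Hypothetical helper: keys are the tag names, values are the
    stripped element text (empty string if an element has no text).
    """
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


letter = parse_letter(SAMPLE)
print(letter["correspondent"], letter["date"])
```

A plain dict keeps the sketch independent of whatever model Textus ends
up using; the field names are simply the tag names from the XML.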

The source material:
<source>
  <id>1</id>
  <author> Mamie Dickens and Georgina Hogarth</author>
  <title>THE LETTERS OF CHARLES DICKENS EDITED BY HIS SISTER-IN-LAW AND
HIS ELDEST DAUGHTER In Two Volumes VOL. I 1833 to 1856</title>
  <publication>London: CHAPMAN AND HALL</publication>
  <date>1880</date>
  <url>http://www.gutenberg.org/files/25852/25852.txt</url>
</source>

Open Correspondence held the text separately from the data, so the
structure outlined in the earlier email is pretty much the format.
There are one or two things that I'd like to combine in Textus, though,
such as the sources file and the letters.
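
To make the idea of joining the sources file and the letters a little
more concrete, here is a minimal sketch, again with
xml.etree.ElementTree. The helper names (record, combine) and the shape
of the combined structure are assumptions for illustration only. How a
letter is linked to its source is not settled here, so the sketch just
takes the letters belonging to one source as an argument. The source
record is abbreviated from the one above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the source record (title and publication omitted).
SOURCE_XML = """<source>
  <id>1</id>
  <author> Mamie Dickens and Georgina Hogarth</author>
  <date>1880</date>
  <url>http://www.gutenberg.org/files/25852/25852.txt</url>
</source>"""

LETTER_XML = """<div>
 <volume>1</volume>
 <letter>Letter text goes here</letter>
 <correspondent>Henry Austin</correspondent>
 <date>0000-00-00</date>
</div>"""


def record(xml_text):
    """Hypothetical helper: flatten one XML record into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def combine(source_xml, letter_xmls):
    """Hypothetical helper: nest the letters under their source.

    Assumes the caller already knows which letters belong to which
    source; that link is not defined by the XML shown here.
    """
    return {
        "source": record(source_xml),
        "letters": [record(x) for x in letter_xmls],
    }


combined = combine(SOURCE_XML, [LETTER_XML])
print(combined["source"]["author"], len(combined["letters"]))
```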

Do you have a standard data model for what Textus would require from the
source material for upload? I'd like to put it into that format and do
some tidying up where necessary. 

I've got some scripts which probably need some development, but earlier
experience suggests that a toolkit for extracting letters and data from
raw files would be useful. I'd be happy to try and help in this area if
it is required or thought helpful. I'm still learning about text
extraction and haven't done anything in a while, but it is something
that I might be coming back to.

The more I think about it (and type this), the more I feel I need to
build something along these lines for my own purposes, to push and
extend the letters data sets into what I originally thought they might
be.

Yours, 

Iain

On Fri, 2012-02-17 at 08:59 +0000, Tom Oinn wrote:
> Hi Iain,
> 
> We don't have a data format for importing yet - is there an example of
> what you currently have available somewhere? The model we're using for
> underlying storage is a text with markup held separately, so there's
> no 'native' format for text + markup + annotations, if that makes
> sense. Obviously we'll need such a thing for export and import to and
> from sources which can understand it, so your experience thus far will
> be very helpful on that front.
> 
> As we were discussing in the telcon the other night we want Textus to
> be applicable, ultimately, to data sets like those held in
> opencorrespondance etc.
> 
> Cheers,
> 
> Tom