[humanities-dev] TEXTUS format

Tom Oinn tom.oinn at okfn.org
Mon Apr 30 12:58:25 UTC 2012


(Originally from Ben Brumfield, posting to list because it's a
sensible question!)

On 26 April 2012 14:50, Ben Brumfield <benwbrum at gmail.com> wrote:
> I spotted your humanities-dev message regarding the book/collection
> tag library and wondered whether you'd also investigated TEI and
> decided it was inappropriate.  I don't know enough about the goals of
> TEXTUS to post on-list, but was curious if you'd found some
> deficiencies in what might otherwise be the obvious choice for
> humanities text mark-up.
>
> (Feel free to take this on-list if you like -- I didn't want to waste
> the list's time with what may be a tangent.)
>
> Ben Brumfield
> http://manuscripttranscription.blogspot.com/

TEI (Text Encoding Initiative - http://www.tei-c.org/index.xml) is an
XML-based representation intended to capture a large amount of
typographical, bibliographic and semantic metadata and encode it
within a single document along with the text. It appears to be a
sensible way to accomplish this.

In TEXTUS we're taking a slightly different approach, storing the text
and annotations separately.

We're doing this for a number of reasons. Perhaps most importantly, we
want to be able to expose texts early, when there is very little
annotation available. To do that, references need to stay stable as
annotation and structure are added: if you cite or otherwise refer to
a particular piece of text and we later add markup, comments or
structure (e.g. chapter boundaries), your reference must still point
to the same underlying text. This is technically awkward when
annotation and text are mixed together, and while XML is a reasonable
format for a single file, it's very poor when you want to extract a
piece of data by character range, as we do.

By storing the text and metadata separately we can easily retrieve a
sub-string of the text (it's literally a substring operation, albeit
in a database) and we can then query for all annotations that pertain
to that character range. The equivalent operation on an XML file is
rather ugly, since character offsets have to be counted across the
markup. We also have annotations which we expect to overlap
substantially, something to which XML's tree structure is not well
suited.
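
As a rough illustration of the idea (a sketch only, not TEXTUS code:
the Annotation class, its field names, the fetch function and the
sample text are all made up for this example, and a Python list
stands in for the database), separate storage might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation over a half-open range [start, end)."""
    start: int  # first character covered
    end: int    # one past the last character covered
    kind: str
    payload: str


# The text is stored on its own and is never modified by annotation.
TEXT = "Call me Ishmael. Some years ago, never mind how long precisely."

# "person" and "comment" overlap without nesting, which a single XML
# tree cannot express directly.
ANNOTATIONS = [
    Annotation(0, 16, "sentence", "first sentence"),
    Annotation(8, 15, "person", "narrator"),
    Annotation(12, 27, "comment", "runs across the sentence boundary"),
]


def fetch(text, annotations, start, end):
    """Return text[start:end] plus every annotation overlapping that range."""
    hits = [a for a in annotations if a.start < end and a.end > start]
    return text[start:end], hits


fragment, hits = fetch(TEXT, ANNOTATIONS, 8, 15)
print(fragment)                 # Ishmael
print([a.kind for a in hits])   # ['sentence', 'person', 'comment']

# Adding annotation later leaves existing references untouched:
# characters 8 to 15 still point at the same text.
ANNOTATIONS.append(Annotation(0, 63, "chapter", "Loomings"))
assert TEXT[8:15] == "Ishmael"
```

In a real store the overlap test would be a range query against
indexed start and end columns rather than a list comprehension.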

This is not to say we're ignoring TEI, but we expect to use it as an
import format rather than as the underlying information model for
TEXTUS. Being able to pull in and preserve as much metadata as
possible from a TEI document has been fairly high on our wishlist, and
I've been looking closely at the TEI spec to try to ensure this is
possible with minimal information loss.
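
To give a feel for what such an import involves, here is a minimal
sketch (not the TEXTUS importer; the flatten function and the
dictionary keys are invented for illustration and are not taken from
the import specification) that turns inline XML markup into plain
text plus character-range annotations, using Python's standard
xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET


def flatten(xml_source):
    """Split inline markup into (plain_text, stand-off annotations).

    Each element becomes one annotation covering the half-open
    character range [start, end) of the text it enclosed.
    """
    parts = []
    annotations = []
    offset = 0

    def walk(element):
        nonlocal offset
        start = offset
        if element.text:
            parts.append(element.text)
            offset += len(element.text)
        for child in element:
            walk(child)
            # Text after a child's closing tag belongs to the parent.
            if child.tail:
                parts.append(child.tail)
                offset += len(child.tail)
        annotations.append({
            "start": start,
            "end": offset,
            "type": element.tag,
            "attributes": dict(element.attrib),
        })

    walk(ET.fromstring(xml_source))
    return "".join(parts), annotations


text, annotations = flatten(
    '<p>Call me <name type="person">Ishmael</name>.</p>'
)
print(text)            # Call me Ishmael.
print(annotations[0])  # the <name> element, covering characters 8 to 15
```

Real TEI documents are namespaced, so element tags would come out as
"{http://www.tei-c.org/ns/1.0}p" and so on, and a faithful import
would also need to carry over the header metadata; the sketch ignores
both.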

Hope that helps. I've updated the import specification at
https://github.com/okfn/textus/blob/master/docs/json_import_format.md
which might give more of an idea of what we're expecting TEXTUS to
contain, at least at document import time.

Cheers,

Tom

-- 
Tom Oinn
+44 (0) 20 8123 5142 or Skype ID 'tomoinn'
http://www.crypticsquid.com



