[humanities-dev] TEXTUS import questions

Tom Oinn tom.oinn at okfn.org
Thu May 24 21:21:52 UTC 2012


Hi David

On 24 May 2012 22:10, David Chiles <dwalterc at gmail.com> wrote:
> Hi,
>
> I'm new to this list and to TEXTUS as well. I'm working with a large
> collection of works and annotations that I want to import into a local
> TEXTUS instance.

Excellent - what's the corpus?

> Right now each work is split up into multiple HTML documents by chapter,
> section, book,  … and each one has normal HTML markup. The annotations are
> all, in the TEXTUS terms, "textus:comment". Currently the location of the
> annotation is stored as an xPath and character offset for the start and end.
> As well, the original quoted text is known.

> I've looked over the json_import_format from the github page and from what I
> gather all the HTML tags would have to be stripped from the documents and
> put into typography. Then for the annotations all the character offsets
> would need to be converted into overall offset for the entire document.

That's correct. I was chatting with Nick Stenning earlier today about
getting together one day at the OpenBiblio hack meet in a couple of
weeks to implement exactly this and get the OpenShakespeare data in as
a test.
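
As a rough illustration of the offset conversion (a sketch only, not
code from the TEXTUS repository): assuming each chapter has already
been flattened into an ordered list of text nodes keyed by the XPath
the annotations use, the document-wide offset is the length of all
text before the node plus the local character offset. The names
buildOffsetIndex and toGlobalRange, and the shape of the input, are
made up for this example.

```javascript
// Hypothetical input shape: chapters already reduced to ordered text nodes,
// each tagged with the XPath that the stored annotations refer to.
const chapters = [
  { id: "ch1", nodes: [
    { xpath: "/html/body/p[1]", text: "To be, or not to be, " },
    { xpath: "/html/body/p[2]", text: "that is the question." }
  ] },
  { id: "ch2", nodes: [
    { xpath: "/html/body/p[1]", text: " Whether 'tis nobler" }
  ] }
];

// Concatenate all text and remember where each text node starts globally.
function buildOffsetIndex(chapterList) {
  const index = new Map(); // "chapterId|xpath" -> global offset of node start
  let text = "";
  for (const chapter of chapterList) {
    for (const node of chapter.nodes) {
      index.set(chapter.id + "|" + node.xpath, text.length);
      text += node.text;
    }
  }
  return { text, index };
}

// Convert an (XPath, character offset) pair for start and end into
// offsets relative to the whole concatenated document.
function toGlobalRange(index, annotation) {
  const startBase = index.get(annotation.chapter + "|" + annotation.startXPath);
  const endBase = index.get(annotation.chapter + "|" + annotation.endXPath);
  if (startBase === undefined || endBase === undefined) {
    throw new Error("Annotation refers to an unknown text node");
  }
  return {
    start: startBase + annotation.startOffset,
    end: endBase + annotation.endOffset
  };
}

const { text, index } = buildOffsetIndex(chapters);
const range = toGlobalRange(index, {
  chapter: "ch1",
  startXPath: "/html/body/p[2]", startOffset: 12,
  endXPath: "/html/body/p[2]", endOffset: 20,
  quoted: "question"
});
console.log(range, text.slice(range.start, range.end));
```

Since the original quoted text is known, comparing it against
text.slice(range.start, range.end) is a cheap sanity check on each
converted annotation.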

> Also I wasn't clear on how the import file was actually imported once the
> json file was created.

Quite - at the moment the set of command line tools is rather ad hoc,
mostly made up of whatever I found useful while testing!

If you look at https://github.com/tomoinn/textus/blob/master/src/tools/import-wikisource.js
you'll see the fairly simple code that imports the data as a new
document through the datasource implementation (at the moment this
assumes an ElasticSearch database running in its default
configuration on the local machine). You can ignore the
'createDummyAnnotations' function; the other function shows a couple
of things, though.

Firstly, at the moment the Textus interface uses the top-level
structure nodes to list texts in the 'show all texts' view. This
means that if your import actually consists of multiple texts, you
can create multiple level 0 nodes, and the description and name
properties are used as one might expect.

Secondly, all you actually need to do, having acquired a datastore
object, is call the importData function on it, passing in the data
structure containing your text, annotations of both kinds and
structure nodes, along with a function which will be called on success
or failure.
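
A minimal sketch of that call, with loud caveats: the stub below
stands in for the real datastore so the example runs on its own. The
field names (text, typography, semantics, structure, depth, start,
end), the type names other than "textus:comment", and the error-first
callback shape are all assumptions for illustration, not copied from
json_import_format. Check the format document and
import-wikisource.js for the real names.

```javascript
// Stub standing in for the real datastore object (hypothetical: the real
// one talks to ElasticSearch). Only the idea of importData(data, callback)
// comes from the description above; everything else here is assumed.
const stubDatastore = {
  imported: [],
  importData(data, callback) {
    if (typeof data.text !== "string") {
      callback(new Error("import data has no text"));
      return;
    }
    this.imported.push(data);
    callback(null);
  }
};

const partOne = "First text. ";
const partTwo = "Second text.";

// Assumed layout: one text string, two kinds of annotation, and structure
// nodes. Two level 0 nodes mean two entries in the 'show all texts' view.
const importPayload = {
  text: partOne + partTwo,
  typography: [
    { type: "textus:paragraph", start: 0, end: partOne.length },
    { type: "textus:paragraph", start: partOne.length,
      end: partOne.length + partTwo.length }
  ],
  semantics: [
    { type: "textus:comment", start: 0, end: 5,
      payload: { text: "An example comment" } }
  ],
  structure: [
    { type: "textus:document", depth: 0, start: 0,
      name: "First", description: "The first imported text" },
    { type: "textus:document", depth: 0, start: partOne.length,
      name: "Second", description: "The second imported text" }
  ]
};

// The callback is assumed to receive an error on failure and nothing on
// success; the stub calls it synchronously, a real datastore would not.
let importStatus = "pending";
stubDatastore.importData(importPayload, function (err) {
  importStatus = err ? "failed: " + err.message : "ok";
});
console.log(importStatus);
```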

Good to have the interest! Can you tell us a bit more about your
project, though? Hopefully there'll be some re-use possible if you're
writing import logic!

Tom

-- 
Tom Oinn
+44 (0) 20 8123 5142 or Skype ID 'tomoinn'
http://www.crypticsquid.com



