[humanities-dev] [open-humanities] The importance of search

Nick Stenning nick at whiteink.com
Fri Feb 24 11:37:37 UTC 2012


Hi Jonathan,

Of course, you're absolutely right. Being able to do the searches you
describe would be an incredibly powerful tool for a scholar -- right
with you on this one. I just wanted to add a few comments on the
technology:

1) Search is easy. We now have tools (Lucene/ElasticSearch/Solr) that
basically solve all the "hard" problems of search: tokenization,
indexing, adjust-at-runtime scoring, etc.

2) Search is really, really hard. The "hard" problems I've just
listed really aren't all that hard. Most importantly, they're
concrete: once you've designed an algorithm, the answer to the
question "does it work?" is usually either "yes" or "no," rather than
"well, sort of, but about 1/3 of the time it does something a bit
funny." The genuinely hard stuff is the fluffy and intangible business
of search heuristics.

So, to pick up on that, I wanted to emphasise that in order to
effectively solve the problems you describe in your email, the single
most important thing for TEXTUS/A. N. Other Tool to understand is
*context*.

"being able to ... see all the times Nietzsche mentions Novalis"

Here, the hidden heuristic is "search only documents written by
Nietzsche". This would be trivial to implement manually, right? Just
require the user to type in "author:Nietzsche". But a) for people with
less unusual names, a name alone doesn't uniquely identify the author,
potentially generating many spurious results, and b) this could
instead be a simple "same author" checkbox. A simple heuristic that
says "users frequently want to search works of the author they are
currently reading" helps out a lot.
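
As a rough sketch of that heuristic (not a claim about how TEXTUS
does or should implement it), the default could be expressed as an
Elasticsearch-style bool query built as a plain dict. The field names
"text" and "author_id", the function build_query, and the same_author
flag are all assumptions made up for illustration. Filtering on an
author identifier rather than a name string is one way around the
ambiguity in (a).

```python
from typing import Optional


def build_query(terms: str,
                current_author_id: Optional[str] = None,
                same_author: bool = True) -> dict:
    """Build an Elasticsearch-style bool query as a plain dict.

    Hypothetical field names: "text" (document body) and "author_id"
    (an unambiguous identifier, not a display name).
    """
    query = {"bool": {"must": [{"match": {"text": terms}}]}}
    # Heuristic default: restrict to the author currently being read,
    # unless the user has unticked "same author" or there is no context.
    if same_author and current_author_id is not None:
        query["bool"]["filter"] = [{"term": {"author_id": current_author_id}}]
    return {"query": query}


# Reading Nietzsche, searching for "Novalis": filtered by default.
default_query = build_query("Novalis", current_author_id="nietzsche")
# The user overrides the heuristic and searches everything.
global_query = build_query("Novalis", "nietzsche", same_author=False)
```

The point of the same_author flag is that the heuristic stays
overridable: the default does what users usually want, and one click
turns it off.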

You can go much further with this, and I'd suggest you do, by building
a system that implements simple (but overridable) heuristics that
reflect what users *usually* do. In addition, context is important in
reverse. Don't just give people links to documents that match; give
them (as Google frequently does) the matching extract itself, in
context.
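
To illustrate the "extract in context" idea, here is a minimal
keyword-in-context sketch in plain Python. The function name extracts
and the width parameter are hypothetical, and it does naive
whole-word, case-insensitive matching only. Lucene-based engines ship
their own highlighters, so this is just meant to show the shape of
the result, not a recommended implementation.

```python
import re
from typing import List


def extracts(text: str, term: str, width: int = 40) -> List[str]:
    """Return each match of `term` with up to `width` characters either side."""
    results = []
    # Whole-word, case-insensitive match; re.escape keeps the term literal.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        snippet = text[start:end].replace("\n", " ")
        # Mark truncation so the reader knows the extract is partial.
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        results.append(prefix + snippet + suffix)
    return results


sample = "In his notebooks he returns to Novalis more than once."
for line in extracts(sample, "novalis", width=10):
    print(line)
```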

So, that's just a few thoughts about what I think is usually missing
from the kinds of search system you describe. I'd say that designing
your system falls into two stages: first, identifying exactly what
kinds of searches people really do most frequently, and second,
attempting to design a search that embraces those heuristics, while
remaining general and flexible.

No mean feat, I might add.

-N






On Thu, Feb 23, 2012 at 22:40, Jonathan Gray <j.gray at cantab.net> wrote:
> I've just been doing various bits of academic reading and writing, and
> it has just struck me with a force bigger and mightier than ever
> before: the importance of search. Such an important thing for TEXTUS
> to get right.
>
> For example, being able to do things like see all the times Nietzsche
> mentions Novalis. Or to find bits where Herder talks about the French
> Revolution. Or to see who actually read or cited works by Frederick
> the Great. Especially if we can enable people to do (ever more)
> comprehensive searches across a given thinker's corpus. Having more
> and more letters and manuscripts in the system would mean this could
> be fantastically useful.
>
> It might be a trivial thing which we know how to flawlessly implement,
> or it might be a really difficult, totally non-trivial thing that
> loads of people have struggled with, but I thought it was worth
> putting down my book and writing an email about, given how important
> I now think getting this right is. ;-)
>
> One possibly non-obvious thing I thought of was the idea that if you
> search for 'Nietzsche' or another philosopher that we have data for in
> a given text or collection, the system could cunningly give you the
> option for searching for works by Nietzsche as well (or - two steps
> ahead - ambiently give you the results of such a search). I'm sure
> this would entail nightmarish semanticisation or technical acrobatics
> beyond the scope of this project, but 'just sayin' how cool it would
> be.
>
> J.
>
> --
> Jonathan Gray
> http://jonathangray.org
>
> _______________________________________________
> open-humanities mailing list
> open-humanities at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-humanities



