[open-humanities] The importance of search

Jonathan Gray j.gray at cantab.net
Fri Feb 24 11:58:23 UTC 2012

Extremely useful comments - thank you very much Nick! :-)

On Fri, Feb 24, 2012 at 11:37 AM, Nick Stenning <nick at whiteink.com> wrote:
> Hi Jonathan,
> Of course, you're absolutely right. Being able to do the searches you
> describe would be an incredibly powerful tool for a scholar -- right
> with you on this one. I just wanted to add a few comments on the
> technology:
> 1) Search is easy. We now have tools (Lucene/ElasticSearch/Solr) that
> basically solve all the "hard" problems of search: tokenization,
> indexing, scoring that can be adjusted at query time, etc.
> 2) Search is really, really hard. Of course, the "hard" problems I've
> just described really aren't all that hard. Most importantly, they're
> concrete, which means that once you've designed an algorithm, the
> answer to the question "does it work?" is usually either "yes" or
> "no," rather than "well, sort of, but about 1/3 of the time it does
> something a bit funny." The genuinely hard part is the fuzzy,
> intangible search heuristics.
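
For reference, the "concrete" problems in point 1 (tokenization and indexing) can be sketched in a few lines. This is only a toy illustration, not how Lucene, ElasticSearch or Solr work internally; the function names and the sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, keyed by an arbitrary id.
docs = {
    1: "Herder on the French Revolution",
    2: "French poetry",
    3: "The revolution in music",
}
index = build_index(docs)
matches = lookup(index, "french revolution")
```

The point of the sketch is that its correctness is easy to check: a document either contains all the query tokens or it does not. Nothing here addresses the heuristics discussed in point 2.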
> So, to pick up on that, I wanted to emphasise that in order to
> effectively solve the problems you describe in your email, the single
> most important thing for TEXTUS/A. N. Other Tool to understand is
> *context*.
> "being able to ... see all the times Nietzsche mentions Novalis"
> Here, the hidden heuristic is "search only documents written by
> Nietzsche" -- this would be trivial to implement manually, right? Just
> require the user to type in "author:Nietzsche". But a) for people with
> less unusual names, this doesn't uniquely identify them, potentially
> generating many spurious results, and b) this could be a simple "same
> author" checkbox. A simple heuristic that says "users frequently want
> to search works of the author they are currently reading" helps out a
> lot.
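
The "same author" default described above might look something like this. It is a minimal sketch under several assumptions: the `Document` type, the `author_id` field (a stable identifier rather than a display name, to sidestep the ambiguous-name problem) and the sample texts are all hypothetical, and plain substring matching stands in for a real search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    author_id: str  # hypothetical stable identifier, not a display name
    title: str
    text: str


def search(docs, query, current_author_id=None, same_author=True):
    """Return documents whose text contains `query`, ignoring case.

    If the reader is currently viewing a work (current_author_id is set)
    and the "same author" box is ticked, only that author's documents
    are returned. Unticking the box overrides the heuristic.
    """
    needle = query.lower()
    hits = [d for d in docs if needle in d.text.lower()]
    if same_author and current_author_id is not None:
        hits = [d for d in hits if d.author_id == current_author_id]
    return hits


# Made-up sample data for illustration only.
docs = [
    Document("nietzsche", "Notebook A", "A remark on Novalis and the fragment."),
    Document("nietzsche", "Notebook B", "On tragedy and music."),
    Document("herder", "Letter C", "Novalis is mentioned here as well."),
]

# Reader is viewing a Nietzsche text, so the default narrows to his works.
narrowed = search(docs, "novalis", current_author_id="nietzsche")
# Unticking "same author" searches everything.
everything = search(docs, "novalis", current_author_id="nietzsche", same_author=False)
```

The filter is on by default but is a single argument away from being switched off, which is the "simple but overridable" property.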
> You can go much further with this, and I'd suggest you do, by building
> a system that implements simple (but overridable) heuristics that
> reflect what users *usually* do. Context also matters in the other
> direction, in the results: don't just give people links to documents
> that match; give them (as Google frequently does) the matching extract
> itself, in context.
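
Showing the matching extract in context could be sketched roughly as follows. This is an illustrative keyword-in-context helper, not what Google or any particular search engine does; the function name `snippets`, the `width` parameter and the sample sentence are assumptions made for the example.

```python
import re


def snippets(text, query, width=30):
    """Return each match of `query` with up to `width` characters of
    surrounding text on either side, marking truncation with "..."."""
    if not query:
        return []
    results = []
    # re.escape treats the query as a literal string, not a pattern.
    for match in re.finditer(re.escape(query), text, re.IGNORECASE):
        left = max(0, match.start() - width)
        right = min(len(text), match.end() + width)
        prefix = "..." if left > 0 else ""
        suffix = "..." if right < len(text) else ""
        results.append(prefix + text[left:right] + suffix)
    return results


# Made-up sample sentence for illustration only.
sample = "He read Novalis early. Later, Novalis returned."
extracts = snippets(sample, "novalis", width=8)
```

A real system would probably break on word or sentence boundaries rather than a fixed character count, but the idea is the same: return the passage, not just a link to the document.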
> So, those are just a few thoughts about what I think is usually missing
> from the kinds of search system you describe. I'd say that designing
> your system falls into two stages: first, identifying exactly what
> kinds of searches people really do most frequently, and second,
> attempting to design a search that embraces those heuristics, while
> remaining general and flexible.
> No mean feat, I might add.
> -N
> On Thu, Feb 23, 2012 at 22:40, Jonathan Gray <j.gray at cantab.net> wrote:
>> I've just been doing various bits of academic reading and writing, and
>> it has just struck me with a force bigger and mightier than ever
>> before: the importance of search. Such an important thing for TEXTUS
>> to get right.
>> For example, being able to do things like see all the times Nietzsche
>> mentions Novalis. Or to find bits where Herder talks about the French
>> revolution. Or to see who actually read or cited works by Frederick
>> the Great. Especially if we can enable people to do (ever more)
>> comprehensive searches across a given thinker's corpus. Having more
>> and more letters and manuscripts in the system would mean this could
>> be fantastically useful.
>> It might be a trivial thing which we know how to implement flawlessly,
>> or it might be a really difficult, totally non-trivial thing that
>> loads of people have struggled with, but I thought it was worth
>> putting down my book and writing an email about, given how important
>> I now think it is to get this right. ;-)
>> One possibly non-obvious thing I thought of was the idea that if you
>> search for 'Nietzsche' or another philosopher that we have data for in
>> a given text or collection, the system could cunningly give you the
>> option for searching for works by Nietzsche as well (or - two steps
>> ahead - ambiently give you the results of such a search). I'm sure
>> this would entail nightmarish semanticisation or technical acrobatics
>> beyond the scope of this project, but 'just sayin' how cool it would
>> be.
>> J.
>> --
>> Jonathan Gray
>> http://jonathangray.org
>> _______________________________________________
>> open-humanities mailing list
>> open-humanities at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-humanities

Jonathan Gray
