[humanities-dev] [open-humanities] The importance of search

Sam Leon sam.leon at okfn.org
Fri Feb 24 13:19:07 UTC 2012


Jonathan, Nick, Laura, Open Humanists,

Really interesting points.

Good search capability on the Internet Archive, which the unreliability
of its OCR scans undermines, was something I sorely missed during my
dissertation.

As Nick says, many of the very useful searches we want to do will be trivial
to implement. The fact is, I've never seen any platform that lets you
search for a given word across the whole body of an author's work.

From a UI perspective it would be good to nail down the kinds of searches
people will want to do with TEXTUS so that we can give them an
interface that matches that - something for the next user requirements
workshop!

@Jonathan - could you add the kind of search that you were talking about to
the user stories on the Wiki please?

Sam

On Fri, Feb 24, 2012 at 11:37 AM, Nick Stenning <nick at whiteink.com> wrote:

> Hi Jonathan,
>
> Of course, you're absolutely right. Being able to do the searches you
> describe would be an incredibly powerful tool for a scholar -- right
> with you on this one. I just wanted to add a few comments on the
> technology:
>
> 1) Search is easy. We now have tools (Lucene/ElasticSearch/Solr) that
> basically solve all the "hard" problems of search: tokenization,
> indexing, adjust-at-runtime scoring, etc.
>
> 2) Search is really, really hard. Of course, the "hard" problems I've
> just described really aren't all that hard. Most importantly, they're
> concrete, which means that at least once you've designed an algorithm,
> the answer to the question "does it work?" is usually either "yes" or
> "no," rather than "well, sort of, but about 1/3 of the time it does
> something a bit funny." The hard stuff is the fluffy and intangible
> search heuristic.
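
To make point 1 concrete, here is a minimal sketch of the "solved" pieces
(tokenization, indexing, scoring) as a toy in-memory inverted index. It is
an illustration only, not how Lucene, ElasticSearch or Solr work
internally; the function names and the raw term-count scoring are
assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> Counter of {doc_id: occurrences}.

    `docs` is a dict of {doc_id: text}; the shape is an assumption
    made for this sketch.
    """
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def search(index, query):
    """Rank documents by how often the query tokens occur in them."""
    scores = Counter()
    for token in tokenize(query):
        for doc_id, count in index.get(token, {}).items():
            scores[doc_id] += count
    # most_common() sorts by descending score.
    return [doc_id for doc_id, _ in scores.most_common()]
```

Everything here gives a yes/no answer to "does it work?"; none of it
decides which documents the user actually meant to search, which is the
hard part described in point 2.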
>
> So, to pick up on that, I wanted to emphasise that in order to
> effectively solve the problems you describe in your email, the single
> most important thing for TEXTUS/A. N. Other Tool to understand is
> *context*.
>
> "being able to ... see all the times Nietzsche mentions Novalis"
>
> Here, the hidden heuristic is "search only documents written by
> Nietzsche" -- this would be trivial to implement manually, right? Just
> require the user to type in "author:Nietzsche". But a) for people with
> less unusual names, this doesn't uniquely identify them, potentially
> generating many spurious results, and b) this could be a simple "same
> author" checkbox. A simple heuristic that says "users frequently want
> to search works of the author they are currently reading" helps out a
> lot.
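
A rough sketch of that "same author" default, assuming a hypothetical
Document record with an author_id field (an identifier rather than a
bare name, because of point a) and a plain substring match standing in
for a real search backend:

```python
from dataclasses import dataclass


@dataclass
class Document:
    author_id: str  # hypothetical unique identifier, not a display name
    title: str
    text: str


def search(docs, term, current_author_id=None, same_author=True):
    """Return documents whose text contains `term`.

    By default the search is limited to the author the user is
    currently reading; passing same_author=False overrides that,
    like unticking a "same author" checkbox.
    """
    needle = term.lower()
    hits = []
    for doc in docs:
        restricted = same_author and current_author_id is not None
        if restricted and doc.author_id != current_author_id:
            continue
        if needle in doc.text.lower():
            hits.append(doc)
    return hits
```

The heuristic is only the default value of one parameter, which is what
keeps it overridable.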
>
> You can go much further with this, and I'd suggest you do, by building
> a system that implements simple (but overridable) heuristics that
> reflect what users *usually* do. In addition, context is important in
> reverse. Don't just give people links to documents that match, give
> them (as Google frequently does) the matching extract itself, in
> context.
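
Showing the matching extract can be sketched as a simple
keyword-in-context function. This is an assumption-laden illustration
(fixed character window, case-insensitive literal match), not a
description of how Google or any search engine builds its snippets:

```python
import re


def snippets(text, term, width=30):
    """Return every match of `term` with surrounding context.

    Each snippet includes up to `width` characters on either side of
    the match; the match itself is case-insensitive and literal.
    """
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        results.append(text[start:end])
    return results
```

A real interface would probably cut on word or sentence boundaries
rather than raw character offsets, and highlight the match.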
>
> So, that's just a few thoughts about what I think is usually missing
> from the kinds of search system you describe. I'd say that designing
> your system falls into two stages: first, identifying exactly what
> kinds of searches people really do most frequently, and second,
> attempting to design a search that embraces those heuristics, while
> remaining general and flexible.
>
> No mean feat, I might add.
>
> -N
>
> On Thu, Feb 23, 2012 at 22:40, Jonathan Gray <j.gray at cantab.net> wrote:
> > I've just been doing various bits of academic reading and writing, and
> > it has just struck me with a force bigger and mightier than ever
> > before: the importance of search. Such an important thing for TEXTUS
> > to get right.
> >
> > For example, being able to do things like see all the times Nietzsche
> > mentions Novalis. Or to find bits where Herder talks about the French
> > revolution. Or to see who actually read or cited works by Frederick
> > the Great. Especially if we can enable people to do (ever more)
> > comprehensive searches across a given thinker's corpus. Having more
> > and more letters and manuscripts in the system would mean this could
> > be fantastically useful.
> >
> > It might be a trivial thing which we know how to flawlessly implement,
> > or it might be a really difficult, totally non-trivial thing that
> > loads of people have struggled with, but thought it was worth putting
> > down my book and writing an email about due to the level of importance
> > I now think getting this right has. ;-)
> >
> > One possibly non-obvious thing I thought of was the idea that if you
> > search for 'Nietzsche' or another philosopher that we have data for in
> > a given text or collection, the system could cunningly give you the
> > option for searching for works by Nietzsche as well (or - two steps
> > ahead - ambiently give you the results of such a search). I'm sure
> > this would entail nightmarish semanticisation or technical acrobatics
> > beyond the scope of this project, but 'just sayin' how cool it would
> > be.
> >
> > J.
> >
> > --
> > Jonathan Gray
> > http://jonathangray.org
> >
> > _______________________________________________
> > open-humanities mailing list
> > open-humanities at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/open-humanities
>



-- 
Sam Leon
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Skype: samedleon