No subject

Sun Dec 12 18:29:16 GMT 2010

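Jonathan, Nick, Laura, Open Humanists,

Really interesting points.

Good search capability was something I sorely missed on the Internet
Archive during my dissertation, given the unreliability of its OCR scans.

As Nick says, many of the very useful searches we want to do will be
trivial to implement. The fact is, I've never seen any platform that
allows you to search for a given word across the whole body of an
author's work.

From a UI perspective it would be good to nail down the kinds of searches
people will want to do with TEXTUS so that we can give them an
interface that matches that - something for the next user requirements
workshop!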

@Jonathan - could you add the kind of search that you were talking about to
the user stories on the Wiki please?

Sam

On Fri, Feb 24, 2012 at 11:37 AM, Nick Stenning <nick at whiteink.com> wrote:

> Hi Jonathan,
>
> Of course, you're absolutely right. Being able to do the searches you
> describe would be an incredibly powerful tool for a scholar -- right
> with you on this one. I just wanted to add a few comments on the
> technology:
>
> 1) Search is easy. We now have tools (Lucene/ElasticSearch/Solr) that
> basically solve all the "hard" problems of search: tokenization,
> indexing, adjust-at-runtime scoring, etc.
>
> 2) Search is really, really hard. Of course, the "hard" problems I've
> just described really aren't all that hard. Most importantly, they're
> concrete, which means that at least once you've designed an algorithm,
> the answer to the question "does it work?" is usually either "yes" or
> "no," rather than "well, sort of, but about 1/3 of the time it does
> something a bit funny." The hard stuff is the fluffy and intangible
> search heuristic.
>
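To make the "easy" half of the two points above concrete, here is a minimal sketch of tokenization, indexing and scoring in plain Python. It is only an illustration of what tools like Lucene do far more thoroughly; the function names (`tokenize`, `build_index`, `search`) and the raw term-count scoring are assumptions made for this example, not anything from TEXTUS or those libraries.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents containing every query token, best score first.

    The score is just the summed term count, a deliberately crude
    stand-in for real relevance scoring.
    """
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, Counter()) for token in tokens]
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    scores = {doc_id: sum(p[doc_id] for p in postings) for doc_id in matching}
    # Sort by descending score, then by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Whether a ranking like this is *good* is exactly the fuzzy part: the code either works or it doesn't, but the choice of score is a heuristic.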
> So, to pick up on that, I wanted to emphasise that in order to
> effectively solve the problems you describe in your email, the single
> most important thing for TEXTUS/A. N. Other Tool to understand is
> *context*.
>
> "being able to ... see all the times Nietzsche mentions Novalis"
>
> Here, the hidden heuristic is "search only documents written by
> Nietzsche" -- this would be trivial to implement manually, right? Just
> require the user to type in "author:Nietzsche". But a) for people with
> less unusual names, this doesn't uniquely identify them, potentially
> generating many spurious results, and b) this could be a simple "same
> author" checkbox. A simple heuristic that says "users frequently want
> to search works of the author they are currently reading" helps out a
> lot.
>
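A rough sketch of the "same author" default described above, assuming each document carries a unique author identifier rather than just a name (which is what sidesteps the ambiguity in point a). The `Document` class, the `author_id` field and the `search` signature are all hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    author_id: str  # hypothetical unique identifier, not a display name
    title: str
    text: str


def search(docs, term, current_author_id=None, same_author=True):
    """Return documents whose text contains `term` (case-insensitive).

    By default the results are limited to the author the user is
    currently reading; passing same_author=False overrides that
    heuristic, like unticking the checkbox.
    """
    needle = term.lower()
    hits = [doc for doc in docs if needle in doc.text.lower()]
    if same_author and current_author_id is not None:
        hits = [doc for doc in hits if doc.author_id == current_author_id]
    return hits
```

The point of the design is that the heuristic is a default argument, not a hard rule: the common case needs no typing, and the uncommon case is one flag away.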
> You can go much further with this, and I'd suggest you do, by building
> a system that implements simple (but overridable) heuristics that
> reflect what users *usually* do. In addition, context is important in
> reverse. Don't just give people links to documents that match, give
> them (as Google frequently does) the matching extract itself, in
> context.
>
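Showing the matching extract in context can be sketched as a simple keyword-in-context function. This is an illustrative assumption about how it might look, not how Google or any search library does it; `extracts` and its `width` parameter are names made up for the example.

```python
import re


def extracts(text, term, width=30):
    """Return every occurrence of `term` with surrounding context.

    Each extract includes up to `width` characters on either side of
    the match, with "..." marking where the text was cut.
    """
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        results.append(prefix + text[start:end] + suffix)
    return results
```

A real system would probably cut on word or sentence boundaries rather than raw character counts, but the idea is the same: return the passage, not just a link to the document.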
> So, that's just a few thoughts about what I think is usually missing
> from the kinds of search system you describe. I'd say that designing
> your system falls into two stages: first, identifying exactly what
> kinds of searches people really do most frequently, and second,
> attempting to design a search that embraces those heuristics, while
> remaining general and flexible.
>
> No mean feat, I might add.
>
> -N
>
> On Thu, Feb 23, 2012 at 22:40, Jonathan Gray <j.gray at cantab.net> wrote:
> > I've just been doing various bits of academic reading and writing, and
> > it has just struck me with a force bigger and mightier than ever
> > before: the importance of search. Such an important thing for TEXTUS
> > to get right.
> >
> > For example, being able to do things like see all the times Nietzsche
> > mentions Novalis. Or to find bits where Herder talks about the French
> > Revolution. Or to see who actually read or cited works by Frederick
> > the Great. Especially if we can enable people to do (ever more)
> > comprehensive searches across a given thinker's corpus. Having more
> > and more letters and manuscripts in the system would mean this could
> > be fantastically useful.
> >
> > It might be a trivial thing which we know how to flawlessly implement,
> > or it might be a really difficult, totally non-trivial thing that
> > loads of people have struggled with, but I thought it was worth putting
> > down my book and writing an email about due to the level of importance
> > I now think getting this right has. ;-)
> >
> > One possibly non-obvious thing I thought of was the idea that if you
> > search for 'Nietzsche' or another philosopher that we have data for in
> > a given text or collection, the system could cunningly give you the
> > option for searching for works by Nietzsche as well (or - two steps
> > ahead - ambiently give you the results of such a search). I'm sure
> > this would entail nightmarish semanticisation or technical acrobatics
> > beyond the scope of this project, but 'just sayin' how cool it would
> > be.
> >
> > J.
> >
> > --
> > Jonathan Gray
> > http://jonathangray.org
> >
> > _______________________________________________
> > open-humanities mailing list
> > open-humanities at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/open-humanities
>

-- 
Sam Leon
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Skype: samedleon
