[OpenGLAM] 2.5 million public domain images now available.
bosteen at gmail.com
Mon Sep 1 09:19:15 UTC 2014
It is great to see someone take the same technique and apply it to a larger
digitised corpus! I just wanted to add something about why I didn't include
the OCR text that follows and precedes the image. When I was doing my random
sampling of the images, I did include the text around them. However, it
became clear that a large proportion of the images were positioned where
they were for printing or logistical reasons, not because the text next to
them was relevant. To give us a >2/3 chance of capturing the relevant text
(again, based on our set and random sampling), a large amount of text needed
to be included, and the age and place of publication had an influence on
where the relevant text was.
I'll be interested to see whether searches find relevant images based on
this text in a larger proportion than just incidental. Theirs is a set of
2.8mil+ images (and growing) and will show a differing bias.
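To make the trade-off concrete, here is a rough sketch in Python of the kind of check I mean. It is not the code I actually ran. The function names, the candidate window sizes and the idea of hand-labelling each sampled image with the distance (in words) to its nearest relevant text are all assumptions made up for illustration.

```python
from typing import Optional, Sequence


def surrounding_text(words: Sequence[str], image_pos: int, window: int) -> list:
    """Return up to `window` OCR words on each side of an image.

    `image_pos` is the (hypothetical) index of the first word after the image.
    """
    start = max(0, image_pos - window)
    end = min(len(words), image_pos + window)
    return list(words[start:end])


def coverage(distances: Sequence[Optional[int]], window: int) -> float:
    """Fraction of sampled images whose relevant text lies within `window` words.

    Each entry is a hand-labelled distance in words, or None when no
    relevant text was found for that image at all.
    """
    if not distances:
        return 0.0
    hits = sum(1 for d in distances if d is not None and d <= window)
    return hits / len(distances)


def smallest_window(distances: Sequence[Optional[int]],
                    target: float = 2 / 3,
                    candidates: Sequence[int] = (25, 50, 100, 200, 400, 800)):
    """Smallest candidate window whose coverage exceeds `target`, else None."""
    for w in candidates:
        if coverage(distances, w) > target:
            return w
    return None


# Made-up labels for seven sampled images, purely for illustration.
sample_distances = [10, 40, 150, None, 300, 350, 700]
needed = smallest_window(sample_distances)
```

With labels like these the window has to grow quite large before more than two thirds of the sample is covered, which is the effect I saw. Splitting the sample by age or place of publication before calling `smallest_window` would show how the answer shifts between groups.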
If anyone would like the OCR XML for the British Library collection to help
me explore how to find relevant text or abridge the section of the work (or
actually for any purpose), then contact me. It's a 230GB set, so not
something I have the resources to host here; it'll have to be sneakernet!
On 1 September 2014 09:53, Joris Pekel <jpekel at gmail.com> wrote:
> Hi everyone, welcome to September.
> This already got a lot of attention over the weekend, but I wanted to share
> it with you anyway because it is really great. A research fellow has been
> extracting over 2.5 million images from public domain books from the
> Internet Archive. By using the OCR text that surrounds the images, it is
> possible to search for keywords quite accurately. The metadata is of course
> not perfect, but I've already seen some Wikimedians talking about ways to
> improve this.
> One could also think about the methods that the British Library used for
> their 1 million public domain images, where they show you the 'least tagged'
> ones. This has resulted in every image being tagged at least once by now. See:
> For more information about the release see:
> open-glam mailing list
> open-glam at lists.okfn.org
> Unsubscribe: https://lists.okfn.org/mailman/options/open-glam