[open-humanities] Forking Project Gutenberg to Github

Rufus Pollock rufus.pollock at okfn.org
Wed Aug 20 17:40:41 UTC 2014


On 20 August 2014 15:31, Seth Woodworth <seth at sethish.com> wrote:

> Ah, I wrote a section about that in my email but removed it for brevity.
>
> > Is there an easy way of searching Gitenberg for sets of texts, e.g. for
> > all the works by a particular author?
>
> Not yet.
>
> Metadata is my next focus for the project.  Right now, the only search
> function is the github repo search, which is insufficient.
>

I wonder if we have a connection with Textus / OpenLiterature - the whole
idea of the recent incarnation was to *not* store the texts in the
service but have them live in flat files or equivalent somewhere. Plus we
always want to import Gutenberg stuff!

At the very least, the textus-viewer could be used for viewing the texts (if
we could add the relevant typography stuff).


> Now, metadata is stored in RDF/XML.  PG has a metadata file for each book
> (which I'm also tracking with git here
> <https://github.com/sethwoodworth/PG_rdf_metadata>).  I've put a copy of
> the metadata file in each book repo as *pg<bookid>.rdf* .  This isn't
> ideal.
>
> I'm extending the recently released python tool gutenberg
> <https://bitbucket.org/c-w/gutenberg> to parse more fields of PG's RDF
> schema. When that is finished, I can include a .json file in each book repo
> and release a big json file of all metadata.
>
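
To make the RDF-to-JSON step concrete, here is a rough sketch using only
the Python standard library (not the gutenberg tool mentioned above). The
sample RDF is hand-written and heavily simplified, the namespace URIs are
my assumption about what PG's files use, and the output field names (id,
title, authors) are hypothetical, so check all of it against a real
pg<bookid>.rdf before relying on it.

```python
import json
import xml.etree.ElementTree as ET

# Namespace URIs assumed to match PG's RDF files; verify against a real file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written, simplified sample; real files carry many more fields.
SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Austen, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>
"""


def rdf_to_record(rdf_text):
    """Extract a small dict (hypothetical schema) from one book's RDF."""
    root = ET.fromstring(rdf_text)
    ebook = root.find("pgterms:ebook", NS)
    about = ebook.get("{%s}about" % NS["rdf"])  # e.g. "ebooks/1342"
    title = ebook.findtext("dcterms:title", default="", namespaces=NS)
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    return {
        "id": about.rsplit("/", 1)[-1],
        "title": title,
        "authors": [n.text for n in names],
    }


record = rdf_to_record(SAMPLE_RDF)
print(json.dumps(record, indent=2))
```

The same function could be run over every metadata file to produce both
the per-repo .json file and the one big file.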

A simple JSON file would be very nice. It might be a stretch, but I wonder
if using the datapackage.json structure would be appropriate here - see
http://data.okfn.org/doc/data-package. After all, the text files seem a
natural fit as resources here.
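
As a rough illustration of what I mean, a per-book datapackage.json could
be generated along these lines. This is only a sketch: the "name", "title"
and "resources" keys (each resource with a "path") come from the Data
Package docs, while the make_datapackage helper, the "pg" + book id naming
and the 1342.txt file name are made up for the example.

```python
import json


def make_datapackage(book_id, title, files):
    """Build a minimal Data Package descriptor for one book repo.

    The helper and its naming scheme are hypothetical; only the
    name/title/resources layout follows the Data Package docs.
    """
    return {
        "name": "pg%s" % book_id,
        "title": title,
        # Each file in the repo becomes one resource, addressed by relative path.
        "resources": [{"path": path} for path in files],
    }


descriptor = make_datapackage(
    "1342",
    "Pride and Prejudice",
    ["1342.txt", "pg1342.rdf"],  # illustrative file names
)
print(json.dumps(descriptor, indent=2))
```

Book-level metadata (authors, subjects and so on) could then sit as extra
keys alongside these, whichever schema we settle on.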


> The next step is an API server and a simple search tool so that people can
> interact with the metadata without having to download >200 MB.
>
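
Until there is an API, a search by author over the combined JSON could be
as simple as the sketch below. The record layout (id, title, authors) is
hypothetical, since the schema is not decided yet, and the three records
are just illustrative stand-ins for the real metadata file.

```python
def find_by_author(records, query):
    """Return records with an author name containing `query` (case-insensitive)."""
    q = query.casefold()
    return [
        r for r in records
        if any(q in author.casefold() for author in r.get("authors", []))
    ]


# Illustrative records in a hypothetical schema, standing in for the big JSON file.
RECORDS = [
    {"id": "1342", "title": "Pride and Prejudice", "authors": ["Austen, Jane"]},
    {"id": "158", "title": "Emma", "authors": ["Austen, Jane"]},
    {"id": "2701", "title": "Moby Dick; Or, The Whale", "authors": ["Melville, Herman"]},
]

for book in find_by_author(RECORDS, "austen"):
    print(book["id"], book["title"])
```

Variant spellings of author names would need an extra list per record to
match against, but the idea is the same.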


> > What metadata is there for each work, and what provision is there for
> > adding to it?
>
> Book title, Author(s) (and variant spellings), Library of Congress
> Subject Heading <https://github.com/sethwoodworth/LCC>, and some other
> metadata. PG changed their schema recently and I'm not 100% sure how many
> variants the 40k rdf files contain.  I will see if I can get an answer to
> that question later today.
>
> As far as adding new metadata, there are several decisions to make:
> + which is the canonical metadata file, .rdf, .json or both?
> + what schema to use for arbitrary or specific new metadata
>

I'd go for JSON. I've suggested datapackage.json for the "container", but it
could be worth looking at the Textus work for suggestions on particular fields
(it cites BibJSON, though the BibJSON site seems to be down).


> I would prefer to make these choices as a community rather than on my own.
> That said, I'm accepting most PRs that come in and keeping
> track in case we need to migrate anything in the future.
>

Rufus


>
>
> P.S. Thanks for opening issues on GITenberg repos!
>
>
> On Wed, Aug 20, 2014 at 5:48 AM, John Levin <john at anterotesis.com> wrote:
>
>> Hello Gitenberg!
>>
>>
>> On 19/08/2014 19:09, Seth Woodworth wrote:
>>
>>> Hello Humanities!
>>>
>>> I've been working on a project called GITenberg
>>> <http://gitenberg.github.io>.
>>>
>>>
>>> The aim is to move Project Gutenberg's books to github.
>>>
>>>
>> <snip>
>>
>> This is a really interesting project, and one I hope could be adapted for
>> other large collections of texts.
>>
>> Couple of quick questions:
>> Is there an easy way of searching Gitenberg for sets of texts, e.g. for
>> all the works by a particular author?
>> What metadata is there for each work, and what provision is there for
>> adding to it?
>>
>> Best,
>>
>> John
>>
>> --
>> John Levin
>> http://www.anterotesis.com
>> http://twitter.com/anterotesis
>>
>> _______________________________________________
>> open-humanities mailing list
>> open-humanities at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/open-humanities
>> Unsubscribe: https://lists.okfn.org/mailman/options/open-humanities
>>


-- 

Rufus Pollock
Founder and President | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
Open Knowledge <http://okfn.org/> - see how data can change the world
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | Open Knowledge on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/>

The Open Knowledge Foundation is a not-for-profit organisation.  It is
incorporated in England & Wales as a company limited by guarantee, with
company number 05133759.  VAT Registration № GB 984404989. Registered
office address: Open Knowledge Foundation, St John’s Innovation Centre,
Cowley Road, Cambridge, CB4 0WS, UK.