[open-humanities] Forking Project Gutenberg to Github

Seth Woodworth seth at sethish.com
Wed Aug 20 14:31:38 UTC 2014


Ah, I wrote a section about that in my email but removed it for brevity.

> Is there an easy way of searching Gitenberg for sets of texts, e.g. for
> all the works by a particular author?

Not yet.

Metadata is my next focus for the project.  Right now, the only search
function is GitHub's repo search, which is insufficient.

Currently, metadata is stored in RDF/XML.  PG has a metadata file for each
book (which I'm also tracking with git here
<https://github.com/sethwoodworth/PG_rdf_metadata>).  I've put a copy of
the metadata file in each book repo as *pg<bookid>.rdf*.  This isn't ideal.
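
For illustration, here is a minimal sketch of reading the title and authors
out of one of these RDF/XML files with Python's standard library.  It assumes
the newer PG schema (dcterms/pgterms namespaces); the sample document, the
function name parse_book and the output field names are made up for this
example, and older files may be shaped differently.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed from the newer PG RDF schema.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written stand-in for a pg<bookid>.rdf file (not real PG data).
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:title>Example Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Doe, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_book(rdf_text):
    """Return a small dict with the id, title and author names of one book."""
    root = ET.fromstring(rdf_text)
    ebook = root.find("pgterms:ebook", NS)
    about = ebook.get("{%s}about" % NS["rdf"], "")
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    return {
        "id": about.rsplit("/", 1)[-1],  # "ebooks/0" -> "0"
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "authors": [name.text for name in names],
    }


book = parse_book(SAMPLE_RDF)
```

A real file carries many more fields (subjects, languages, file formats), so
this only shows the general shape of the parsing.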

I'm extending the recently released Python tool gutenberg
<https://bitbucket.org/c-w/gutenberg> to parse more fields of PG's RDF
schema. When that is finished, I can include a .json file in each book repo
and release one big JSON file of all the metadata.
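
As a rough sketch of that last step, assuming each book has already been
parsed into a plain dict: the function name write_metadata, the file names
pg<bookid>.json and catalog.json, and the record fields are all hypothetical,
and the per-book files land in a single directory here rather than in
separate repos.

```python
import json
from pathlib import Path


def write_metadata(records, out_dir):
    """Write one JSON file per book plus one combined catalog file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for record in records:
        # Per-book file, named after the book id (hypothetical convention).
        path = out_dir / "pg{}.json".format(record["id"])
        path.write_text(
            json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
        )

    # Combined file holding every record.
    catalog = out_dir / "catalog.json"
    catalog.write_text(
        json.dumps(records, indent=2, sort_keys=True), encoding="utf-8"
    )
    return catalog
```

Sorting keys keeps the output stable between runs, which makes the diffs
easier to read once the files are tracked in git.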

The next step is an API server and a simple search tool, so that people can
interact with the metadata without having to download >200 MB.
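
To make the author search concrete, here is a minimal sketch of the kind of
lookup such a tool could do over already-loaded records.  The record fields
("authors", "aliases" for variant spellings), the function name
search_by_author and the sample data are assumptions for this example, not an
existing API.

```python
def search_by_author(records, query):
    """Return records whose author names or aliases contain the query."""
    needle = query.casefold()
    matches = []
    for record in records:
        # Check canonical names and variant spellings alike.
        names = record.get("authors", []) + record.get("aliases", [])
        if any(needle in name.casefold() for name in names):
            matches.append(record)
    return matches


# Made-up records standing in for the parsed metadata.
RECORDS = [
    {"id": "0", "title": "Example Title", "authors": ["Doe, Jane"],
     "aliases": ["Doe, J."]},
    {"id": "1", "title": "Another Title", "authors": ["Roe, Richard"]},
]

hits = search_by_author(RECORDS, "doe")
```

A linear scan like this is fine for a few tens of thousands of records; a
real server would more likely sit on top of an index.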

> What metadata is there for each work, and what provision is there for
> adding to it?

Book title, author(s) (and variant spellings), Library of Congress Subject
Heading <https://github.com/sethwoodworth/LCC>, and some other metadata. PG
changed their schema recently and I'm not 100% sure how many schema variants
the 40k RDF files contain.  I will see if I can get an answer to that
question later today.

As far as adding new metadata goes, there are several decisions to make:
+ which is the canonical metadata file: .rdf, .json, or both?
+ what schema to use for arbitrary or specific new metadata?

I would prefer to make these choices as a community rather than on my own.
That said, I'm accepting most PRs that come in and keeping track of them
in case we need to migrate anything in the future.


P.S. Thanks for opening issues on GITenberg repos!


On Wed, Aug 20, 2014 at 5:48 AM, John Levin <john at anterotesis.com> wrote:

> Hello Gitenberg!
>
>
> On 19/08/2014 19:09, Seth Woodworth wrote:
>
>> Hello Humanities!
>>
>> I've been working on a project called GITenberg
>> <http://gitenberg.github.io>.
>>
>>
>> The aim is to move Project Gutenberg's books to github.
>>
>>
> <snip>
>
> This is a really interesting project, and one I hope could be adapted for
> other large collections of texts.
>
> Couple of quick questions:
> Is there an easy way of searching Gitenberg for sets of texts, e.g. for
> all the works by a particular author?
> What metadata is there for each work, and what provision is there for
> adding to it?
>
> Best,
>
> John
>
> --
> John Levin
> http://www.anterotesis.com
> http://twitter.com/anterotesis
>
> _______________________________________________
> open-humanities mailing list
> open-humanities at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/open-humanities
> Unsubscribe: https://lists.okfn.org/mailman/options/open-humanities
>