[ckan-discuss] Harvesting Dublin Core documents

Wed Nov 24 16:53:56 GMT 2010

John Bywater wrote:
> William Waites wrote:
>> Sound reasonable?
> 
> Sounds very reasonable. ;-)

Narrowing this down, CKAN has the following method:

     def read_values(self):
         if "gmd:MD_Metadata" in self.content:
             gemini_document = GeminiDocument(self.content)
         else:
             raise HarvesterError(
                 "Can't identify type of document content: %s" % self.content)
         return gemini_document.read_values()

I would like to adjust that method, to do something like:

     def read_values(self):
         document_class = self.get_document_class()
         document = document_class(self.content)
         return document.read_values()

     def get_document_class(self):
         if self.is_gemini_content():
             document_class = GeminiDocument
         elif self.is_rdf_content():
             document_class = RdfDocument
         else:
             raise HarvesterError(
                 "Can't identify document class from content: %s" % self.content)
         return document_class

That is, it would be very useful to have a class that is constructed 
with an RDF string, which returns a CKAN Package dict from a 
read_values() method. All the harvesting machinery would then work with RDF.
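
To make the dispatch concrete, here is a rough, self-contained sketch. The 
host class name (HarvestedDocument), the placeholder document classes, and 
the "<rdf:RDF" substring test are all assumptions for illustration, not 
existing CKAN code; only the "gmd:MD_Metadata" test comes from the current 
method.

```python
class HarvesterError(Exception):
    """Stand-in for the harvester's exception class (name taken from the snippet)."""


class GeminiDocument(object):
    # Placeholder only: the real class parses GEMINI metadata.
    def __init__(self, content):
        self.content = content

    def read_values(self):
        return {"title": "(gemini placeholder)"}


class RdfDocument(object):
    # Placeholder only: this is the class being proposed, not existing code.
    def __init__(self, content):
        self.content = content

    def read_values(self):
        return {"title": "(rdf placeholder)"}


class HarvestedDocument(object):
    # Hypothetical host class for the two proposed methods.
    def __init__(self, content):
        self.content = content

    def is_gemini_content(self):
        # Same substring test as the current read_values().
        return "gmd:MD_Metadata" in self.content

    def is_rdf_content(self):
        # Assumption: RDF/XML input using the conventional "rdf:" prefix.
        return "<rdf:RDF" in self.content

    def get_document_class(self):
        if self.is_gemini_content():
            return GeminiDocument
        elif self.is_rdf_content():
            return RdfDocument
        # Truncate the content so a large document does not flood the message.
        raise HarvesterError(
            "Can't identify document class from content: %s" % self.content[:100])

    def read_values(self):
        document_class = self.get_document_class()
        return document_class(self.content).read_values()
```

A substring test is crude (it would miss RDF serialised as Turtle or N3, or 
RDF/XML with a different prefix), so a real is_rdf_content() would probably 
want to try parsing instead.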

That class could process the RDF "programmatically either with SPARQL or 
directly according to the library or bindings that you are using."
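
As a minimal sketch of the "directly" option, the class below reads a few 
Dublin Core elements out of RDF/XML using only the standard library's 
ElementTree. The field mapping (dc:title to "title", dc:description to 
"notes", dc:subject to "tags", rdf:about to "url") is my assumption about 
what would be useful, and treating the first rdf:Description as the package 
is a simplification. RDF/XML has several equivalent serialisations, so a 
real implementation would more likely sit on a proper RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


class RdfDocument(object):
    """Hypothetical sketch: RDF/XML string in, Package-style dict out."""

    def __init__(self, content):
        self.content = content

    def read_values(self):
        root = ET.fromstring(self.content)
        # Assumption: the first rdf:Description node describes the package.
        node = root.find("{%s}Description" % RDF_NS)
        if node is None:
            raise ValueError("No rdf:Description element found")

        def text(tag):
            # Return the stripped text of the first dc:<tag> child, or None.
            value = node.findtext("{%s}%s" % (DC_NS, tag))
            return value.strip() if value else None

        subjects = node.findall("{%s}subject" % DC_NS)
        return {
            "title": text("title"),
            "notes": text("description"),
            "url": node.get("{%s}about" % RDF_NS),
            "tags": [el.text.strip() for el in subjects if el.text],
        }


# Made-up record, for illustration only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/dataset/1">
    <dc:title>Example dataset</dc:title>
    <dc:description>A made-up record.</dc:description>
    <dc:subject>transport</dc:subject>
    <dc:subject>roads</dc:subject>
  </rdf:Description>
</rdf:RDF>"""

values = RdfDocument(SAMPLE).read_values()
```

The constructor and read_values() signatures match GeminiDocument as used 
above, so the harvester would not need to know which kind of document it 
has been given.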

Such a class could be used on either side of the API: CKAN's harvester 
talks to the catalogue model via the presentation layer, the presentation 
layer works with CKAN Package dicts, and the CKAN API simply exposes that 
presentation model at the system boundary.

J.