[ckan-discuss] Harvesting Dublin Core documents

John Bywater john.bywater at appropriatesoftware.net
Wed Nov 24 16:53:56 GMT 2010


John Bywater wrote:
> William Waites wrote:
>> Sound reasonable?
> 
> Sounds very reasonable. ;-)

Narrowing this down, CKAN has the following method:

     def read_values(self):
         if "gmd:MD_Metadata" in self.content:
             gemini_document = GeminiDocument(self.content)
         else:
             raise HarvesterError(
                 "Can't identify type of document content: %s" % self.content)
         return gemini_document.read_values()


I would like to adjust that method, to do something like:

     def read_values(self):
         document_class = self.get_document_class()
         document = document_class(self.content)
         return document.read_values()

     def get_document_class(self):
         if self.is_gemini_content():
             document_class = GeminiDocument
         elif self.is_rdf_content():
             document_class = RdfDocument
         else:
             raise HarvesterError(
                 "Can't identify document class from content: %s" % self.content)
         return document_class
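
To make the dispatch concrete, here is a minimal, self-contained sketch of how the two detection helpers might look. The Gemini test is the one CKAN's current method already uses. Everything else is an assumption for illustration: the "rdf:RDF" substring test, the HarvestedDocument name, and the stub document classes with their dummy return values are not taken from CKAN.

```python
class HarvesterError(Exception):
    """Stand-in for CKAN's harvester exception (assumed to be an Exception)."""


class GeminiDocument(object):
    """Stub for illustration only; the real class parses the XML."""

    def __init__(self, content):
        self.content = content

    def read_values(self):
        return {"format": "gemini"}  # dummy value, not real output


class RdfDocument(object):
    """Stub for the proposed class; it does not exist yet."""

    def __init__(self, content):
        self.content = content

    def read_values(self):
        return {"format": "rdf"}  # dummy value, not real output


class HarvestedDocument(object):
    """Hypothetical holder for fetched content (the name is an assumption)."""

    def __init__(self, content):
        self.content = content

    def is_gemini_content(self):
        # Same substring test as the existing read_values() method.
        return "gmd:MD_Metadata" in self.content

    def is_rdf_content(self):
        # Assumption: RDF/XML using the conventional "rdf" prefix.
        return "rdf:RDF" in self.content

    def get_document_class(self):
        if self.is_gemini_content():
            return GeminiDocument
        elif self.is_rdf_content():
            return RdfDocument
        raise HarvesterError(
            "Can't identify document class from content: %s" % self.content)

    def read_values(self):
        document_class = self.get_document_class()
        return document_class(self.content).read_values()
```

A substring test is crude (it would miss RDF/XML that binds a different prefix, and any non-XML serialisation), but it matches the way the Gemini check already works.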


That is, it would be very useful to have a class that is constructed 
with an RDF string and whose read_values() method returns a CKAN 
Package dict. All the harvesting machinery would then work with RDF.

That class could process the RDF "programmatically either with SPARQL or 
directly according to the library or bindings that you are using."
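
As a rough illustration of the shape such a class could take, here is a sketch that reads a simple Dublin Core record in RDF/XML using only the standard library. It is not a real RDF parser: it assumes a single rdf:Description element with dc: child elements, and a proper implementation would go through an RDF library or SPARQL as described above. The mapping from Dublin Core elements to package keys (title, notes, author, url, tags) is my assumption, and the sample record is made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


class RdfDocument(object):
    """Sketch: constructed with an RDF/XML string, returns a package-like dict."""

    def __init__(self, content):
        self.content = content

    def read_values(self):
        root = ET.fromstring(self.content)
        # Assumption: the first rdf:Description describes the dataset.
        description = root.find("{%s}Description" % RDF_NS)
        if description is None:
            raise ValueError("No rdf:Description element found")

        def first(tag):
            # Text of the first dc:<tag> child, or None if absent or empty.
            element = description.find("{%s}%s" % (DC_NS, tag))
            if element is None or not element.text:
                return None
            return element.text.strip()

        subjects = description.findall("{%s}subject" % DC_NS)
        return {
            "title": first("title"),
            "notes": first("description"),
            "author": first("creator"),
            "url": description.get("{%s}about" % RDF_NS),
            "tags": [s.text.strip() for s in subjects if s.text],
        }


# Made-up record, for illustration only.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/dataset/1">
    <dc:title>Example dataset</dc:title>
    <dc:description>Made-up record for illustration.</dc:description>
    <dc:creator>A. N. Author</dc:creator>
    <dc:subject>transport</dc:subject>
    <dc:subject>roads</dc:subject>
  </rdf:Description>
</rdf:RDF>"""

package = RdfDocument(sample).read_values()
```

Because the class has the same constructor and read_values() interface as GeminiDocument, the harvester would not need to know which kind of document it is holding.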

It could be used on either side of the API. CKAN's harvester talks to 
the catalogue model via the presentation layer, CKAN's presentation 
layer has CKAN Package dicts, and the CKAN API just exposes that 
presentation model on the system boundary.

J.




