[ckan-dev] [ckan-discuss] Harvesting Dublin Core documents

Thu Nov 25 14:36:26 UTC 2010

* [2010-11-24 16:53:56 +0000] John Bywater <john.bywater at appropriatesoftware.net> wrote:

] I would like to adjust that method, to do something like:
] 
]     def read_values(self):
]         document_class = self.get_document_class()
]         document = document_class(self.content)
]         return document.read_values()
] 
]     def get_document_class(self):
]         if self.is_gemini_content():
]             document_class = GeminiDocument
]         elif self.is_rdf_content():
]             document_class = RdfDocument
]         else:
]             raise HarvesterError(
]                 "Can't identify document class from content: %s"
]                 % self.content)
]         return document_class

Depending on the details of the scenario, I would prefer not to guess
the document type from the content. Rather, when it is fetched we
should look at (and preserve) the content type and judge from that.

That's the "correct" way to do it, but I'm quite conscious of the fact
that remote servers cannot always be relied upon to set their HTTP
headers correctly, and we may even be harvesting from e.g. an FTP
site or a CD-ROM that doesn't have such headers...
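
As a rough sketch of that approach: choose the document class from the
Content-Type preserved at fetch time, and fall back to sniffing the
content only when no usable header is available. The function
signature, the mapping table and the placeholder classes below are
assumptions made for illustration, not existing harvester code. In
particular, only the RDF/XML media type is mapped here, on the
assumption that GEMINI XML may arrive under a generic XML type and so
has to go through the fallback.

```python
# Hypothetical sketch: every name and the media-type table are assumptions.

class HarvesterError(Exception):
    """Placeholder for the harvester's own error class."""


class GeminiDocument(object):
    """Placeholder standing in for the real GEMINI document class."""


class RdfDocument(object):
    """Placeholder standing in for the real RDF document class."""


# Assumed mapping from media type to document class.
CONTENT_TYPE_CLASSES = {
    "application/rdf+xml": RdfDocument,
}


def get_document_class(content_type, sniff=None):
    """Pick a document class from the preserved Content-Type header.

    content_type -- raw header value, or None when the source had no
                    headers at all (FTP, CD-ROM, ...).
    sniff        -- optional callable returning a document class or
                    None; used only when the header does not decide.
    """
    media_type = None
    if content_type:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        media_type = content_type.split(";", 1)[0].strip().lower()
    document_class = CONTENT_TYPE_CLASSES.get(media_type)
    if document_class is None and sniff is not None:
        document_class = sniff()
    if document_class is None:
        raise HarvesterError(
            "Can't identify document class (Content-Type: %r)" % content_type)
    return document_class
```

The content-based checks from the quoted method could be passed in as
the `sniff` callable, so they remain available but are consulted last.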

In any event, I would suggest that the different translators stay
separate. The "trigger" of a harvesting job should not assume which
programming language they are written in; it should instead interact
with them at the Unix command level. The trigger mechanism, either cron
or a queue listener, would be configured beforehand to say which script
to use for which type of data.
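
To illustrate, a queue listener written in Python might look roughly
like the sketch below. It knows only a configured command line per data
type and the exit status that comes back. The script paths, the data
type names and the function name are all made up for the example.

```python
import subprocess

# Hypothetical configuration: data type -> command line of the translator.
# The paths are invented; a translator can be written in any language.
HARVEST_COMMANDS = {
    "gemini": ["/usr/local/bin/harvest-gemini"],
    "rdf": ["/usr/local/bin/harvest-rdf"],
}


def run_harvest_job(data_type, source_url, commands=HARVEST_COMMANDS):
    """Run the translator configured for data_type as a separate process.

    Returns the exit status of the command. Only the argument list and
    the exit status are shared between the trigger and the translator.
    """
    try:
        command = commands[data_type]
    except KeyError:
        raise ValueError(
            "No translator configured for data type %r" % data_type)
    # Pass the source location as the last command-line argument.
    return subprocess.call(list(command) + [source_url])
```

With cron as the trigger, the same mapping would simply live in the
crontab, one entry per source, and no listener code would be needed.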

My £0.02

-w
-- 
William Waites
http://eris.okfn.org/ww/foaf#i
9C7E F636 52F6 1004 E40A  E565 98E3 BBF3 8320 7664