[ckan-dev] [ckan-discuss] Harvesting Dublin Core documents

Thu Nov 25 15:23:21 UTC 2010

William Waites wrote:
> 
> Depending on the details of the scenario, I would prefer not to guess
> the document type from the content. Rather when it is fetched we
> should look at (and preserve) the content type and judge from that. 
> 
> That's the "correct" way to do it but I'm quite conscious of the fact
> that remote servers may or may not be relied upon to set their HTTP
> headers correctly, and we may even be harvesting from e.g. an FTP
> site or a CDROM that doesn't have such headers...
> 
> In any event, I would suggest that the different translators stay
> different. The "trigger" of a harvesting job should not assume the
> programming language they're written in and instead interact with them
> at the unix command level. The trigger mechanism, either cron or a
> queue listener, would be configured beforehand to say what script to
> use according to what type of data.
> 
> My £0.02
> 

Thanks. Let's keep this discussion open. I'm not entirely sure what you 
mean in the last paragraph. CKAN is written in Python, which could run 
on Windows (or something else). That is, there might not be a unix 
command level available. But do tell me more about this if you'd like to.

On the subject of content type detection: what I did is lame, but it's 
probably adequate for the remaining seconds of 2010. Guessing the 
document type from the content, especially the way it happens at the 
moment ("if 'gmd:MD_Metadata' in self.content"), isn't going to be very 
robust against cases where the tested fragment appears within a 
different document type. That is, I'm sure we can all think of 
documents which have "gmd:MD_Metadata" which aren't GEMINI 2 documents 
encoded with ISO19139. :-)
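
To make the weakness concrete, here is a minimal sketch, not the actual
harvester code. The function names and the sample records are
hypothetical; only the substring test comes from the current code. It
compares that test with a check on the parsed root element, which is
still no proof of a valid GEMINI 2 document, just a narrower guess.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "gmd" prefix in ISO 19139 documents.
GMD_NS = "http://www.isotc211.org/2005/gmd"


def looks_like_gmd_by_substring(content):
    # The current quick test: true wherever the fragment occurs.
    return "gmd:MD_Metadata" in content


def looks_like_gmd_by_root(content):
    # Hypothetical alternative: parse and inspect only the root element.
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return False
    return root.tag == "{%s}MD_Metadata" % GMD_NS


# A Dublin Core style record that merely mentions the fragment in text.
dc_record = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:description>Converted from gmd:MD_Metadata</dc:description>"
    "</metadata>"
)
gemini_record = '<gmd:MD_Metadata xmlns:gmd="%s"/>' % GMD_NS
```

With these samples the substring test accepts dc_record, while the root
element check accepts only gemini_record.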

So it would be good to discuss content/document type detection 
strategies. As you say, the system could look at an HTTP header to get 
and preserve the content type (but we can't rely on that). Another 
option is to pass the content into a series of document validators, and 
use the first one that works. Another is to pass it into all the 
validators and score the results somehow, perhaps according to which 
treatment gives the most CKAN package values. Of course, since there 
were lots of other things to do, I just did "the simplest thing that 
could possibly work" on the day, and then I moved on.
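
The three strategies above could be sketched roughly as follows. This is
only an illustration under assumptions: every name here (detect,
detect_first_match, detect_by_score, known_types, the validator and
translator callables) is hypothetical and nothing like it exists in the
code yet. A "validator" is assumed to return True or False, and a
"translator" is assumed to return a dict of CKAN package values.

```python
def detect_first_match(content, validators):
    """Return the name of the first validator that accepts the content."""
    for name, validate in validators:
        if validate(content):
            return name
    return None


def detect_by_score(content, translators):
    """Run every translator; pick the one yielding most non-empty values."""
    best_name, best_score = None, 0
    for name, translate in translators:
        try:
            package = translate(content)
        except Exception:
            continue  # a translator that fails simply scores nothing
        score = sum(1 for value in package.values() if value)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def detect(content, content_type, known_types, validators):
    """Trust a recognised Content-Type header, else fall back to validators."""
    if content_type in known_types:
        return known_types[content_type]
    return detect_first_match(content, validators)
```

The header is consulted first but never relied on: a missing or
unrecognised content type (an FTP site, a CDROM) just falls through to
the validators. On a tie, detect_by_score keeps the earlier translator.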

In other words, it's not an accident, or the final statement, or even 
something I strained myself over. It's deliberate technical debt. The 
code is open to suggestions, and so am I.

J.

> -w