[ckan-dev] Harvesting improvement ideas for ArcGIS

Thu Dec 10 03:45:29 UTC 2015

So ESRI found a way to break the flow of data leaving their ecosystem again.

I haven't run into this problem, as I haven't harvested data sources
external to ArcGIS Server through the ArcGIS Server REST API.
In addition, as I'm harvesting into a custom metadata schema, the default
harvesting behaviour of creating "extra" fields conflicts with
ckanext-scheming's implementation of custom dataset fields (no populated
"extra" fields allowed). My approach was to script the harvesting
(workbooks *here <https://github.com/datawagovau/harvesters>*) until our
ever-changing harvesting requirements stabilise, and only then to
customise the harvester extensions. Of course, any change in the input
schema (e.g. parsing extra metadata out of Esri's "description" field) or
in the custom schema on CKAN would require maintenance of the custom
harvester extension.
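
To illustrate the "extra" fields conflict, here is a minimal sketch of the
kind of step a script can apply before writing a dataset to CKAN. It assumes
the harvested dict carries its extras as the usual list of key/value pairs;
the function name promote_extras and the schema_fields set are made up for
this example and are not part of ckanext-harvest or ckanext-scheming.

```python
def promote_extras(dataset, schema_fields):
    """Move harvested extras that match custom schema fields to the top level.

    dataset: a CKAN-style package dict, possibly with an "extras" list of
             {"key": ..., "value": ...} dicts.
    schema_fields: hypothetical set of field names defined in the custom schema.
    Returns (cleaned_dataset, leftover_extras).
    """
    # Copy everything except "extras", so the result has no populated extras.
    result = {k: v for k, v in dataset.items() if k != "extras"}
    leftover = []
    for extra in dataset.get("extras", []):
        key, value = extra["key"], extra["value"]
        if key in schema_fields:
            # Do not overwrite a value the dataset already has at top level.
            result.setdefault(key, value)
        else:
            # Unknown keys are handed back so the caller decides what to do.
            leftover.append(extra)
    return result, leftover
```

Anything returned in leftover_extras still needs a home in the schema (or
has to be dropped) before the dataset is saved.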

Would the harvester's message queue be of any use to create a sub queue for
the harvesting process?
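
To make the question concrete, here is a rough sketch of the sub queue idea
from the quoted message, in plain Python rather than the ckanext-harvest
API. The "generating" key comes from the sample response quoted below; the
"error" key, the function names classify_response and drain, and the
max_attempts parameter are assumptions for illustration only.

```python
import json
from collections import deque


def classify_response(body):
    """Return 'ready', 'generating' or 'error' for a downloaded resource body."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, so presumably the real CSV or KML file.
        return "ready"
    if isinstance(payload, dict) and "generating" in payload:
        return "generating"  # placeholder while the cache is being built
    if isinstance(payload, dict) and "error" in payload:
        return "error"  # assumed shape of an upstream failure
    return "ready"  # e.g. a real GeoJSON document


def drain(pending, fetch, max_attempts=5):
    """Re-request 'generating' URLs until ready, errored or out of attempts.

    fetch is any callable taking a URL and returning the response body.
    """
    done, failed = {}, []
    queue = deque((url, 1) for url in pending)
    while queue:
        url, attempt = queue.popleft()
        body = fetch(url)
        status = classify_response(body)
        if status == "ready":
            done[url] = body
        elif status == "generating" and attempt < max_attempts:
            queue.append((url, attempt + 1))  # back of the sub queue
        else:
            failed.append(url)
    return done, failed
```

A real run would wait between attempts rather than re-requesting
immediately, and the retries would live in the harvester's own queue
instead of an in-memory deque.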

Cheers,
Florian

On Thu, Dec 10, 2015 at 11:00 AM, Steven De Costa <
steven.decosta at linkdigital.com.au> wrote:

> Heya folks,
>
> We are looking at the best way to close the gap on a harvest workflow
> between CKAN and ArcGIS.
>
> If you look at the info on the ArcGIS site here:
> http://doc.arcgis.com/en/open-data/provider/federating-with-ckan.htm
>
> Then, the problem we want to solve is mentioned at the end of the page.
> That is:
> *"Note: You may notice some strange behavior the first time you try to
> preview a CSV or JSON file. Open Data is generating a cache of this data
> and CKAN does not know how to handle this case when the data is processing.
> This will not occur again the next time you try to preview the file."*
>
> The effect of this for a site that actually wants to harvest the resource
> rather than just store the resource URL is that we end up getting a
> response when requesting something like a CSV or GeoJSON that looks like
> this:
> {"processing_time":"0.005 seconds","count":1,"generating":{"progress":"100%","start":144215,"csv":"generated","geojson":true,"kml":true}
>
> We have decided that we'll need to parse such results and create a sub
> queue that can be rerun to gather actual files when the generation of the
> file is complete. We have found that if ArcGIS is itself harvesting from a
> third source in real time then the end result might be an error, so we also
> need to handle such errors in both the main queue and the sub queue.
>
> I'm not really posting this as a question, but it would be great to know
> if anyone has already built this kind of process extension to harvesting.
> We'd rather use an approach that is commonly used in such cases than create
> a whole new approach :)
>
> Any thoughts or pointers?
>
> Cheers,
> Steven
>
>
> *STEVEN DE COSTA* | *EXECUTIVE DIRECTOR*
> www.linkdigital.com.au
>