[datacatalogs] Assessment of open data in Germany
matt.fullerton at gmail.com
Wed Oct 8 12:28:00 UTC 2014
cc catalogs list...
On 8 October 2014 14:25, Matthew Fullerton <matt.fullerton at gmail.com> wrote:
> Hi labs,
> A while back Rufus asked me to update the list on some work going on at
> OKF in Germany. Until the end of the year we are working on a project
> called 'Open Data Monitor' with university colleagues in Bremen. The goal
> is to assess how far German villages, towns, cities, states and the
> central government have come in opening up data:
> - Is it happening at all?
> - Is the data that is officially 'open' the end of the story or are there
> other departments publishing data independently (business as usual, so to
> speak)?
> - How easy is it to find data on a regional basis?
> A large part of the project work (at our end) is extracting
> (semi-)automatically what (open) data is on the websites of various
> authorities for evaluation. Hence, a byproduct of the project is a large
> collection of metadata from many catalogs that have metadata themselves
> (e.g. CKAN portals) as well as web crawls for typical 'data' formats, the
> results of which are reviewed by human beings. A more experimental part was
> also to process Google and Bing search results for 'data' formats.
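> As a concrete sketch of the (semi-)automatic extraction from portals that
> carry their own metadata (e.g. CKAN): the package_search action below is a
> documented CKAN API endpoint, but the paging helper and the fields pulled
> out are assumptions for illustration, not the project's actual code.

```python
# Sketch of walking a CKAN portal's datasets via the package_search action.
# The portal URL is illustrative; real portals differ in page-size limits
# and in which metadata fields they actually fill in.
import json
import urllib.request

def search_url(portal, rows, start):
    """Build a paged package_search URL for a CKAN portal."""
    return ("%s/api/3/action/package_search?rows=%d&start=%d"
            % (portal.rstrip("/"), rows, start))

def extract_resources(dataset):
    """Pull (title, format, url) triples out of one CKAN dataset dict."""
    return [(dataset.get("title"), r.get("format"), r.get("url"))
            for r in dataset.get("resources", [])]

def harvest(portal, rows=100):
    """Yield every dataset dict the portal reports, page by page."""
    start = 0
    while True:
        with urllib.request.urlopen(search_url(portal, rows, start)) as resp:
            page = json.load(resp)["result"]["results"]
        if not page:
            return
        for dataset in page:
            yield dataset
        start += len(page)
```
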
> As this is the labs list, I'll try and stick to the technical
> insights/challenges and skip the political ones (if desired, I can make a
> separate summary of that, as so far most documentation is in German):
> - Google searches are very limited in what file formats you can specify,
> and there is no suitable API from Google for obtaining search results
> - Bing (yes, Bing!) has an API. The documentation is terrible, but it has
> one. It's not free, but a large number of searches are included at no cost.
> And you can search by file extension and domain, so it was exactly what we
> needed.
> - What is lacking from both search engines is contextual information that
> is crucial for constructing metadata: particularly whether the result is
> still linked to and if so, where from.
> - Web crawling is slow and tedious. Well, no insights there. But it does
> - Every catalog and (almost) every CKAN is different and can't be read
> from in the same way. Probably no insights there either.
> - German open data portals are very heterogeneous: across around 14 portals
> we have CKAN, DKAN, terraCatalog and custom solutions integrated into
> existing websites. Providing a complete read-out or dump of the data seems
> to be a high priority for some and not for others.
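> To illustrate the 'every catalog is different' point: before reading a
> portal, one can at least probe whether it answers CKAN's API at all (DKAN
> imitates parts of the CKAN API, so a positive probe is only a first hint
> and still needs per-portal handling behind it). The endpoint path is
> CKAN's documented one; everything else here is a sketch, not our harvester.

```python
# Rough probe: does this portal respond like a CKAN-compatible API?
import json
import urllib.request

CKAN_PROBE = "/api/3/action/site_read"  # documented CKAN action

def probe_url(portal):
    """Build the probe URL for a portal's CKAN site_read action."""
    return portal.rstrip("/") + CKAN_PROBE

def looks_like_ckan(portal, timeout=10):
    """Return True if the portal answers site_read with success=true."""
    try:
        with urllib.request.urlopen(probe_url(portal), timeout=timeout) as resp:
            return json.load(resp).get("success") is True
    except Exception:
        # Network errors, 404s and non-JSON replies all mean "not CKAN-like".
        return False
```
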
> For the last few months of the project, I will be working on consolidating and automating
> what is still mostly catalog-specific, manually-triggered code into a
> common library, and creating a DB-based infrastructure for hosting crawled
> data for review. After some discussion with Friedrich, I'm also going to
> try to index the contents of the actual files available so that they can be
> searched as well as the metadata.
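> One way that file-content indexing could look, purely as a sketch under
> assumed schema and formats (not the project's actual infrastructure):
> extract each CSV's header row and store the column names next to the
> dataset title in SQLite, so a keyword search hits both metadata and data.

```python
# Index CSV column names alongside catalog metadata for combined search.
import csv
import io
import sqlite3

def index_csv(conn, dataset_title, url, csv_text):
    """Store a CSV's column names so they are searchable with the metadata."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    conn.execute("CREATE TABLE IF NOT EXISTS columns "
                 "(dataset TEXT, url TEXT, colname TEXT)")
    conn.executemany("INSERT INTO columns VALUES (?, ?, ?)",
                     [(dataset_title, url, col) for col in header])

def search(conn, term):
    """Find datasets whose title or column names match a search term."""
    like = "%" + term + "%"
    return conn.execute(
        "SELECT DISTINCT dataset, url FROM columns "
        "WHERE dataset LIKE ? OR colname LIKE ?", (like, like)).fetchall()
```
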
> The whole thing will be made available on a CKAN instance. The current
> client-based prototype overview of the data is reaching its limits.
> If you'd like to take a look or get involved, the code is at
> github.com/okfde/odm-datenerfassung, and the client is at
> github.com/okfde/odmonitormap. The code is changing frequently. There may
> be the possibility to pay a further Python developer until the end of 2014
> for anyone excited about this kind of thing with time on their hands.
> Looking forward to questions and criticisms. I just want to stress that
> the chief aim of the project is NOT to create one massive catalog to 'rule
> them all'; we are just trying to see the most sensible way to
> archive the working data of the project so that it's useful and sustainable.