[wdmmg-dev] Data management ctd.

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Mon Oct 24 21:26:05 UTC 2011


Hi all,

this is an attempt to summarize the RL bits of the ongoing discussion
regarding data management (both with MK and Rufus). This text is
intentionally trolling.

Analysis:

- Contextualization is key: The loading process as it stands requires
a lot of context - both regarding the interplay between OpenSpending,
OpenSpending.ETL and CKAN and regarding the actual transformation and
modeling required on the data.

The current documentation is split 40/60 between explaining systems
integration and data wrangling; we should move to 5/95. Which means:
writing much more on data wrangling. But: this is also where
refactoring the loading system comes in. Having a three-system jumping
game means much needs to be explained, and it's hard to put the right
docs in the right place. There is no information at all in CKAN re the
process, and OS/OS.ETL are both only partially responsible (i.e.
cannot fully guide users through the process).

- The CKAN metadata is not the OpenSpending metadata and any attempt
to fake this will fail. OpenSpending metadata is only likely to grow
and already requires special validation rules which are not
implemented in CKAN (e.g. currency, double underscores in dataset
names, ...).

- Storing the model in CKAN is wrong and leads to more problems than
it solves: we cannot migrate it, validation happens late, association
is fuzzy. This is necessarily part of the OS domain model and should
be managed internally, exposed only via the model editor and a
specific REST API.
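
As an illustration of the OS-specific validation mentioned above, here is
a minimal Python sketch. It is not the existing OpenSpending code: the
function name, the metadata keys and the short currency list are all
assumptions, and the double-underscore rule is read as "dataset names
must not contain them".

```python
# Illustrative subset of ISO 4217 codes; a real check would use the full list.
KNOWN_CURRENCIES = {"EUR", "GBP", "USD"}


def validate_metadata(meta):
    """Return a list of (field, message) problems; an empty list means valid.

    `meta` is a plain dict with hypothetical keys "name" and "currency".
    """
    errors = []

    name = meta.get("name") or ""
    if not name:
        errors.append(("name", "dataset name is required"))
    elif "__" in name:
        # Assumed rule: double underscores are not allowed in dataset names.
        errors.append(("name", "dataset name must not contain double underscores"))

    currency = (meta.get("currency") or "").upper()
    if currency not in KNOWN_CURRENCIES:
        errors.append(("currency", "unknown currency code: %r" % meta.get("currency")))

    return errors
```

Running such checks inside OS when the metadata is entered would surface
problems early, rather than late in the load.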

Solution:

- Move most of the metadata management back to OS: basic metadata,
model editor, [views editor].

- Provide context-sensitive documentation wherever conceivable.
Documentation lives in OpenSpending, not the wiki, to force us to
"productize" and fine-tune docs (rather than wiki-style perpetual
alpha). Also: stable URLs.

- Support Google Docs sample data as a wrangling pattern.

Open Questions:

- As CKAN moves towards awesome, will OS have to replicate? Especially
re transformation tools?

- How do we reference data: by CKAN resource, download URL or direct
upload to OS? (CKAN res was last state but download URL seems more
flexible)

- Priority of documentation vs. refactoring (personal opinion:
refactoring blocks documentation)

- How do we get the loading process to generate far richer warnings?
IMO capturing stdout on etld is just not enough; we need structured
feedback (failing row data, error message, error components, links to
docs re error components).
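
To sketch what structured feedback from the loader could look like, here
is a small Python example. Everything in it is hypothetical: the
LoadIssue record, the check_rows function, the "amount" column and the
docs URL prefix are made up for illustration and are not part of etld.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class LoadIssue:
    """One structured warning produced while loading a row."""
    row_number: int                 # 1-based position in the source data
    row: dict                       # the failing row data
    message: str                    # human-readable error message
    components: list = field(default_factory=list)  # offending columns
    docs_url: str = ""              # link to docs about this kind of error


def check_rows(rows, docs_base="/help/errors/"):
    """Collect issues for all rows instead of stopping at the first failure.

    `docs_base` is an assumed URL prefix, not an existing OS route.
    """
    issues = []
    for number, row in enumerate(rows, start=1):
        try:
            float(row["amount"])
        except (KeyError, TypeError, ValueError):
            issues.append(LoadIssue(
                row_number=number,
                row=row,
                message="amount is missing or not a number",
                components=["amount"],
                docs_url=docs_base + "amount",
            ))
    return issues


if __name__ == "__main__":
    sample = [{"amount": "10.5"}, {"amount": "n/a"}, {}]
    for issue in check_rows(sample):
        # JSON records can be stored and rendered in a UI, unlike raw stdout.
        print(json.dumps(asdict(issue)))
```

The point of the record is that each field can be displayed, filtered
and linked separately, which captured stdout does not allow.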

RfC,

 - Friedrich



