[wdmmg-dev] ETL FS update: source area tasks, dataset chunks

Stefan Urbanek stefan.urbanek at gmail.com
Wed Aug 3 15:07:50 UTC 2011


Hi,

First: I've converted the documentation to a Sphinx/reST project. It currently lives at https://github.com/Stiivi/openspending-doc but might be moved elsewhere later.

I am currently syncing the HTML to http://democracyfarm.org/f/openspending/doc. If you have a better place, that would be great, preferably one that is rsync-able.

The wiki is painful to edit, not to mention working with documents that contain images. We can also later get a nice PDF/EPUB of the OpenSpending documentation :-)

Anyway, here are the latest updates to the OpenSpending ETL functional specification:

* added a partial specification of how ETL tasks are run:

	http://democracyfarm.org/f/openspending/doc/etl/processes.html#task-run

* added a specification of the basic tasks for the source data area

	http://democracyfarm.org/f/openspending/doc/etl/processes.html#tasks

* introduced the notion of dataset chunks (after a discussion over Skype with Nick (borior) yesterday). A dataset chunk is the part of a dataset contained in a single (CSV) file or retrieved by a single request. Currently each dataset consists of one big chunk; in the future there might be smaller dataset additions (monthly/yearly/...).

	http://democracyfarm.org/f/openspending/doc/etl/dataset_lifecycle.html
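
To illustrate the chunk idea, here is a minimal Python sketch. It is only an illustration under my own assumptions: the class and field names (Dataset, DatasetChunk, source, period) and the file names are hypothetical and are not taken from the specification or from the openspending code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetChunk:
    """One part of a dataset: a single CSV file or a single request."""
    source: str        # hypothetical: path of the CSV file or URL of the request
    period: str = ""   # hypothetical: e.g. "2011-08" for a monthly addition


@dataclass
class Dataset:
    """A dataset is simply an ordered collection of chunks."""
    name: str
    chunks: List[DatasetChunk] = field(default_factory=list)

    def add_chunk(self, chunk: DatasetChunk) -> DatasetChunk:
        self.chunks.append(chunk)
        return chunk


# Current situation: the whole dataset is one big chunk.
dataset = Dataset("example")
dataset.add_chunk(DatasetChunk("example-full.csv"))

# Possible future: smaller additions, e.g. one chunk per month.
dataset.add_chunk(DatasetChunk("example-2011-08.csv", period="2011-08"))
```

The point of the sketch is only that a dataset with one chunk and a dataset with many chunks have the same shape, so existing single-file datasets would not need special handling.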


Implementing the source area tasks can be made compatible with the current state of openspending, without breaking anything built on top of it. This migration step is not drastic; it is just the next "salami slice" :-)

I would like to hear your opinions, suggestions, questions and worries.

Next logical steps would be:

1. implement the ETL task running environment and processes
2. implement the source area tasks (change existing tasks or reuse code if possible)

I can handle #1 once we all agree on the functional specification.

Left for the future: validation rules.

What do you say?

Stefan

http://databrewery.org