[wdmmg-dev] ETL FS update: source area tasks, dataset chunks
Stefan Urbanek
stefan.urbanek at gmail.com
Wed Aug 3 15:07:50 UTC 2011
Hi,
First: I've changed the documentation to a Sphinx/reST project. It is currently here: https://github.com/Stiivi/openspending-doc but might be moved somewhere else later.
I am currently syncing the HTML here: http://democracyfarm.org/f/openspending/doc. If you have a better place, that would be great, preferably rsync-able.
The wiki is painful to edit, not to mention working with documents that contain images. Also, we can later get a nice PDF/EPUB of the OpenSpending documentation :-)
Anyway, here are the latest updates to the OpenSpending ETL functional specification:
* added some specification of ETL task running:
http://democracyfarm.org/f/openspending/doc/etl/processes.html#task-run
* added specification of basic tasks for the source data area:
http://democracyfarm.org/f/openspending/doc/etl/processes.html#tasks
* introduced the notion of dataset chunks (after a discussion over Skype with Nick (borior) yesterday). A dataset chunk is the part of a dataset contained in a single (CSV) file or retrieved by a single request. Currently each dataset consists of one big chunk; in the future there might be smaller additions to a dataset (monthly/yearly/...):
http://democracyfarm.org/f/openspending/doc/etl/dataset_lifecycle.html
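To illustrate the chunk notion, here is a minimal Python sketch. It is only an assumption of how it could be modelled: the names Dataset, DatasetChunk, source, period and add_chunk are hypothetical and are not taken from the specification or from the existing OpenSpending code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetChunk:
    """Part of a dataset: one (CSV) file or the result of one request."""
    source: str                   # hypothetical: file path or request URL
    period: Optional[str] = None  # hypothetical label, e.g. "2011-08"


@dataclass
class Dataset:
    """A dataset is an ordered collection of chunks."""
    name: str
    chunks: List[DatasetChunk] = field(default_factory=list)

    def add_chunk(self, chunk: DatasetChunk) -> DatasetChunk:
        # Later additions (monthly/yearly/...) are appended as new chunks.
        self.chunks.append(chunk)
        return chunk


# Today: a dataset consists of one big chunk.
dataset = Dataset(name="example-dataset")
dataset.add_chunk(DatasetChunk(source="example-full.csv"))

# In the future: a smaller monthly addition becomes another chunk.
dataset.add_chunk(DatasetChunk(source="example-2011-08.csv", period="2011-08"))
```

With this shape, the current "one big chunk" case and later incremental additions go through the same code path.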
Implementing the source area tasks can be made compatible with the current state of OpenSpending, without breaking anything built on top of it. This migration step is not drastic; it is just the next "salami slice" :-)
I would like to hear your opinions, suggestions, questions and worries.
The next logical steps would be:
1. implement the ETL task running environment and processes
2. implement the source area tasks (change existing tasks or reuse code if possible)
I can handle #1 once we all agree on the functional specification.
Left for the future: validation rules.
What do you say?
Stefan
http://databrewery.org