[wdmmg-dev] ETL Manager prototype

Stefan Urbanek stefan.urbanek at gmail.com
Mon Aug 8 22:29:21 UTC 2011


Hi,

I've created a prototype for the ETL manager (temporary repository, before integration):

https://github.com/Stiivi/openspending.etlm

The manager dependencies are minimal: requires only PostgreSQL. Instead of Paster, there is light-weighted custom Task running tool. For more information about task, see below.

In DB it will use four schemas:

* etl – for etl control structures, such as task_run
* metadata – for dataset metadata (not yet used)
* source – for dataset source data (not yet used)
* ods – operational data store - for stars and snowflakes (not yet used)

TASK EXECUTION

Task 'run' method is wrapped and the single task run is as follows:

1. create and initialize task_run record – set task name, start time, ….
2. save the task_run record into DB
3. log message about task being run
4. try to run the task
5. if exception occures during task run, then:
	5.1. log information about exception
	5.2 log with 'debug' severity full exception stack trace
	5.3 set task status as failed
	5.4. save task_run record
6. store task end time, compute elapsed time for log purposes
7. store task_run record
8. log information that the task finished

Task_run record is written to DB whenever task status changes. I do not think that we need whole status change history, just last status and message to know where and why it failed.

With this we have unified auditable task execution process.

Example use:

$ osetl dummy test
2011-08-08 23:03:59,184 Running task dummy
2011-08-08 23:03:59,193 this is dummy task: test None None
2011-08-08 23:03:59,197 Task dummy finished. Status: ok Elapsed time: 0:00:00.007472

See attachment for a sshot from DB or go here if mailinglist is blicking attachments:

	http://cl.ly/2P3O080Y240T2O0r2d1L

The task run ID might be used as reference in other logs or in tables where issues are being collected.

TASKS

ETL task is a class with one required method 'run'. Class might have class variable 'arguments' as list of dictionaries, which is described bellow. Reason for the 'arguments' variable instead of paster's way of argument configuration is, that there might be different front-end for the tasks, such as web, not only command-line. From this simple dictionary description it is easy to generate command-line arguments or a web form.

Also it is possible to get list of all tasks: manager.all_tasks() -> dictionary of name=task_class.

(following text is from source documentation)

Task Arguments
~~~~~~~~~~~~~~

Task might optionally take one or more arguments. The user interfaces might want to know
about the taks arguments, to be able to display help in command-line, or create appropriate
form in the web based task runner. The class variable `arguments` is a list of dictionaries
where the dictionary contains keys: ``name`` - name of the argument, ``optional`` - flag whether
the argument is required or optional. ``type`` - argument type.

In command-line interface, the optional arguments are handled through --argument. Currently all
arguments are of type ``string``, except ``flag`` which is handled as boolean, storing ``True`` to the
argument, when present on command-line.

Subclassing decisions
~~~~~~~~~~~~~~~~~~~~~

* configurability of tasks should be kept at minimum. More configuration options introduce more noise
  and less obvious transparency. 
* there should be no more arguments for a task, than context specification (such as dataset, chunk) and
  one or two flags to control the process, if not really necessary
  
Recommended options:

* `force` flag - if task is destructive, force running without requiring to explicitly prepare
  acceptable state
* `test` flag - do not process all data, just a sample. this flag is preffered to explicit sample size.
  should be configurable in a configuration file.

Will definitely have to discuss it over skype/irc.

MIGRATION COMPLIANCE NOTE

Current new ETL tasks might be easily migrated to this manager soon. They might still use MongoDB as their backend. What is missing in the manager is possibility to plug-in other modules. Currenlty only subclasses within the manager package are considered. Well … it is still a prototype :-)

If you have any questions, comments or worries, let me know.

Stefan Urbanek

senior business intelligence consultant
http://knowerce.com


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/openspending-dev/attachments/20110809/836f35d7/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Image 2011.08.09 0-24-24.png
Type: image/png
Size: 22573 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/openspending-dev/attachments/20110809/836f35d7/attachment-0001.png>


More information about the openspending-dev mailing list