[okfn-labs] Data validation reporting
Friedrich Lindenberg
friedrich at pudo.org
Thu Nov 20 10:44:37 UTC 2014
Hey,
> On 19 Nov 2014, at 19:46, Paul Walsh <paulywalsh at gmail.com> wrote:
>> Something I would really need is a stand-alone, light-weight structured logging system for ETL processes in Python.
>
> That is def. in line with what we are aiming for here - something modular/pluggable rather than something tied to a specific validator. Not sure yet if Python or Node.
I struggle to see the point in Node-based ETL: the callback model makes it easy to cram all the data down your pipeline at once if you don’t explicitly sequence it, and the language and its libraries just aren’t good for data processing. No proper handling of dates and times, shitty NLP tools, even file handling is awkward. In short: I want to meet the people who would actually do this. And possibly bring them before the ICC.
>> That's the more general version of what we were doing with the UK25k stuff a few years ago, and it would not overlap with CSVLint. Features:
>>
>> * Be able to log structured data from ETL processes
>> * Generate custom, jinja2-based reports from them
>> * Have a set of pre-defined gauges to test for stuff like null values, extreme values etc.
>> * Have an emailer for certain really nasty events
>
> By gauges do you mean something beyond a schema, and more like:
>
> * expect values in col Y to be in range X
> * if more than X% of NULL in col Y, do something (warning after validation, etc)
>
> Let's be explicit here as it sounds interesting and quite useful for large datasets.
Yep, that sort of stuff. It’s a notion that Stiivi had been exploring in earlier versions of this ETL framework, and it seems really cool to me.
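To make the two gauge examples above concrete, here is a rough sketch in plain Python. It is only an illustration of the idea, not code from Stiivi's framework or any existing library: the names GaugeResult, null_fraction_gauge and range_gauge are made up for this mail, and rows are assumed to be simple dicts.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List

# Hypothetical names throughout: this is a sketch, not an existing API.
Row = Dict[str, Any]


@dataclass
class GaugeResult:
    """Structured outcome of one gauge, ready to be logged or put in a report."""
    gauge: str
    column: str
    passed: bool
    detail: str


def _is_null(value: Any) -> bool:
    # Assumption: None and the empty string both count as NULL.
    return value is None or value == ""


def null_fraction_gauge(rows: Iterable[Row], column: str,
                        max_fraction: float) -> GaugeResult:
    """Fail if more than max_fraction of the values in column are NULL."""
    rows = list(rows)
    total = len(rows)
    nulls = sum(1 for row in rows if _is_null(row.get(column)))
    fraction = nulls / total if total else 0.0
    return GaugeResult("null_fraction", column, fraction <= max_fraction,
                       "%d of %d values are null" % (nulls, total))


def range_gauge(rows: Iterable[Row], column: str,
                low: float, high: float) -> GaugeResult:
    """Fail if any non-NULL value in column falls outside [low, high]."""
    outliers: List[Any] = [
        row[column] for row in rows
        if not _is_null(row.get(column)) and not (low <= row[column] <= high)
    ]
    return GaugeResult("range", column, not outliers,
                       "%d values outside [%s, %s]" % (len(outliers), low, high))


if __name__ == "__main__":
    sample = [{"amount": 10}, {"amount": None}, {"amount": 5000}, {"amount": 20}]
    for result in (null_fraction_gauge(sample, "amount", 0.2),
                   range_gauge(sample, "amount", 0, 1000)):
        print(result)
```

Each gauge returns a small structured record instead of raising, so the same results could feed a report template or trigger an email for the really nasty cases.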
- Friedrich