[okfn-labs] Data validation reporting
Paul Walsh
paulywalsh at gmail.com
Sat Nov 22 19:00:43 UTC 2014
Excellent, thanks for the links.
> On 20 Nov 2014, at 18:31, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>
> Hey,
>
>> The point is not Node-based ETL or not - you mentioned Python in your post, and I didn’t want to commit to a Python solution (or any language) while we are talking about needs and use cases. Python and Node were mentioned as they are the most obvious candidates to me.
>
> Sure, I was just ranting a bit. I guess my use case starts with “As a Python developer, I want to…” ;)
>
>> Can you point me to any examples of Stiivi’s work on “expected result detection” as you describe?
>
> Check out:
>
> https://github.com/Stiivi/brewery/blob/master/examples/audit_unknown_csv.py
> https://github.com/Stiivi/brewery/blob/master/brewery/probes.py
>
> Cheers,
>
> - Friedrich
>
>
>>
>> Thanks
>>
>>> On 20 Nov 2014, at 12:44, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>>>
>>> Hey,
>>>
>>>> On 19 Nov 2014, at 19:46, Paul Walsh <paulywalsh at gmail.com> wrote:
>>>>> Something I would really need is a stand-alone, light-weight structured logging system for ETL processes in Python.
>>>>
>>>> That is definitely in line with what we are aiming for here - something modular/pluggable rather than something tied to a specific validator. Not sure yet whether Python or Node.
>>>
>>> I struggle to see the point in Node-based ETL: the callbacks thing means it’s easy to just cram all the data down your pipeline at once if you don’t explicitly sequence it; and the programming language and libraries just aren’t good for data processing. No proper handling of dates and times, shitty NLP tools, even file handling is awkward. In short: I want to meet the people who would actually do this. And possibly bring them before the ICC.
>>>
>>>>> That's the more general version of what we were doing with the UK25k stuff a few years ago, and it would not overlap with CSVLint. Features:
>>>>>
>>>>> * Be able to log structured data from ETL processes
>>>>> * Generate custom, jinja2-based reports from them
>>>>> * Have a set of pre-defined gauges to test for stuff like null values, extreme values etc.
>>>>> * Have an emailer for certain really nasty events
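[Editor's note: a minimal Python sketch of the structured-logging idea listed above. The class and method names are illustrative placeholders, not an existing library.]

```python
# Hypothetical sketch: collect structured events from an ETL run so they
# can later be rendered into a report (e.g. via a jinja2 template) or
# trigger an emailer for nasty events. All names are illustrative.
import json
import logging


class StructuredLogger:
    """Accumulates structured events from an ETL process."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def log(self, level, event, **fields):
        # Keep the event as data (for reports) and also emit it as a
        # JSON line through the stdlib logger (for live monitoring).
        record = {"run": self.run_id, "level": level, "event": event, **fields}
        self.events.append(record)
        logging.getLogger("etl").log(getattr(logging, level), json.dumps(record))

    def report_context(self):
        """Context dict a jinja2 template could render into a report."""
        return {"run": self.run_id, "events": self.events}


log = StructuredLogger("daily-load")
log.log("WARNING", "null_values", column="amount", fraction=0.12)
```

A custom report would then just be `template.render(**log.report_context())`; an emailer could filter `log.events` for ERROR-level records.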
>>>>
>>>> By gauges do you mean something beyond a schema, and more like:
>>>>
>>>> * expect values in col Y to be in range X
>>>> * if more than X% of NULL in col Y, do something (warning after validation, etc)
>>>>
>>>> Let's be explicit here as it sounds interesting and quite useful for large datasets.
>>>
>>> Yep, that sort of stuff. It’s a notion that Stiivi had been exploring in earlier versions of this ETL framework, and it seems really cool to me.
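[Editor's note: a sketch of the two gauges Paul describes above - a range check and a NULL-ratio threshold. Function names and the dict result shape are hypothetical, not drawn from Stiivi's brewery probes.]

```python
# Illustrative gauge functions for column-level data quality checks.

def range_gauge(values, low, high):
    """Count values that fall outside the expected [low, high] range."""
    outliers = [v for v in values if v is not None and not (low <= v <= high)]
    return {"gauge": "range", "violations": len(outliers)}


def null_ratio_gauge(values, max_ratio):
    """Warn when the share of NULLs in a column exceeds a threshold."""
    nulls = sum(1 for v in values if v is None)
    ratio = nulls / len(values) if values else 0.0
    return {"gauge": "null_ratio", "ratio": ratio, "warn": ratio > max_ratio}


col = [1, 2, None, 250, None, 3]
print(range_gauge(col, 0, 100))      # 250 is the one out-of-range value
print(null_ratio_gauge(col, 0.25))   # 2 of 6 values are NULL, so it warns
```

Running each gauge over every column after validation gives the "warning after validation" behaviour mentioned above.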
>>>
>>> - Friedrich
>>
>
More information about the okfn-labs mailing list