[openspending-dev] [okfn-labs] Data validation reporting
Paul Walsh
paulywalsh at gmail.com
Thu Nov 20 12:50:18 UTC 2014
Hi,
The point is not whether ETL should be Node-based or not. You mentioned Python in your post, and I didn’t want to commit to a Python solution (or any language) while we are still talking about needs and use cases. Python and Node were mentioned because they are the most obvious candidates for me.
Can you point me to any examples of Stiivi’s work on “expected result detection” as you describe?
Thanks
> On 20 Nov 2014, at 12:44, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>
> Hey,
>
>> On 19 Nov 2014, at 19:46, Paul Walsh <paulywalsh at gmail.com> wrote:
>>> Something I would really need is a stand-alone, light-weight structured logging system for ETL processes in Python.
>>
>> That is def. in line with what we are aiming for here - something modular/pluggable rather than something tied to a specific validator. Not sure yet if Python or Node.
>
> I struggle to see the point in Node-based ETL: the callbacks thing means it’s easy to just cram all the data down your pipeline at once if you don’t explicitly sequence it; and the programming language and libraries just aren’t good for data processing. No proper handling of dates and time, shitty NLP tools, even file handling is awkward. In short: I want to meet the people who would actually do this. And possibly bring them before the ICC.
>
>>> That's the more general version of what we were doing with the UK25k stuff a few years ago, and it would not overlap with CSVLint. Features:
>>>
>>> * Be able to log structured data from ETL processes
>>> * Generate custom, jinja2-based reports from them
>>> * Have a set of pre-defined gauges to test for stuff like null values, extreme values etc.
>>> * Have an emailer for certain really nasty events
>>
>> By gauges do you mean something beyond a schema, and more like:
>>
>> * expect values in col Y to be in range X
>> * if more than X% of NULL in col Y, do something (warning after validation, etc)
>>
>> Let's be explicit here, as it sounds interesting and quite useful for large datasets.
>
> Yep, that sort of stuff. It’s a notion that Stiivi had been exploring in earlier versions of this ETL framework, and it seems really cool to me.
>
> - Friedrich
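
To make the "gauges" idea from the thread concrete, here is a minimal sketch in Python of the two checks discussed above (values in a column within a range, and a cap on the share of NULLs in a column), with each result emitted as a structured log record. Every name here (GaugeResult, null_ratio_gauge, range_gauge, to_log_line, log_result) is hypothetical and made up for illustration. It is not taken from Stiivi's framework, CSVLint, or any existing library, and it assumes rows arrive as a list of dicts.

```python
import json
import logging
from dataclasses import dataclass, asdict


@dataclass
class GaugeResult:
    """Outcome of one gauge (hypothetical structure, for illustration only)."""
    gauge: str
    column: str
    passed: bool
    detail: dict


def null_ratio_gauge(rows, column, max_ratio):
    """Fail if more than max_ratio of the values in `column` are None or empty."""
    rows = list(rows)
    nulls = sum(1 for row in rows if row.get(column) in (None, ""))
    ratio = nulls / len(rows) if rows else 0.0
    return GaugeResult("null_ratio", column, ratio <= max_ratio,
                       {"ratio": ratio, "max_ratio": max_ratio})


def range_gauge(rows, column, low, high):
    """Fail if any non-null value in `column` lies outside [low, high]."""
    out_of_range = [row[column] for row in rows
                    if row.get(column) is not None
                    and not (low <= row[column] <= high)]
    return GaugeResult("range", column, not out_of_range,
                       {"low": low, "high": high, "out_of_range": out_of_range})


def to_log_line(result):
    """Serialise a gauge result as one JSON object per log line."""
    return json.dumps(asdict(result), sort_keys=True)


def log_result(logger, result):
    """Log passing gauges at INFO and failing gauges at WARNING."""
    level = logging.INFO if result.passed else logging.WARNING
    logger.log(level, to_log_line(result))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("etl.gauges")
    sample = [{"amount": 10}, {"amount": None}, {"amount": 5000}, {"amount": 20}]
    log_result(log, null_ratio_gauge(sample, "amount", max_ratio=0.1))
    log_result(log, range_gauge(sample, "amount", low=0, high=1000))
```

Because each record is a plain JSON line, a separate step could read the log back to render a report or to send an email on the nasty events. That part is left out here since the thread does not specify how it should work.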