[openspending-dev] [okfn-labs] Data validation reporting

Paul Walsh paulywalsh at gmail.com
Wed Nov 19 17:46:17 UTC 2014


Hi Friedrich,

> On 19 Nov 2014, at 14:50, Friedrich Lindenberg <friedrich.lindenberg at okfn.org> wrote:
> 
> Something I would really need is a stand-alone, light-weight structured logging system for ETL processes in Python.

That is definitely in line with what we are aiming for here: something modular/pluggable rather than something tied to a specific validator. Not sure yet whether it will be Python or Node.

> That's the more general version of what we were doing with the UK25k stuff a few years ago, and it would not overlap with CSVLint. Features: 
> 
> * Be able to log structured data from ETL processes
> * Generate custom, jinja2-based reports from them
> * Have a set of pre-defined gauges to test for stuff like null values, extreme values etc. 
> * Have an emailer for certain really nasty events 

By gauges do you mean something beyond a schema, and more like:

* expect values in col Y to be in range X
* if more than X% of the values in col Y are NULL, do something (e.g. a warning after validation)

Let's be explicit here, as it sounds interesting and quite useful for large datasets.
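
To make the two examples above concrete, here is a rough sketch of what such gauges could look like in Python. Everything in it is an assumption for the sake of discussion: the function names (null_fraction, out_of_range, run_gauges), the thresholds and the sample data are made up, and it uses only the standard library rather than any existing validator.

```python
import csv
import io


def null_fraction(rows, column):
    """Fraction of rows whose value in `column` is empty or missing."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) in (None, ""))
    return nulls / len(rows)


def out_of_range(rows, column, low, high):
    """(row_number, value) pairs whose numeric value falls outside [low, high].

    Empty and non-numeric cells are skipped here; a schema check or the
    null gauge would be the place to report those.
    """
    bad = []
    for number, row in enumerate(rows, start=1):
        raw = row.get(column)
        if raw in (None, ""):
            continue
        try:
            value = float(raw)
        except ValueError:
            continue
        if not low <= value <= high:
            bad.append((number, value))
    return bad


def run_gauges(rows, column, low, high, max_null_fraction):
    """Run both gauges and collect structured warnings (plain dicts)."""
    warnings = []
    fraction = null_fraction(rows, column)
    if fraction > max_null_fraction:
        warnings.append({"gauge": "null_fraction", "column": column,
                         "value": fraction, "threshold": max_null_fraction})
    for number, value in out_of_range(rows, column, low, high):
        warnings.append({"gauge": "range", "column": column,
                         "row": number, "value": value})
    return warnings


# Hypothetical sample data: one empty amount and one extreme amount.
SAMPLE = "id,amount\n1,100\n2,\n3,999999\n4,50\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
warnings = run_gauges(rows, "amount", low=0, high=10000, max_null_fraction=0.2)
for warning in warnings:
    print(warning)
```

The gauges return plain dicts rather than raising, so the "do something" part (a warning after validation, an email, a report) can be decided by whatever consumes them.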

> 
> Here's a very duct-tapey version of this: https://github.com/pudo/scrapekit/blob/master/scrapekit/logs.py - basically just hacked the Python logger to make JSON. Wanted to extract it as "reportkit", but haven't gotten around to that. 

Nice hack.
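
For anyone following along, the general idea of making the Python logger emit JSON can be sketched with the standard library alone. This is not the scrapekit code, only a minimal illustration under my own assumptions: the JSONFormatter class, the logger name "etl.sketch" and the field names are hypothetical.

```python
import io
import json
import logging

# Attribute names every LogRecord carries by default; anything else on a
# record was supplied by the caller through `extra`.
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JSONFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the structured fields passed via `extra` into the payload.
        for key, value in vars(record).items():
            if key not in _STANDARD and key not in ("message", "asctime"):
                payload[key] = value
        return json.dumps(payload, default=str, sort_keys=True)


# Log to an in-memory stream so the output can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())

log = logging.getLogger("etl.sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.warning("null value", extra={"column": "amount", "row": 2})

event = json.loads(stream.getvalue())
print(event)
```

Because each line is a self-contained JSON object, a later reporting step can read the log back and group events by column, row or level.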

Paul

> 
> - Friedrich 
> 
> 
>> On Wed, Nov 19, 2014 at 1:05 PM, Ross Jones <ross at servercode.co.uk> wrote:
>> Oh I see.  If it’s any use, the csvlint user stories from the workshops that were run are at https://docs.google.com/spreadsheet/ccc?key=0AiswT8ko8hb4dERHUVBKYlBZVnlYSHI5M2V1TVpodlE&usp=sharing#gid=0
>> 
>> Ross
>> 
>> 
>>> On 19 Nov 2014, at 09:38, Paul Walsh <paulywalsh at gmail.com> wrote:
>>> 
>>> Hi Ross,
>>> 
>>> Yes, I’ve looked at csvlint. Before getting to the solution to the problem (csvlint, something else) we first want to ensure we know the scope of the problem itself, define use cases, etc. 
>>> 
>>> But sure, csvlint does meet some of the current requirements we have in mind.
>>> 
>>> Paul
>>> 
>>> 
>>>>> On 19 Nov 2014, at 11:09, Ross Jones <ross at servercode.co.uk> wrote:
>>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>> On 19 Nov 2014, at 09:06, Paul Walsh <paulywalsh at gmail.com> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I’m working on data validation (particularly *tabular* data validation) with Rufus.
>>>>> 
>>>>> In particular, we are looking to provide a great interface to *reporting* on the validation flow. In general, this means error reports resulting from the validation process, but also summary stuff (what happened, data stats).
>>>> 
>>>> 
>>>> Have you investigated http://csvlint.io (https://github.com/theodi/csvlint) yet?  That seems to solve most of the problems that you mentioned, and I’m sure it could be extended to support the others.
>>>> 
>>>> 
>>>> Ross
>> 
>> 
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
> 