[OpenSpending-discuss] How Spending Stories Spots Errors in Spending Data

Tue Dec 6 10:31:03 UTC 2011

On Tue, Dec 6, 2011 at 9:12 PM, Friedrich Lindenberg
<friedrich.lindenberg at okfn.org> wrote:
> On Tue, Dec 6, 2011 at 11:01 AM, Alex (Maxious) Sadleir
> <maxious at gmail.com> wrote:
>> Something I have been considering in preparing Australian data for
>> OpenSpending is that it would be very interesting to have a software
>> package to rate spending items like an email spam filter detects spam.
>> The most interesting spending items to a human are often the ones with
>> the most things "wrong" with them on a data level - edited to increase
>> the value by double, recorded in a database after the money was spent,
>> a government agency or supplier dealing in large sums all of a sudden.
>
> This is really important and we've been discussing it a bit. The
> question really is: what kind of algorithms/heuristics can we use to
> detect outliers? Are these techniques one-size-fits-all, or do we need
> to select different ones for each dataset (I'm pretty sure they need
> to be different for spending and budget, but think we can generalize
> otherwise)? And: when do they get run? Are these still QA measures
> we're talking about or is it actual data-mining on the loaded data?
>
> I'd tend to see it as the latter and was actually thinking about the
> idea of analytics snippets: we could just offer the option to run
> pieces of JavaScript (it's easy to sandbox and learn) on an entire
> dataset, emitting "matching" records into a "review bucket". We can
> then share the snippets between datasets - if someone implements a
> nice algo, this could be parameterized and re-used.
>
> What do people think?

Definitely more on the data mining side! I think there are some
algorithms/statistical techniques that any financial dataset could
benefit from, such as the Gini coefficient, the Pareto distribution,
or Benford's Law. This could also serve to introduce people to a
slightly more advanced world of data science if it's presented nicely.
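
To make the Benford's Law idea concrete, here is a rough sketch of a
leading-digit check. It is only an illustration, not anything that
exists in OpenSpending: the function names (benfordExpected,
leadingDigit, benfordReport) and the sample amounts are made up, and
a real check would need far more records plus a proper statistical
test before calling anything suspicious.

```javascript
// Expected share of leading digit d under Benford's Law: log10(1 + 1/d).
function benfordExpected(d) {
  return Math.log10(1 + 1 / d);
}

// First significant digit of a non-zero amount (sign is ignored).
function leadingDigit(amount) {
  // toExponential() gives e.g. "4.2e+3" or "5.7e-2", so the first
  // character is always the leading significant digit.
  return Number(Math.abs(amount).toExponential()[0]);
}

// Compare observed leading-digit shares with the Benford expectation.
// `amounts` is assumed to be a plain array of numbers.
function benfordReport(amounts) {
  const counts = new Array(10).fill(0);
  let total = 0;
  for (const a of amounts) {
    if (!Number.isFinite(a) || a === 0) continue; // skip unusable values
    counts[leadingDigit(a)] += 1;
    total += 1;
  }
  const report = [];
  for (let d = 1; d <= 9; d++) {
    const observed = total > 0 ? counts[d] / total : 0;
    report.push({ digit: d, observed: observed, expected: benfordExpected(d) });
  }
  return report;
}

// Hypothetical amounts, purely for illustration.
const report = benfordReport([120, 1800, 13.5, 240, 9100, 0]);
```

Digits whose observed share sits far from the expected share would be
the ones worth a closer look, though a gap on a small dataset means
very little.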

JavaScript sounds interesting, especially considering the things
people have managed to do with CouchDB views. The wisdom of crowds via
sharing snippets would be good too - I could spend a lot of time
writing inefficient heuristics on my own ;)
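
For what it's worth, a shareable snippet in the spirit of a CouchDB
map function might look something like the sketch below. Everything
here is hypothetical: runSnippet, largeAmount, the `threshold`
parameter and the record fields (id, supplier, amount) are invented
for illustration, and the sandboxing mentioned above is not handled
at all.

```javascript
// Run one snippet over every record. Whatever the snippet emits ends
// up in the review bucket together with the record that triggered it.
function runSnippet(snippet, records, params) {
  const reviewBucket = [];
  for (const record of records) {
    const emit = function (reason) {
      reviewBucket.push({ record: record, reason: reason });
    };
    snippet(record, emit, params);
  }
  return reviewBucket;
}

// Example snippet: flag any item at or above a configurable threshold.
// The threshold comes in via params so the same snippet can be reused
// on datasets with very different scales.
function largeAmount(record, emit, params) {
  if (record.amount >= params.threshold) {
    emit("amount >= " + params.threshold);
  }
}

// Hypothetical records, purely for illustration.
const records = [
  { id: 1, supplier: "A", amount: 1200 },
  { id: 2, supplier: "B", amount: 950000 },
];
const bucket = runSnippet(largeAmount, records, { threshold: 100000 });
```

Keeping the dataset-specific values in params is what would let the
same snippet be shared between datasets.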