[wdmmg-discuss] Infer Dataset Size

William Waites ww at styx.org
Tue Feb 22 13:00:51 UTC 2011


The general case is hard (NP-complete). To compare dataset A with dataset
B you need to make sure you are comparing apples to apples, so to speak. So
you need to know the "true" information content of each dataset after stripping
out any redundant information. Redundant information is information that
can be inferred from other information in the dataset according to some
rules. Minimising a dataset means applying the rules in reverse, that is,
removing whatever the rules could regenerate. If the rules are order
independent then you can do this just once without problem. If the rules
are order dependent then you have to repeat the operation for every
possible ordering of the data.
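
To make the order-independent case concrete, here is a minimal sketch in
Python. The field names ("id", "parent", "amount") and the single rule (an
entry is redundant when its amount equals the sum of its direct children)
are assumptions for illustration, not the actual OpenSpending data model.

```python
from collections import defaultdict


def minimise(entries):
    """Drop every entry that the single rule below can regenerate.

    Rule (an assumption for illustration): an entry is redundant when its
    amount equals the sum of the amounts of its direct children.
    """
    child_total = defaultdict(int)
    for entry in entries:
        if entry["parent"] is not None:
            child_total[entry["parent"]] += entry["amount"]

    # Every check uses the original amounts, so the result does not depend
    # on the order in which entries are visited.
    return [
        entry
        for entry in entries
        if not (entry["id"] in child_total
                and child_total[entry["id"]] == entry["amount"])
    ]


# Hypothetical data: amounts are integers to avoid float comparison issues.
entries = [
    {"id": 1, "parent": None, "amount": 100},  # aggregate of 2 and 3
    {"id": 2, "parent": 1, "amount": 60},
    {"id": 3, "parent": 1, "amount": 40},
]
minimal = minimise(entries)
print([entry["id"] for entry in minimal])  # prints [2, 3]
```

Because the rule only ever looks at the original amounts, one pass is
enough. A rule whose outcome changes with the order of application would
not allow this shortcut.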

This is related to the literature on compression. In some sense you can
get an idea of relative size just by running a good compression algorithm
like gzip or bzip2 on both datasets, but this won't tell you anything about
information as measured by number of spending entries; it will only tell
you the information content as measured in bits.
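
As a rough illustration of the compression idea, the sketch below uses
Python's standard gzip and bz2 modules. The two sample datasets are made up;
they have the same number of lines but compress to very different sizes,
which is exactly why this measure says nothing about entry counts.

```python
import bz2
import gzip


def compressed_sizes(data: bytes) -> dict:
    """Return the raw and compressed sizes of a serialised dataset, in bytes."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Two hypothetical datasets with 1000 lines each.
repetitive = b"health,100\n" * 1000
varied = "".join(f"dept{i},{i * 37 % 1000}\n" for i in range(1000)).encode()

for name, data in [("repetitive", repetitive), ("varied", varied)]:
    print(name, compressed_sizes(data))
```

The repetitive dataset shrinks to a small fraction of the varied one even
though both hold 1000 entries.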

A rough guide obtained by stripping out all aggregates above the individual
entry level might work, but since the entries themselves are often
undifferentiated aggregates, you have no guarantee that both datasets do
this at the same level. So it depends what you mean by size. Does size mean
"number of entries at the most granular level available"? Even the
rule-based approach above assumes the most granular level is the same in
both datasets...
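
If "size" is taken to mean the most granular entries available, a rough
guide could look like the sketch below. It assumes each entry carries
hypothetical "id", "parent" and "amount" fields and treats any entry that no
other entry names as its parent as a leaf. The caveat above still applies:
leaves in two datasets need not sit at the same level of detail.

```python
def leaf_summary(entries):
    """Return (number of leaf entries, total amount over leaf entries)."""
    # Any id that appears as someone's parent is an aggregate, not a leaf.
    parents = {e["parent"] for e in entries if e["parent"] is not None}
    leaves = [e for e in entries if e["id"] not in parents]
    return len(leaves), sum(e["amount"] for e in leaves)


# Hypothetical data: one aggregate entry and its two children.
entries = [
    {"id": 1, "parent": None, "amount": 100},
    {"id": 2, "parent": 1, "amount": 60},
    {"id": 3, "parent": 1, "amount": 40},
]
count, total = leaf_summary(entries)
print(count, total)  # prints 2 100
```

Summing only the leaves avoids counting the aggregate of 100 a second time,
which is what a naive sum over all entries would do.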

Hope this helps.

Cheers,
-w

* [2011-02-22 12:52:24 +0100] Stefan Wehrmeyer <stefanwehrmeyer at gmail.com> wrote:

] Hello,
] 
] I'm working on the new homepage for OpenSpending and I want to put up a visualization similar to this: http://openspending.org/dataset/cra but I want to compare the size of the different datasets.
] 
] I understand that it is not straightforward to determine the size of the dataset from the spending entries, or is it?
] I'm looking for something like total budget amount for a dataset, but that might not be the sum of the (non-aggregated) entries.
] 
] Do I manually need to enter a total amount for each dataset from some source or is there a feasible way to infer this number from the dataset's entries?
] 
] Cheers,
] Stefan

-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45



