[openspending-dev] Data upload failed

Mon Feb 6 10:12:52 UTC 2012

On 5 February 2012 11:06, Gregor Aisch <gregor.aisch at okfn.org> wrote:
>
> First of all, I totally agree that, for a project like OpenSpending,
> data-cleansing is out of scope. Also I assume that you guys already had a
> lot of discussion on this topic.
>
> But still, I think that, from a user's perspective, there must be some
> data-cleansing help from somewhere. If we keep ignoring this need, many
> users won't upload their datasets. Btw, do we already track the conversion
> rate of the dataset upload form?

I think the key thing here may be a validator (which could even be run
regularly at the DataHub level).
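
A minimal sketch of what such a validator could look like, using only
Python's standard csv module. The function name validate_csv and the
specific checks (comma-separated, named columns, same width on every row)
are assumptions for illustration, not the actual OpenSpending rules; the
point is that it returns readable messages rather than pointing at an RFC.

```python
import csv
import io


def validate_csv(text):
    """Return a list of human-readable problems; an empty list means it passed.

    Hypothetical checks only: comma-separated, non-empty column names,
    and the same number of columns on every row.
    """
    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    try:
        rows = list(reader)
    except csv.Error as exc:
        return ["Line %d: could not be parsed as CSV (%s)." % (reader.line_num, exc)]
    if not rows:
        return ["The file is empty."]

    problems = []
    header = rows[0]
    # A single header cell containing tabs usually means a TSV export.
    if len(header) == 1 and "\t" in header[0]:
        problems.append(
            "The header looks tab-separated; please export as comma-separated CSV."
        )
    if any(not name.strip() for name in header):
        problems.append("Row 1: every column needs a non-empty name.")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                "Row %d: expected %d columns, found %d."
                % (number, len(header), len(row))
            )
    return problems
```

Calling validate_csv("a,b\n1\n") reports that row 2 has one column where
two were expected, which is the kind of message a user can act on.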

> Here are 3 different scenarios (of which I prefer the last):
>
> Scenario A:
> User loads his data into Google Refine and makes sure everything is clean
> and correct. After that, the user clicks on "upload to openspending.org".
>
> Scenario B:
> User loads his data into Google Fusion Tables and makes sure everything is
> clean and correct. After that, he creates a new dataset on openspending.org
> and selects the "import from google fusion tables" feature. After inserting
> the spreadsheet ID the data is loaded into the OS model editor.
>
> Scenario C:
> The user loads his data into CKAN which recognizes many different kinds of
> table data formats (csv,tsv,xls,...). CKAN will provide an API for
> OpenSpending to import the table data in the required format (csv strict).

Or, equivalently, a bot notices that the dataset has been updated on the
DataHub, goes off and attempts to validate it. If validation fails, it
tries to improve the file (perhaps using messytables), resubmits the
improved file to the DataHub and notifies OS to import it.
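
The validate-then-improve step might be sketched roughly like this. It is
plain standard-library Python rather than messytables, and the names
(improve, CANDIDATE_DELIMITERS) and the "two or more columns, same width on
every row" rule are assumptions made for the example. Fetching from and
resubmitting to the DataHub, and notifying OS, are left out.

```python
import csv
import io

# Assumption: the only separators the bot tries, in this order.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def parse(text, delimiter):
    """Parse text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


def is_consistent(rows):
    """True if the header has 2+ columns and every row has the same width."""
    return (
        bool(rows)
        and len(rows[0]) > 1
        and all(len(row) == len(rows[0]) for row in rows)
    )


def improve(text):
    """Return (status, csv_text); status is 'valid', 'improved' or 'failed'."""
    if is_consistent(parse(text, ",")):
        return "valid", text
    # Validation failed: retry with the other separators and, if one of
    # them yields a consistent table, rewrite it as comma-separated CSV.
    for delimiter in CANDIDATE_DELIMITERS[1:]:
        rows = parse(text, delimiter)
        if is_consistent(rows):
            out = io.StringIO(newline="")
            csv.writer(out, lineterminator="\r\n").writerows(rows)
            return "improved", out.getvalue()
    return "failed", text
```

A bot would only resubmit when the status is 'improved', and on 'failed' it
would send the user back to their cleansing tool with the validator
messages.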

Rufus

> Of course, scenario A and C can be combined by enabling users to directly
> upload from Refine to CKAN (in a data format accepted by OS).
>
> On 5 February 2012 at 00:59, Friedrich Lindenberg wrote:
>
> Right, but this is a big difference between Refine and OpenSpending:
> Refine is a data-cleansing tool so one of the priorities is that it
> can parse almost any kind of input, as long as there is some basic way
> of guessing its content (in fact, you can even fine-tune the amount of
> guessing it will do by disabling type detection etc.).
>
> OpenSpending is quite different in that we are explicitly not handling
> data cleansing: there is a long and somewhat painful list [1] of
> formatting standards your data needs to conform to in order to be
> loadable. The reason behind this is that you really do need data that
> is consistent to perform any kind of analysis.
>
> At the same time, in order to do cleansing, you need something that is
> at least as powerful as Refine - and OS just cannot provide that. So
> I'd much rather send you back to your data-wrangling tool with some
> useful (!, not gonna cite RFCs) messages than attempt to do half-assed
> data cleansing on the fly.
>
> Of course, CSV/TSV is an edge case here, but the general rule applies:
> OS is strict in what it accepts, so that the outcome will remain
> usable.
>
> tl;dr - I don't think Postel's law applies to databases.
>
> - Friedrich
>
> [1] http://openspending.org/help/data-cleansing.html#some-common-problems
>
>
> On Sat, Feb 4, 2012 at 11:46 PM, Gregor Aisch <gregor.aisch at okfn.org> wrote:
>
> Also, generally speaking, when designing a system for uploading data, I'd
> always prefer the try-to-read-everything strategy over forcing
> (inexperienced) users to convert data to some nerdy RFC standards. In fact,
> 80% of our users will not try to upload their data again after facing an
> error message like "sorry, but your data is not in the right format (=you
> suck). please read the RFC 4180 for more details (=just give it up,
> stupid).".
>
> I really love how Refine handles data imports.
>
> On 5 February 2012 at 00:24, Friedrich Lindenberg wrote:
>
> Hey Gregor,
>
> thanks for trying this but I'm not sure we want to support this - I've
> actually limited the set of things messytables will do in OpenSpending
> intentionally because I think that when data is handed to OS, it
> should already be formatted properly (which includes using actual
> CSV).
>
> I'll soon have to do a lot of fixes against messytables for the DGU
> spend, but suspect enabling all this in OpenSpending may actually lead
> to more ambiguity than just having a clear rule
> (http://tools.ietf.org/html/rfc4180)
>
> What do you think?
>
> - Friedrich
>
> On Sat, Feb 4, 2012 at 6:42 PM, Gregor Aisch <gregor.aisch at okfn.org> wrote:
>
> Tried to upload some spending data to openspending to find out that the
> automated CSV recognition failed to detect the tab-separated table..
>
> Seems to be a bug in messytables so I added a new issue. Who's maintaining
> that package?
>
> https://github.com/okfn/messytables/issues/3
>
> _______________________________________________
> openspending-dev mailing list
> openspending-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/openspending-dev

-- 
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/