[openspending-dev] Data upload failed

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Mon Feb 6 15:32:22 UTC 2012


Hi Rufus,

On Mon, Feb 6, 2012 at 10:12 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 5 February 2012 11:06, Gregor Aisch <gregor.aisch at okfn.org> wrote:
>>
>> First of all, I totally agree that, for a project like OpenSpending,
>> data-cleansing is out of scope. Also I assume that you guys already had a
>> lot of discussion on this topic.
>>
>> But still, I think that, from a user's perspective, there must be some data
>> cleansing help from somewhere. If we keep ignoring this need, many users
>> won't upload their datasets. Btw, do we already track the conversion rate of
>> the dataset upload form?
>
> Key thing I think here may be a validator. (Which could even get run
> regularly at the DataHub level).

So it's a good thing we already have two of these :) We do need to
clarify the workflow for the local data validator, though, given that
the default way of creating the model is increasingly the web UI.
Maybe osvalidate should expand "de-bund" to
http://openspending.org/de-bund/model.json in the background.
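
Roughly what I have in mind for that expansion, as a Python sketch
(the URL pattern is the one above; the function name and how
osvalidate would actually call it are just assumptions):

    import json
    import urllib.request

    OPENSPENDING_BASE = "http://openspending.org"

    def resolve_model(dataset_ref):
        """Expand a bare dataset name ("de-bund") to its model.json URL,
        or accept an explicit URL, and fetch the model."""
        if dataset_ref.startswith(("http://", "https://")):
            model_url = dataset_ref
        else:
            model_url = "%s/%s/model.json" % (OPENSPENDING_BASE, dataset_ref)
        with urllib.request.urlopen(model_url) as response:
            return json.load(response)

    # e.g. model = resolve_model("de-bund"), then validate the data
    # against that model as osvalidate does today with a local file.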

>> Here are 3 different scenarios (of which I prefer the last):
>>
>> Scenario A:
>> User loads his data into Google Refine and makes sure everything is clean
>> and correct. After that, the user clicks on "upload to openspending.org".

That requires some Java coding, but I think the CKAN guys need to do
it anyway; maybe we can just jump on the bandwagon once they're done.

>> Scenario B:
>> User loads his data into Google Fusion Tables and makes sure everything is
>> clean and correct. After that, he creates a new dataset on openspending.org
>> and selects the "import from google fusion tables" feature. After inserting
>> the spreadsheet ID, the data is loaded into the OS model editor.

I think Fusion Tables doesn't help us there: it's also not primarily
about data cleansing, although that may change. So this just
introduces more abstraction without much gain.

>> Scenario C:
>> The user loads his data into CKAN which recognizes many different kinds of
>> table data formats (csv,tsv,xls,...). CKAN will provide an API for
>> OpenSpending to import the table data in the required format (csv strict).

It does that at the moment, but it's kind of hard to explain. Plus,
I've never actually seen OS's CKAN integration work; I really need to
try that.
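
For reference, talking to a CKAN instance's action API directly looks
roughly like this (the base URL and dataset name are placeholders, and
the actual OS integration may use an older API, so this is only a
sketch):

    import json
    import urllib.request

    CKAN_BASE = "http://datahub.io"  # placeholder CKAN instance

    def find_csv_resource(dataset_name):
        """Look up a dataset and return the URL of its first CSV resource."""
        url = "%s/api/3/action/package_show?id=%s" % (CKAN_BASE, dataset_name)
        with urllib.request.urlopen(url) as response:
            package = json.load(response)["result"]
        for resource in package.get("resources", []):
            if (resource.get("format") or "").lower() == "csv":
                return resource["url"]
        return None

    # csv_url = find_csv_resource("some-spending-dataset")
    # OpenSpending could then import csv_url with the model from the editor.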

> Or, equivalently, a bot notices this has got updated on the DataHub.
> Goes off and attempts to validate. If it fails it tries to improve
> (perhaps using messytables) and resubmits the improved file to the
> DataHub and notifies OS to import.

I'm not sure how this would work: would the bot just try to parse the
CSV, or also attempt to validate against an OpenSpending model?
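
Either way, I'd guess the messytables step would look roughly like
this (only a sketch of the cleanup half; the DataHub notification and
resubmission parts are left out, and the function name is made up):

    import csv
    import sys

    from messytables import (CSVTableSet, headers_guess, headers_processor,
                             offset_processor, type_guess, types_processor)

    def normalise_csv(in_path, out_path):
        """Guess headers and column types, then re-emit a strict CSV."""
        with open(in_path, "rb") as fh:
            row_set = CSVTableSet(fh).tables[0]
            offset, headers = headers_guess(row_set.sample)
            row_set.register_processor(headers_processor(headers))
            row_set.register_processor(offset_processor(offset + 1))
            types = type_guess(row_set.sample, strict=True)
            row_set.register_processor(types_processor(types))

            with open(out_path, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(headers)
                for row in row_set:
                    writer.writerow([cell.value for cell in row])

    if __name__ == "__main__":
        normalise_csv(sys.argv[1], sys.argv[2])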

Can we get DSPL in CKAN, please?

- Friedrich



