[ckan-dev] RFC: Data framework (Swiss + Brewery consolidation)

Seb Bacon seb.bacon at gmail.com
Tue Jan 4 10:09:31 UTC 2011


Hi,

As a newbie to the project, I don't have much specific input, but it
sounds like a generally Good Thing.

One thing I was wondering is how datastore schemas will be defined in
a backend-neutral way for non-relational *and* relational stores; or
is a domain-specific model part of the spec for an implementation?

If a developer needs to do quite a lot of work manually mapping to the
storage implementation anyway, there might be limited benefit to
writing and supporting a framework over letting them do it "by hand"
each time (given the most common use cases in a particular application
are likely to be csv -> specific database).

This is no doubt obvious to everyone who's been working on the project
for longer than me, so it would really help my understanding if you
could describe the current situation, with an example or two, to
illustrate the problems this proposal will solve.

Thanks,

Seb


On 3 January 2011 11:45, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
> Hi,
> After a discussion last week with Rufus about Swiss and Data Brewery, I think
> it would be a good idea to join the two frameworks together to create a
> common data handling API.
> Contents:
>
> REASON AND OBJECTIVE
> API
> ADAPTERS
> USAGE
> EXAMPLE
> RELATED CKAN TICKETS
> REFERENCES
> CONCLUSION
>
> REASON AND OBJECTIVE
> A unified, simple API for handling various kinds of structured data is
> currently missing. The objective is to create a data handling framework that
> would be able to:
> - read various structured data sources in a common, unified way
> - provide various structured data outputs
> - apply filters/transformations to structured data
> - preserve metadata information as much as possible
> API
> Data Streams: 'file IO'-like access to data sources/targets where, instead of
> 'bytes', data tuples + metadata (a field list) are streamed. With data
> streams you read and process data record-by-record. Current Swiss situation:
> Swiss reads the whole dataset into "TabularData" form, all at once, which is
> not very practical for large datasets.
> Source Data Stream (a minimal sketch follows this list):
> - initialize() - optional method for initializing the data stream. It is
> better to keep this out of the __init__ method for cases where we want to set
> up the whole data stream first and delay initialization of all sources until
> the stream is ready. The method should open a file, read field names (CSV
> header), ...
> - rows() - iterator over data tuples
> - fields - list of fields in the stream
> - records() - optional iterator over record dictionaries (field = value)
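> To make the interface concrete, here is a minimal sketch of what a CSV-backed
> source stream could look like (the class name and internals are illustrative
> only, not existing swiss or brewery code):
>
>     import csv
>
>     class SketchCSVSource(object):
>         """Illustrative source data stream backed by a CSV file."""
>
>         def __init__(self, path):
>             self.path = path
>             self.fields = None       # populated by initialize()
>             self._reader = None
>
>         def initialize(self):
>             # open the file and read field names from the CSV header
>             self._file = open(self.path)
>             self._reader = csv.reader(self._file)
>             self.fields = next(self._reader)
>
>         def rows(self):
>             # iterator over data tuples
>             for row in self._reader:
>                 yield tuple(row)
>
>         def records(self):
>             # optional iterator over {field: value} dictionaries
>             for row in self.rows():
>                 yield dict(zip(self.fields, row))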
> Target Data Stream (a matching sketch follows below):
>
> - initialize() - optional method for initializing the data stream, see
> initialize() in the source stream
>
> - append(object) - append an object to the data stream; the object should be
> a list, tuple or dictionary
> - fields - list of fields that the target data stream expects (for example
> if the target is a database table); can be used for error prevention
> More on streams: http://databrewery.org/doc/streams.html
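> A matching sketch of a target stream that appends records to a CSV file
> (again purely illustrative, not existing code):
>
>     import csv
>
>     class SketchCSVTarget(object):
>         """Illustrative target data stream writing to a CSV file."""
>
>         def __init__(self, path, fields):
>             self.path = path
>             self.fields = fields     # fields the target expects
>
>         def initialize(self):
>             # create the file and write the header row
>             self._file = open(self.path, 'w')
>             self._writer = csv.writer(self._file)
>             self._writer.writerow(self.fields)
>
>         def append(self, obj):
>             # accept a list/tuple or a {field: value} dictionary
>             if isinstance(obj, dict):
>                 row = [obj.get(name) for name in self.fields]
>             else:
>                 row = list(obj)
>             self._writer.writerow(row)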
> Data Stores - a data store is a repository of datasets, such as a relational
> database or a spreadsheet with multiple sheets (a sketch of a simple datastore
> follows after the link below):
> - dataset(name) - get the dataset with the given name/identifier
> - dataset_names - list datasets (tables)
> - destroy_dataset(name) - drop a table, remove a workbook, delete a file, ...
> (does not need to be supported by all datastores)
> - has_dataset(name) - check whether a dataset exists
> - create_dataset(name, fields) - create a new dataset, table, workbook,
> file, ...
> Data store functionality is provided by datastore adapters, such as
> sqlalchemy for all relational databases, mongodb, ...
> Dataset - a database table, spreadsheet workbook, directory of YAML files,
> ... Datasets can be used as data stream sources or targets.
> In addition to the stream source/target API, a dataset supports:
> - read_fields(limit) - try to guess field metadata from the dataset - peek at
> the CSV header, go through mongodb records, ...
> - truncate() - remove all records from the dataset (such as DELETE in a
> relational db)
> More on datastores: http://databrewery.org/doc/api/datastores.html
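> As an illustration of the datastore interface, here is a rough sketch of a
> datastore backed by a directory of CSV files; it reuses the SketchCSVSource
> and SketchCSVTarget classes from the sketches above, and all names are
> hypothetical:
>
>     import os
>
>     class SketchCSVDirectoryStore(object):
>         """Illustrative datastore: one CSV file per dataset in a directory."""
>
>         def __init__(self, path):
>             self.path = path
>
>         @property
>         def dataset_names(self):
>             # list datasets (one per .csv file in the directory)
>             return [name[:-4] for name in os.listdir(self.path)
>                     if name.endswith('.csv')]
>
>         def has_dataset(self, name):
>             return os.path.exists(self._file_for(name))
>
>         def dataset(self, name):
>             # expose an existing dataset as a source stream
>             return SketchCSVSource(self._file_for(name))
>
>         def create_dataset(self, name, fields):
>             # create a new dataset and return it as a target stream
>             target = SketchCSVTarget(self._file_for(name), fields)
>             target.initialize()
>             return target
>
>         def destroy_dataset(self, name):
>             os.remove(self._file_for(name))
>
>         def _file_for(self, name):
>             return os.path.join(self.path, name + '.csv')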
> ADAPTERS
> By sharing code from the Swiss framework and Brewery we can get the following
> data (streaming) adapters:
> - relational databases (from brewery, supported by SQLAlchemy),
> source + target
> - mongodb (from brewery), source + target
> - CSV (from swiss), source + target
> - google docs (from swiss), source only
> - HTML (from swiss), source + target
> - json (from swiss), source + target
> - xls (from swiss), source only
> USAGE
> The data framework is required for:
> - data preview
> - metadata discovery
> - a worker process for resource uploading: "it should work in a similar way to
> how document uploading on scribd/slideshare works, from the user's perspective.
> The resource is queued and not only mirrored ( = archived), but all necessary
> metadata (preview, fields, ...) is also extracted and stored back into
> ckan". [pudo:] "you upload something and when you come back a day later,
> magic fun has happened" (a sketch of such a worker follows after this list)
> - a simplified process of quality auditing (probes within streams)
> - the ability to write abstract structured data
> cleansers/transformers/analysers in the future
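> To illustrate the worker process idea using the stream API above, here is a
> rough sketch of a metadata-extraction step (the function name, the preview
> format and the reuse of SketchCSVSource are all hypothetical):
>
>     def process_uploaded_resource(path, preview_size=10):
>         # open the uploaded file as a source data stream
>         source = SketchCSVSource(path)
>         source.initialize()
>
>         # metadata discovery: field names come from the stream itself
>         metadata = {'fields': list(source.fields)}
>
>         # data preview: keep only the first few records
>         preview = []
>         for i, record in enumerate(source.records()):
>             if i >= preview_size:
>                 break
>             preview.append(record)
>
>         # storing metadata + preview back into ckan is left abstract here
>         return metadata, preview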
> EXAMPLE
> Copy from CSV to PostgreSQL, to MongoDB and to nice HTML:
>     source = brewery.ds.CSVDataSource('transactions.csv')
>     psql = brewery.ds.RelationalDataTarget(psql_connection, 'transactions')
>     mongo = brewery.ds.MongoDataTarget(mongo_connection, 'transactions')
>     html = brewery.ds.HTMLDataTarget('transactions.html')
>
>     for record in source.records():
>         transformed = do_something_if_anything_at_all(record)
>         psql.append(transformed)
>         mongo.append(transformed)
>         html.append(transformed)
> Audit records in a CSV:
>
>     source = brewery.ds.CSVDataSource('transactions.csv')
>
>     # one statistics probe per field
>     field_stats = {}
>     for field in source.fields:
>         field_stats[field.name] = brewery.dq.FieldStatistics(field)
>
>     # feed every value into the probe for its field
>     for record in source.records():
>         for field, value in record.items():
>             field_stats[field].probe(value)
>
>
> RELATED CKAN TICKETS
> The following tickets can benefit from the proposed framework:
> - Improvements to the dataproxy and the data API:
> http://knowledgeforge.net/ckan/trac/ticket/888
> - Resource format normalization and detection:
> http://knowledgeforge.net/ckan/trac/ticket/235
> - Dataset upload and archiving (master ticket):
> http://knowledgeforge.net/ckan/trac/ticket/852
> REFERENCES
> Swiss:
> https://bitbucket.org/okfn/swiss/overview
> Data Brewery:
> Streams: http://databrewery.org/doc/streams.html
> Stores: http://databrewery.org/doc/api/datastores.html
> CONCLUSION
> With the proposed framework we will get:
> - abstract structured data handling
> - adapter-based modular architecture for implementing data sources (readers)
> and targets (writers)
> - ability to handle large datasets (through streaming)
> - foundation for data transformations and analysis
>
> Feel free to forward this to other relevant groups if you find it
> appropriate.
> What do you think?
> Regards,
>
> Stefan Urbanek
> freelance consultant, analyst
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
>
>



-- 
skype: seb.bacon
mobile: 07790 939224
land: 0207 183 9618
web: http://baconconsulting.co.uk



