[ckan-dev] RFC: Data framework (Swiss + Brewery consolidation)
Stefan Urbanek
stefan.urbanek at gmail.com
Mon Jan 3 11:45:49 UTC 2011
Hi,
After a discussion last week with Rufus about Swiss and Data Brewery, I think it would be a good idea to join the two frameworks together to create a common data handling API.
Contents:
REASON AND OBJECTIVE
API
ADAPTERS
USAGE
EXAMPLE
RELATED CKAN TICKETS
REFERENCES
CONCLUSION
REASON AND OBJECTIVE
A unified, simple API for handling various kinds of structured data is missing.
Create a data handling framework which would be able to:
- read various structured data sources in a common, unified way
- provide various structured data outputs
- allow structured data filters/transformations to be applied
- preserve metadata information as much as possible
API
Data Streams: 'file IO'-like access to data sources/targets where, instead of bytes, data tuples plus metadata (a field list) are streamed. With data streams you read and process data record-by-record. Current Swiss situation: Swiss reads all data into "TabularData" form at once, which is not very practical for large datasets.
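To make the streaming idea concrete, here is a small usage sketch; it relies only on brewery.ds.CSVDataSource as used in the example later in this mail, and the per-record work is left as a comment:

import brewery.ds

# Process a potentially large file record-by-record instead of loading it all at once.
source = brewery.ds.CSVDataSource('large_dataset.csv')
source.initialize()                    # optional set-up: open the file, read the CSV header

field_names = [field.name for field in source.fields]   # metadata is available up front

count = 0
for row in source.rows():              # each row is a data tuple
    count += 1                         # per-record processing would go here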
Source Data Stream (a minimal sketch follows this list):
- initialize() - optional method for initializing the data stream. It is better to keep this out of the __init__ method for cases where we want to set up the whole data stream first and delay initialization of all sources until the stream is ready. The method should open a file, read field names (CSV header), ...
- rows() - iterator over data tuples
- fields - return list of fields in the stream
- records() - optional iterator over record dictionaries (field = value)
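For illustration, here is a minimal sketch of what a CSV-backed source stream could look like under this API; the Field and CSVSourceStream classes are assumptions made for this sketch, not existing Swiss or Brewery code:

import csv

class Field(object):
    """Field metadata - only a name here; type and storage info could be added later."""
    def __init__(self, name):
        self.name = name

class CSVSourceStream(object):
    def __init__(self, path):
        self.path = path
        self.fields = None

    def initialize(self):
        # Open the file and read field names from the CSV header.
        self.file = open(self.path)
        self.reader = csv.reader(self.file)
        header = next(self.reader)
        self.fields = [Field(name) for name in header]

    def rows(self):
        # Iterate over data tuples.
        for row in self.reader:
            yield tuple(row)

    def records(self):
        # Optional: iterate over {field name: value} dictionaries.
        names = [field.name for field in self.fields]
        for row in self.rows():
            yield dict(zip(names, row))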
Target Data Stream (a sketch follows this list as well):
- initialize() - optional method for initializing the data stream, see initialize() in the source stream
- append(object) - append an object to the data stream; the object should be either a list, tuple or dictionary
- fields - return list of fields that the target data stream expects (for example if the target is a database table), can be used for error prevention
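A matching sketch for the target side, reusing the Field class from the source sketch above (CSVTargetStream is again a made-up name):

import csv

class CSVTargetStream(object):
    def __init__(self, path, fields):
        self.path = path
        self.fields = fields          # list of Field objects the target expects

    def initialize(self):
        # Create the file and write a header row from the field names.
        self.file = open(self.path, 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow([field.name for field in self.fields])

    def append(self, obj):
        # Accept either a list/tuple of values or a {field name: value} dictionary.
        if isinstance(obj, dict):
            row = [obj.get(field.name) for field in self.fields]
        else:
            row = list(obj)
        self.writer.writerow(row)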
More on streams: http://databrewery.org/doc/streams.html
Data Stores - a data store is a repository of datasets, such as a relational database or a spreadsheet with multiple sheets (a usage sketch follows at the end of this section)
- dataset(name) - get dataset with given name/identifier
- dataset_names - list datasets (tables)
- destroy_dataset(name) - drop table, remove workbook, delete file, ... (does not need to be supported by all datastores)
- has_dataset(name) - check whether a dataset exists
- create_dataset(name, fields) - create new dataset, table, workbook, file,...
Data store functionality is provided by datastore adapters, such as: sqlalchemy for all relational databases, mongodb, ...
Dataset - database table, spreadsheet workbook, directory with yaml files, ... Datasets can be used as data stream sources or targets.
In addition to the stream source/target API, a dataset supports:
- read_fields(limit) - try to guess field metadata from the dataset - peek at the CSV header, go through mongodb records, ...
- truncate() - remove all records from the dataset (such as DELETE in a relational db)
More on datastores: http://databrewery.org/doc/api/datastores.html
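To show how the data store and dataset API could fit together with the streams, here is a usage sketch wrapped in a function; the function name and its arguments are hypothetical, and 'store' and 'source' stand for any data store and any initialized source stream implementing the proposed API:

def load_into_store(store, source, name='transactions'):
    """Copy records from a source data stream into a dataset in a data store."""
    if not store.has_dataset(name):
        store.create_dataset(name, fields=source.fields)

    dataset = store.dataset(name)
    # read_fields(limit) could be used here to guess metadata from existing data.
    dataset.truncate()                # remove any existing records first

    # A dataset also behaves as a target data stream, so records can be appended directly.
    for record in source.records():
        dataset.append(record)
    return dataset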
ADAPTERS
By sharing code from the Swiss framework and Brewery we can get the following data (streaming) adapters:
- relational databases (from brewery, supported by SQL alchemy), source+target
- mongodb (from brewery), source+target
- CSV (from swiss), source + target
- google docs (from swiss), source only
- HTML (from swiss), source + target
- json (from swiss), source + target
- xls (from swiss), source
USAGE
The data framework is required for:
- data preview
- metadata discovery
- worker process for resource uploading: "it should work in similar way how document uploading on scribd/slideshare works, from user's perspective. resource is queued and not only mirrored ( = archived), but also all necessary metadata (preview, fields, ...) is extracted and stored back to ckan". [pudo:] "you upload something and when you come back a day later, magic fun has happened" (a sketch of this metadata-extraction step follows this list)
- simplified process of quality auditing (probes within streams)
- ability to write abstract structured data cleansers/transformers/analysers in the future
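As a rough sketch of the preview/metadata-extraction step such a worker could perform, using only the stream API proposed above (the extract_metadata() function and the preview limit are hypothetical, and the resource is assumed to be a CSV file):

import brewery.ds

def extract_metadata(resource_path, preview_limit=20):
    """Read field metadata and a small record preview from an uploaded resource."""
    source = brewery.ds.CSVDataSource(resource_path)
    source.initialize()                   # open the file, read the CSV header

    preview = []
    for i, record in enumerate(source.records()):
        if i >= preview_limit:
            break
        preview.append(record)

    return {
        'fields': [field.name for field in source.fields],
        'preview': preview,
    }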
EXAMPLE
Copy from CSV to PostgreSQL, to MongoDB and to nice HTML:

import brewery.ds

source = brewery.ds.CSVDataSource('transactions.csv')
psql = brewery.ds.RelationalDataTarget(psql_connection, 'transactions')
mongo = brewery.ds.MongoDataTarget(mongo_connection, 'transactions')
html = brewery.ds.HTMLDataTarget('transactions.html')

for record in source.records():
    transformed = do_something_if_anything_at_all(record)
    psql.append(transformed)
    mongo.append(transformed)
    html.append(transformed)
Audit records in a CSV:

import brewery.ds
import brewery.dq

source = brewery.ds.CSVDataSource('transactions.csv')

field_stats = {}
for field in source.fields:
    field_stats[field.name] = brewery.dq.FieldStatistics(field)

for record in source.records():
    for field, value in record.items():
        stat = field_stats[field]
        stat.probe(value)
RELATED CKAN TICKETS
The following tickets could benefit from the proposed framework:
- Improvements to the dataproxy and the data API: http://knowledgeforge.net/ckan/trac/ticket/888
- Resource format normalization and detection: http://knowledgeforge.net/ckan/trac/ticket/235
- Dataset upload and archiving (master ticket): http://knowledgeforge.net/ckan/trac/ticket/852
REFERENCES
Swiss:
https://bitbucket.org/okfn/swiss/overview
Data Brewery:
Streams: http://databrewery.org/doc/streams.html
Stores: http://databrewery.org/doc/api/datastores.html
CONCLUSION
With the proposed framework we will get:
- abstract structured data handling
- adapter-based modular architecture for implementing data sources (readers) and targets (writers)
- ability to handle large datasets (through streaming)
- foundation for data transformations and analysis
Feel free to forward this to other relevant groups if you find it appropriate.
What do you think?
Regards,
Stefan Urbanek
freelance consultant, analyst