[ckan-dev] RFC: Data framework (Swiss + Brewery consolidation)
Stefan Urbanek
stefan.urbanek at gmail.com
Mon Jan 3 11:45:49 UTC 2011
Hi,
After a discussion last week with Rufus about Swiss and Data Brewery, I think it would be a good idea to join the two frameworks together to create a common data handling API.
Contents:
REASON AND OBJECTIVE
API
ADAPTERS
USAGE
EXAMPLE
RELATED CKAN TICKETS
REFERENCES
CONCLUSION
REASON AND OBJECTIVE
A unified, simple API for handling various kinds of structured data is missing.
Create a data handling framework which would be able to:
- read various structured data sources in a common, unified way
- provide various structured data outputs
- allow structured data filters/transformations to be applied
- preserve metadata information as much as possible
API
Data Streams: 'file IO'-like access to data sources/targets where, instead of bytes, data tuples plus metadata (a field list) are streamed. With data streams you read and process data record-by-record. Current Swiss situation: Swiss reads all data into "TabularData" form at once, which is not very practical for large datasets.
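To make the streaming idea concrete, here is a small usage sketch; it relies only on brewery.ds.CSVDataSource as used in the example later in this mail, and the per-record work is left as a comment:

import brewery.ds

# Process a potentially large file record-by-record instead of loading it all at once.
source = brewery.ds.CSVDataSource('large_dataset.csv')
source.initialize()                    # optional set-up: open the file, read the CSV header

field_names = [field.name for field in source.fields]   # metadata is available up front

count = 0
for row in source.rows():              # each row is a data tuple
    count += 1                         # per-record processing would go here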
Source Data Stream (a minimal sketch follows this list):
- initialize() - optional method for initializing the data stream. It is better to keep this out of the __init__ method for cases where we want to set up the whole data stream first and delay initialization of all sources until the stream is ready. The method should open a file, read field names (CSV header), ...
- rows() - iterator over data tuples
- fields - return list of fields in the stream
- records() - optional iterator over record dictionaries (field = value)
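For illustration, here is a minimal sketch of what a CSV-backed source stream could look like under this API; the Field and CSVSourceStream classes are assumptions made for this sketch, not existing Swiss or Brewery code:

import csv

class Field(object):
    """Field metadata - only a name here; type and storage info could be added later."""
    def __init__(self, name):
        self.name = name

class CSVSourceStream(object):
    def __init__(self, path):
        self.path = path
        self.fields = None

    def initialize(self):
        # Open the file and read field names from the CSV header.
        self.file = open(self.path)
        self.reader = csv.reader(self.file)
        header = next(self.reader)
        self.fields = [Field(name) for name in header]

    def rows(self):
        # Iterate over data tuples.
        for row in self.reader:
            yield tuple(row)

    def records(self):
        # Optional: iterate over {field name: value} dictionaries.
        names = [field.name for field in self.fields]
        for row in self.rows():
            yield dict(zip(names, row))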
Target Data Stream (a sketch follows this list as well):
- initialize() - optional method for initializing the data stream, see initialize() in the source stream
- append(object) - append an object to the data stream; the object should be either a list, tuple or dictionary
- fields - return list of fields that the target data stream expects (for example if the target is a database table), can be used for error prevention
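A matching sketch for the target side, reusing the Field class from the source sketch above (CSVTargetStream is again a made-up name):

import csv

class CSVTargetStream(object):
    def __init__(self, path, fields):
        self.path = path
        self.fields = fields          # list of Field objects the target expects

    def initialize(self):
        # Create the file and write a header row from the field names.
        self.file = open(self.path, 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow([field.name for field in self.fields])

    def append(self, obj):
        # Accept either a list/tuple of values or a {field name: value} dictionary.
        if isinstance(obj, dict):
            row = [obj.get(field.name) for field in self.fields]
        else:
            row = list(obj)
        self.writer.writerow(row)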
More on streams: http://databrewery.org/doc/streams.html
Data Stores - a data store is a repository of datasets, such as a relational database or a spreadsheet with multiple sheets (a usage sketch follows at the end of this section)
- dataset(name) - get dataset with given name/identifier
- dataset_names - list datasets (tables)
- destroy_dataset(name) - drop table, remove workbook, delete file, ... (does not need to be supported by all datastores)
- has_dataset(name) - check whether a dataset exists
- create_dataset(name, fields) - create new dataset, table, workbook, file,...
Data store functionality is provided by datastore adapters, such as: sqlalchemy for all relational databases, mongodb, ...
Dataset - database table, spreadsheet workbook, directory with yaml files, ... Datasets can be used as data stream sources or targets.
In addition to the stream source/target API, a dataset supports:
- read_fields(limit) - try to guess field metadata from the dataset - peek at the CSV header, go through mongodb records, ...
- truncate() - remove all records from the dataset (such as DELETE in a relational db)
More on datastores: http://databrewery.org/doc/api/datastores.html
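To show how the data store and dataset API could fit together with the streams, here is a usage sketch wrapped in a function; the function name and its arguments are hypothetical, and 'store' and 'source' stand for any data store and any initialized source stream implementing the proposed API:

def load_into_store(store, source, name='transactions'):
    """Copy records from a source data stream into a dataset in a data store."""
    if not store.has_dataset(name):
        store.create_dataset(name, fields=source.fields)

    dataset = store.dataset(name)
    # read_fields(limit) could be used here to guess metadata from existing data.
    dataset.truncate()                # remove any existing records first

    # A dataset also behaves as a target data stream, so records can be appended directly.
    for record in source.records():
        dataset.append(record)
    return dataset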
ADAPTERS
By sharing code from the Swiss framework and Brewery we can get the following data (streaming) adapters:
- relational databases (from brewery, supported by SQL alchemy), source+target
- mongodb (from brewery), source+target
- CSV (from swiss), source + target
- google docs (from swiss), source only
- HTML (from swiss), source + target
- json (from swiss), source + target
- xls (from swiss), source
USAGE
The data framework is required for:
- data preview
- metadata discovery
- worker process for resource uploading: "it should work in similar way how document uploading on scribd/slideshare works, from user's perspective. resource is queued and not only mirrored ( = archived), but also all necessary metadata (preview, fields, ...) is extracted and stored back to ckan". [pudo:] "you upload something and when you come back a day later, magic fun has happened" (a sketch of this metadata-extraction step follows this list)
- simplified process of quality auditing (probes within streams)
- ability to write abstract structured data cleansers/transformers/analysers in the future
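As a rough sketch of the preview/metadata-extraction step such a worker could perform, using only the stream API proposed above (the extract_metadata() function and the preview limit are hypothetical, and the resource is assumed to be a CSV file):

import brewery.ds

def extract_metadata(resource_path, preview_limit=20):
    """Read field metadata and a small record preview from an uploaded resource."""
    source = brewery.ds.CSVDataSource(resource_path)
    source.initialize()                   # open the file, read the CSV header

    preview = []
    for i, record in enumerate(source.records()):
        if i >= preview_limit:
            break
        preview.append(record)

    return {
        'fields': [field.name for field in source.fields],
        'preview': preview,
    }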
EXAMPLE
Copy from CSV to PostgreSQL, to MongoDB and to nice HTML:

import brewery.ds

source = brewery.ds.CSVDataSource('transactions.csv')
psql = brewery.ds.RelationalDataTarget(psql_connection, 'transactions')
mongo = brewery.ds.MongoDataTarget(mongo_connection, 'transactions')
html = brewery.ds.HTMLDataTarget('transactions.html')

for record in source.records():
    transformed = do_something_if_anything_at_all(record)
    psql.append(transformed)
    mongo.append(transformed)
    html.append(transformed)
Audit records in a CSV:

import brewery.ds
import brewery.dq

source = brewery.ds.CSVDataSource('transactions.csv')

field_stats = {}
for field in source.fields:
    field_stats[field.name] = brewery.dq.FieldStatistics(field)

for record in source.records():
    for field, value in record.items():
        stat = field_stats[field]
        stat.probe(value)
RELATED CKAN TICKETS
The following tickets could benefit from the proposed framework:
- Improvements to the dataproxy and the data API: http://knowledgeforge.net/ckan/trac/ticket/888
- Resource format normalization and detection: http://knowledgeforge.net/ckan/trac/ticket/235
- Dataset upload and archiving (master ticket): http://knowledgeforge.net/ckan/trac/ticket/852
REFERENCES
Swiss:
https://bitbucket.org/okfn/swiss/overview
Data Brewery:
Streams: http://databrewery.org/doc/streams.html
Stores: http://databrewery.org/doc/api/datastores.html
CONCLUSION
With the proposed framework we will get:
- abstract structured data handling
- adapter-based modular architecture for implementing data sources (readers) and targets (writers)
- ability to handle large datasets (through streaming)
- foundation for data transformations and analysis
Feel free to forward this to other relevant groups if you find it appropriate.
What do you think?
Regards,
Stefan Urbanek
freelance consultant, analyst