[okfn-labs] Python Iterator over table (csv) *columns*

Paul Walsh paulywalsh at gmail.com
Wed Dec 17 21:04:53 UTC 2014


@Stefan,

Thanks for your informative response.

To answer your first questions:

* Size can vary greatly, from, say, 100KB to 100MB.
* Data can be both numerical and categorical; immediate concerns lean towards numerical.
* Performing all potential validations will require multiple passes over the data.
* The operations are not particularly time-sensitive.

Your suggestion to work with data from an SQL backend is great, and I’m going to think it through in more detail. 

It would certainly help with scaling out validation flows, and some validation runs that I’m currently working out how to solve efficiently would become trivial with SQL (via the SQLAlchemy ORM).

On the other hand, I don’t want to introduce complex dependencies for small-scale use of the module(s). That may be a non-issue if I use SQLAlchemy anyway, since small-scale use could go via SQLite instead of Postgres or whatever.
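
Roughly what I have in mind (connection URLs and file names are placeholders; the rest of the code would not change between backends):

import sqlalchemy

# Small-scale use: SQLite, no database server dependency at all.
engine = sqlalchemy.create_engine("sqlite:///validation.db")

# Larger deployments: point the same code at Postgres instead.
# engine = sqlalchemy.create_engine("postgresql://user:password@host/dbname")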


@Tom and @Edgar,

I was consciously avoiding Pandas until now out of my own perception of it as a pretty sizeable dependency, when all I really want is its read_csv interface. But really, it has very nice features and would definitely reduce the code I need to write. Thanks for pointing out `usecols`.
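
For reference, the kind of per-column read I'm after (file and column names invented, as is the validation hook):

import pandas as pd

# Parse only the columns of interest; pandas skips the rest, which
# saves both parsing time and memory.
df = pd.read_csv("data.csv", usecols=["amount", "date"])

for name in df.columns:
    validate(df[name])  # hypothetical per-column validation hook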



On 17 Dec 2014, at 18:34, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:

> What sizes of CSV files are we talking about? What is the nature of the data – numerical or categorical? Is this a one-time operation or a repeated one? If repeated, is it time-sensitive or not?
> 
> I would start with column-oriented processing unless you really hit performance issues with it. Then perhaps move to batch-based processing of the file: read a few lines (you can mmap, as suggested by Tom) and do per-column operations on each batch. Still textual, no conversion (I assume that, because you mentioned validation).
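> 
> A rough sketch of that batch idea with the stdlib csv module (the batch size and the validation hook are made up):
> 
> import csv
> from itertools import islice
> 
> with open("data.csv", newline="") as f:
>     reader = csv.reader(f)
>     header = next(reader)
>     while True:
>         rows = list(islice(reader, 1000))  # one batch of raw text rows
>         if not rows:
>             break
>         # Flip the batch row-wise -> column-wise, then work per column.
>         for name, values in zip(header, zip(*rows)):
>             check_column(name, values)  # hypothetical validation hook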
> 
> For more sophisticated use I would recommend reaching for a data storage solution and letting it take care of the data access. This task should be delegated out of Python; use Python just to interact with the external tool and to glue the other processing pieces together.
> 
> I'm coming from the categorical-data and relational-database space. If your case is large CSV files that don't fit into memory, require validation and processing, and are mostly categorical in nature, then my easily accessible recommendation would be:
> 
> 1. Delegate: use a relational database, for example PostgreSQL.
> 2. Use "COPY FROM" from a file (if the server has access to it) or from stdin (if streaming over the network) – a very fast way of loading data into a database table (drop constraints and indexes if you are concerned about speed).
> 3. Perform all validation in Postgres, preferably by generating the queries in Python with SQLAlchemy. You can even have all validation done within a single SQL query, or one per column (not always necessary). A sketch of steps 2 and 3 follows below.
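> 
> Something like this (an untested sketch using psycopg2, with invented table and column names – "staging", "amount" – and assuming the staging table already exists with text columns):
> 
> import psycopg2
> 
> conn = psycopg2.connect("dbname=validation")
> with conn, conn.cursor() as cur:
>     # Step 2: stream the CSV through stdin into the staging table.
>     with open("data.csv") as f:
>         cur.copy_expert(
>             "COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", f)
>     # Step 3: one validation query per column, e.g. count non-numeric values.
>     cur.execute("SELECT count(*) FROM staging WHERE amount !~ '^-?[0-9]+$'")
>     bad_rows = cur.fetchone()[0]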
> 
> I assume you are validating data that you have already described with metadata (table schema validation). From that metadata it should not be difficult to generate a SQLAlchemy construct, run it against a SQL database, and rely on the database's performance for query execution and disk-memory IO.
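> 
> For example (a sketch against a recent SQLAlchemy API, with the same invented names; NULL checks stand in for whatever rules the metadata implies):
> 
> from sqlalchemy import MetaData, Table, create_engine, func, select
> 
> engine = create_engine("postgresql:///validation")
> staging = Table("staging", MetaData(), autoload_with=engine)
> 
> # One count-of-violations expression per column, generated from the
> # reflected schema rather than written by hand.
> checks = [func.count().filter(col.is_(None)).label(col.name)
>           for col in staging.columns]
> 
> with engine.connect() as conn:
>     null_counts = conn.execute(select(*checks)).one()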
> 
> Python is great, but there are tools that have been processing data for much longer... In that case, use Python just as the glue.
> 
> Cheers,
> 
> Stefan
> 
> 
> On Wed, Dec 17, 2014 at 10:56 AM, Tom Morris <tfmorris at gmail.com> wrote:
> On Wed, Dec 17, 2014 at 9:52 AM, Edgar Zanella Alvarenga <e at vaz.io> wrote:
> You can use read_csv from Pandas:
> 
> http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html
> 
> usecols : array-like
> 
>     Return a subset of the columns. Results in much faster parsing time and lower memory usage.
> 
> and pass the column names to the `usecols` argument. If you have a problem with the size of
> the CSV file, you can read it in chunks with:
> 
> pandas.read_csv(filepath, sep=DELIMITER, skiprows=INITIAL_LINES_TO_SKIP, chunksize=10000)
> 
> and change the value of INITIAL_LINES_TO_SKIP on each iteration.
> 
> If you add iterator=True to that, it will return an iterator instead of a DataFrame and you can dispense with the chunksize.  If it's not actually doing incremental reading/parsing (I haven't looked at the implementation), it should be straightforward to add it.
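> 
> Roughly this, if I remember the API right (get_chunk is the pull-based interface; file name and per-chunk processing are placeholders):
> 
> import pandas as pd
> 
> reader = pd.read_csv("data.csv", iterator=True)  # TextFileReader, no chunksize
> while True:
>     try:
>         chunk = reader.get_chunk(10000)  # pull rows on demand
>     except StopIteration:
>         break
>     process(chunk)  # hypothetical per-chunk validation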
> 
> There's no way you're going to get away without reading the whole file.  The best you can do is economize on parsing time and memory usage.
> 
> mmap is just a different (more efficient) way of reading the file.  It's still all going to get paged in as you access it.
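> 
> For completeness, a bare-bones (untested) way of feeding an mmap'd file to the csv module:
> 
> import csv
> import mmap
> 
> with open("data.csv", "rb") as f:
>     mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
>     # csv.reader accepts any iterable of strings; decode lines lazily
>     # so pages are only faulted in as you touch them.
>     for row in csv.reader(line.decode("utf-8")
>                           for line in iter(mm.readline, b"")):
>         pass  # validate row here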
> 
> Tom
> 