[okfn-labs] Python Iterator over table (csv) *columns*
stefan.urbanek at gmail.com
Wed Dec 17 16:34:15 UTC 2014
What sizes of CSV files are we talking about? What is the nature of the
data – numerical or categorical? Is this a one-time operation or a repeated
one? If repeated, is it time-sensitive or not?
I would start with column-oriented processing unless you are really hitting
performance issues with it. Then move to batch-based processing of the
file: read a few lines (you can mmap, as suggested by Tom) and do the
per-column operations on each batch. Keep it textual, with no type
conversion (I assume none is needed, since you mentioned validation).
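A minimal stdlib sketch of that batch-based, per-column idea (the file contents and the empty-value check here are invented for illustration):

```python
import csv
import io
from itertools import islice

# stand-in for a real file opened with open(path, newline="")
f = io.StringIO("id,name\n1,alice\n2,\n3,carol\n")
reader = csv.reader(f)
header = next(reader)

errors = []
while True:
    batch = list(islice(reader, 1000))  # read a few rows at a time
    if not batch:
        break
    # transpose the batch so each header maps to a tuple of values
    columns = dict(zip(header, zip(*batch)))
    for name, values in columns.items():
        # purely textual validation, no type conversion
        errors.extend(name for v in values if v == "")
print(errors)  # ['name']
```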
For more sophisticated use I would recommend reaching for a data storage
solution and letting it take care of the data access. This is a task that
should be delegated out of Python; use Python just to interact with the
external tool and to glue the other processing pieces together.
I'm coming from the categorical data space and relational databases. If your
case is large CSV files that don't fit into memory, require validation
and processing, and are mostly categorical in nature, then my easily
accessible recommendation would be:
1. Delegate: use a relational database, for example PostgreSQL
2. Use "COPY FROM" from a file (if the server has access to it) or stdin (if
streaming over the network) – a very fast way of loading data into a database
table (skip constraints and indexes if you are concerned about speed)
3. Perform all validation in Postgres, preferably by generating queries in
Python using SQLAlchemy. You can even have all validation done within a
single SQL query, or have one per column (not always necessary).
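Step 2 might be sketched with psycopg2 roughly as follows (the table name, columns, and connection are hypothetical; `copy_expert` streams the file to the server over stdin):

```python
# hypothetical target table; adjust the name and columns to your schema
COPY_SQL = "COPY events (id, category, amount) FROM STDIN WITH (FORMAT csv, HEADER true)"

def load_csv(conn, path):
    """Bulk-load a CSV file into PostgreSQL via COPY.

    Assumes `conn` is an open psycopg2 connection.
    """
    with open(path, newline="") as f, conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, f)
    conn.commit()
```

Dropping indexes and constraints before the load and re-adding them afterwards keeps COPY at full speed.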
I assume you are validating data that you have already described with
metadata (table schema validation). From that metadata it should not be that
difficult to generate a SQLAlchemy construct, run it against a SQL database,
and rely on the database's performance for query execution and disk-memory I/O.
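A sketch of generating such a validation query from metadata with SQLAlchemy (the table schema is invented, and in-memory SQLite stands in for PostgreSQL; COPY FROM is replaced by plain inserts here):

```python
import sqlalchemy as sa

metadata = sa.MetaData()
events = sa.Table(  # hypothetical schema derived from the CSV's metadata
    "events", metadata,
    sa.Column("id", sa.Integer),
    sa.Column("category", sa.String),
    sa.Column("amount", sa.Integer),
)

engine = sa.create_engine("sqlite://")  # stand-in for PostgreSQL
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(events.insert(), [
        {"id": 1, "category": "a", "amount": 10},
        {"id": 2, "category": None, "amount": None},
    ])

# one aggregate query counting NULLs in every column at once,
# generated entirely from the table metadata
query = sa.select(*(
    sa.func.sum(sa.case((col.is_(None), 1), else_=0)).label(f"{col.name}_nulls")
    for col in events.columns
))
with engine.connect() as conn:
    nulls = dict(conn.execute(query).mappings().one())
print(nulls)  # {'id_nulls': 0, 'category_nulls': 1, 'amount_nulls': 1}
```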
Python is great, but there are tools that have been processing data for far
longer... In that case, use Python just as the glue.
On Wed, Dec 17, 2014 at 10:56 AM, Tom Morris <tfmorris at gmail.com> wrote:
> On Wed, Dec 17, 2014 at 9:52 AM, Edgar Zanella Alvarenga <e at vaz.io> wrote:
>> You can use read_csv from Pandas:
>> usecols : array-like
>> Return a subset of the columns. Results in much faster parsing time
>> and lower memory usage.
>> and pass the columns to the `usecols` argument. If you have a problem
>> with the size of
>> the csv file you can read it in chunks with:
>> pandas.read_csv(filepath, sep=DELIMITER, skiprows=INITIAL_LINES_TO_SKIP,
>>                 chunksize=10000)
>> and change the value INITIAL_LINES_TO_SKIP in your iteration.
> If you add iterator=True to that, it will return an iterator instead of a
> DataFrame and you can dispense with the chunksize. If it's not actually
> doing incremental reading/parsing (I haven't looked at the implementation),
> it should be straightforward to add it.
> There's no way you're going to get away without reading the whole file.
> The best you can do is economize on parsing time and memory usage.
> mmap is just a different (more efficient) way of reading the file. It's
> still all going to get paged in as you access it.
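The chunked `read_csv` pattern quoted above might look like this in practice (the in-memory CSV and the negative-amount check are invented for illustration):

```python
import io
import pandas as pd

# stand-in for a large on-disk file
csv_data = io.StringIO("id,amount\n1,10\n2,-3\n3,7\n")

bad = 0
# chunksize makes read_csv yield small DataFrames instead of one big frame
for chunk in pd.read_csv(csv_data, usecols=["amount"], chunksize=2):
    bad += int((chunk["amount"] < 0).sum())
print(bad)  # 1
```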
First they ignore you, then they laugh at you, then they fight you, then
you win.
- Mahatma Gandhi