[okfn-labs] Python Iterator over table (csv) *columns*

Stefan Urbanek stefan.urbanek at gmail.com
Wed Dec 17 16:44:37 UTC 2014


Correction: "I would start with column oriented" should read "I would start
with row oriented approach"

p.s.: As far as I know, Python does not have a lightweight out-of-core
("does not fit into memory") processing yet. Blaze is trying to achieve
that, but I would not call it lightweight. And in that case there is no
difference between delegating to a mature external tool and using a python
library with huge bandwagon of dependencies.

On Wed, Dec 17, 2014 at 11:34 AM, Stefan Urbanek <stefan.urbanek at gmail.com>
wrote:
>
> What sizes of CSV files are we talking about? What is the nature of the
> data – numerical or categorical? Is this an one-time operation or repeated
> operation? If repeated, is it time sensitive or not?
>
> I would start with column oriented unless you are really hitting
> performance issues with it. The may move into a batch based processing of
> the file: read few lines (you can mmap as suggested by tom) and do
> per-column operation on the batch. Still textual, no conversion (I assume
> it because you mentioned validation).
>
> For more sophisticated use I would recommend to reach for a data storage
> solution and make it take care of the data access. This is a task that
> should be delegated out of the python and use python just to interact with
> the external tool and glue other processing pieces together.
>
> I'm coming from categorical data space and relational databases. If your
> case is large CSV files that don't fit into a memory, require validation
> and processing and are mostly categorical of nature, then my easy
> accessible recommendation would be:
>
> 1. Delegate: use a relational database, for example PostgreSQL
> 2. Use "COPY FROM" from a file (if server has access to it) or stdin (if
> streaming over the network) – very fast way of loading data into a database
> table (don't have constraints nor indexes if you are concerned about speed)
> 3. perform all validation in Postgres, preferably by generating queries in
> Python using SQL alchemy. You can even have validation done within single
> SQL query or have one per-column (not always necessary).
>
> I assume you are validating data that you already have have described with
> metadata (table schema validation). From that metadata it might not be that
> difficult to generate a SQLAlchemy construct to throw on a SQL database and
> rely on the database's performance for query execution and disk-memory IO
> operations.
>
> Python is great, but there are tools that have been processing data for
> way much longer time... In that case, use python just as a glue.
>
> Cheers,
>
> Stefan
>
>
> On Wed, Dec 17, 2014 at 10:56 AM, Tom Morris <tfmorris at gmail.com> wrote:
>
>> On Wed, Dec 17, 2014 at 9:52 AM, Edgar Zanella Alvarenga <e at vaz.io>
>> wrote:
>>>
>>> You can use read_csv from Pandas:
>>>
>>> http://pandas.pydata.org/pandas-docs/version/0.13.1/
>>> generated/pandas.io.parsers.read_csv.html
>>>
>>> usecols : array-like
>>>
>>>     Return a subset of the columns. Results in much faster parsing time
>>> and lower memory usage.
>>>
>>> and pass the columns to the `usecols` argument. If you have a problem
>>> with the size of
>>> the csv file you can read it in chunks with:
>>>
>>> pandas.read_csv(filepath, sep = DELIMITER,skiprows =
>>> INITIAL_LINES_TO_SKIP, chunksize = 10000)
>>>
>>> and change the value INITIAL_LINES_TO_SKIP in your iteration.
>>
>>
>> If you add iterator=True to that, it will return an iterator instead of a
>> DataFrame and you can dispense with the chunksize.  If it's not actually
>> doing incremental reading/parsing (I haven't looked at the implementation),
>> it should be straightforward to add it.
>>
>> There's no way you're going to get away without reading the whole file.
>> The best you can do is economize on parsing time and memory usage.
>>
>> mmap is just a different (more efficient) way of reading the file.  It's
>> still all going to get paged in as you access it.
>>
>> Tom
>>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>
> --
> --
> http://stefan.agentfarms.net
>
> First they ignore you, then they laugh at you, then they fight you, then
> you win.
> - Mahatma Gandhi
>


-- 
--
I brew data.

Twitter: @Stiivi
Home: http://www.stiivi.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20141217/0e6d41c4/attachment-0004.html>


More information about the okfn-labs mailing list