[okfn-labs] Python Iterator over table (csv) *columns*

Tom Morris tfmorris at gmail.com
Wed Dec 17 15:56:19 UTC 2014


On Wed, Dec 17, 2014 at 9:52 AM, Edgar Zanella Alvarenga <e at vaz.io> wrote:
>
> You can use read_csv from Pandas:
>
> http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html
>
> usecols : array-like
>
>     Return a subset of the columns. Results in much faster parsing time
> and lower memory usage.
>
> and pass the columns to the `usecols` argument. If you have a problem with
> the size of the csv file you can read it in chunks with:
>
> pandas.read_csv(filepath, sep=DELIMITER, skiprows=INITIAL_LINES_TO_SKIP,
>                 chunksize=10000)
>
> and change the value INITIAL_LINES_TO_SKIP in your iteration.


If you add iterator=True to that, it will return an iterator (a
TextFileReader) instead of a DataFrame, and you can dispense with the
chunksize by asking for rows on demand with get_chunk().  If it's not
actually doing incremental reading/parsing (I haven't looked at the
implementation), it should be straightforward to add it.
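
A rough sketch of what that could look like, combining usecols with
iterator=True.  The sample data, the helper name iter_column_chunks and the
chunk size are made up for illustration, and I'm assuming a pandas version
where get_chunk() raises StopIteration once the input is exhausted:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file.
SAMPLE = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n5,e,50\n"


def iter_column_chunks(source, column, rows_per_chunk):
    """Yield one pandas Series per chunk, holding only `column`."""
    # iterator=True returns a TextFileReader instead of a DataFrame;
    # usecols tells the parser to keep only the requested column.
    reader = pd.read_csv(source, usecols=[column], iterator=True)
    while True:
        try:
            chunk = reader.get_chunk(rows_per_chunk)
        except StopIteration:
            # Assumed end-of-input signal from get_chunk().
            break
        yield chunk[column]


chunks = list(iter_column_chunks(io.StringIO(SAMPLE), "score", 2))
total = sum(int(c.sum()) for c in chunks)
```

Only one chunk of one column is held in memory at a time, but every line of
the input is still read and tokenized.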

There's no way you're going to get away without reading the whole file:
CSV is stored row by row, so the values of any one column are spread across
every line.  The best you can do is economize on parsing time and memory
usage.
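
For comparison, here is a minimal column iterator using only the standard
library csv module.  The function name iter_column and the sample data are
hypothetical.  It still walks every row, but it keeps just one row in memory
at a time:

```python
import csv
import io


def iter_column(lines, column):
    """Yield the values of one named column, one row at a time."""
    reader = csv.reader(lines)
    header = next(reader)
    index = header.index(column)  # raises ValueError if the column is absent
    for row in reader:
        yield row[index]


# Hypothetical sample data; in practice `lines` would be an open file.
SAMPLE = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n"
scores = list(iter_column(io.StringIO(SAMPLE), "score"))
```

The values come back as strings; any type conversion is left to the caller.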

mmap is just a different (often more efficient) way of reading the file.
It's still all going to get paged in as you access it.
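
To make that concrete, a small sketch of pulling a column out of a
memory-mapped file.  The helper iter_column_mmap and the temporary file are
invented for the example, and the naive split on commas assumes there are no
quoted fields.  Every line is still touched, so every page of the file still
gets read:

```python
import mmap
import os
import tempfile


def iter_column_mmap(path, index):
    """Yield field `index` of every line (header included) via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b"" at end of file, which stops iter().
            for line in iter(mm.readline, b""):
                fields = line.rstrip(b"\r\n").split(b",")
                yield fields[index].decode("utf-8")


# Write a small, hypothetical CSV file to map.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name,score\n1,a,10\n2,b,20\n")
    tmp_path = tmp.name

values = list(iter_column_mmap(tmp_path, 2))
os.remove(tmp_path)
```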

Tom