[okfn-labs] Python Iterator over table (csv) *columns*

Stefan Urbanek stefan.urbanek at gmail.com
Wed Dec 17 22:35:44 UTC 2014


On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com> wrote:
>
> @Stefan,
>
> Thanks for your informative response.
>
> To answer your first questions:
>
> * Size can vary greatly, from say 100KB to 100MB.
>

That is not much. To give you a speed perspective, the following:

1. load 200MB of newline-delimited JSON into a table using COPY FROM
2. create the difference between this staging table and the target table in a
temporary table
3. "upsert" from the temporary and source tables into the target table, plus
extract some JSON columns into their respective first-class columns while
applying ~4 indexes on the target table

takes 4 minutes on an unconfigured Postgres 9.3 (not tuned, default options),
2.6 GHz CPU, single thread, from a network drive.
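
Roughly, as a sketch of that staging → upsert flow with psycopg2 (the table
names, columns and JSON fields here are made up for illustration, and I assume
a staging table with a single json column):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")     # hypothetical connection string
    cur = conn.cursor()

    # 1. bulk-load the raw newline-delimited JSON into a staging table;
    #    COPY is the fast path, no constraints or indexes on staging
    #    (assumes the JSON lines are free of tabs/backslashes, or escaped for COPY)
    with open("events.json") as f:
        cur.copy_expert("COPY staging (doc) FROM STDIN", f)

    # 2. keep only the rows that are not already in the target table
    cur.execute("""
        CREATE TEMP TABLE delta AS
        SELECT s.doc
        FROM staging s
        LEFT JOIN target t ON t.id = (s.doc ->> 'id')::int
        WHERE t.id IS NULL
    """)

    # 3. "upsert" the new rows, extracting JSON fields into first-class columns
    cur.execute("""
        INSERT INTO target (id, name, payload)
        SELECT (doc ->> 'id')::int, doc ->> 'name', doc
        FROM delta
    """)

    conn.commit()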


* Data can be both numerical and categorical. Immediate concerns lean
> towards numerical.
>

Then I would recommend going with SQL.


> * In order to perform all potential validations, it will be required to
> run over the data multiple times
>

Can you give me an example of the validations that you think would require
multiple passes?

Sometimes it is good to just focus on your problem, write the appropriate SQL
query, and let the query planner decide how many passes over the data are
necessary. The planner can do the work for you by hashing or materializing...
if it decides it needs multiple passes over a portion of the dataset or over
the whole of it. But sometimes you know better :-) and can give the planner a
hint...
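
For illustration only – several per-column checks folded into a single query,
letting the planner decide how to scan (the table and column names are
hypothetical):

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql:///mydb")    # hypothetical DSN

    # one pass over the table, three validations at once
    validation = text("""
        SELECT
            sum(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS amount_nulls,
            sum(CASE WHEN amount < 0 THEN 1 ELSE 0 END)     AS amount_negative,
            count(*) - count(DISTINCT record_id)            AS duplicate_ids
        FROM staging
    """)

    with engine.connect() as conn:
        print(conn.execute(validation).fetchone())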


> * The operations are not particularly time sensitive.
>
>
Four minutes for CSV → Postgres with in-Postgres JSON parsing (JSON data type)
for a 200MB UPSERT on an out-of-the-box installation sounds like a fair time
then, doesn't it?


> Your suggestion to work with data from an SQL backend is great, and I’m
> going to think it through in more detail.
>
>
Let me know if you have any questions.


> It would certainly help with scaling out validation flows, and some
> validation runs that I’m now thinking how to solve in an efficient manner
> would become trivial with SQL (via SQLAlchemy ORM).
>
>
No, please, no ORM. Use plain SQLAlchemy Core. The ORM is for applications,
not for data processing.
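
For instance – a minimal sketch with Core only (the table definition and
column names are invented): you describe tables and compose queries as
expressions, with no mapped classes and no session involved.

    from sqlalchemy import (MetaData, Table, Column, Integer, Numeric,
                            String, create_engine, select, func)

    engine = create_engine("postgresql:///mydb")    # hypothetical DSN
    metadata = MetaData()

    # plain table description -- no ORM classes
    records = Table("records", metadata,
        Column("id", Integer, primary_key=True),
        Column("amount", Numeric),
        Column("category", String),
    )

    # a validation expressed as a Core query: count rows with a missing amount
    query = select(func.count()).where(records.c.amount.is_(None))

    with engine.connect() as conn:
        missing = conn.execute(query).scalar()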


Cheers,

Stefan
--
I brew data.

Twitter: @Stiivi
Home: http://www.stiivi.com


> On the other hand, I don’t want to introduce complex dependencies for
> small-scale use of the module(s). This however may be a non-issue if using
> SQLAlchemy anyway, as small-scale use could be via SQLite, instead of
> Postgres or whatever.
>
>
> @Tom and @Edgar,
>
> I was consciously avoiding Pandas until now out of the perception (my own)
> of it being a pretty sizeable dependency, when all I really want is its
> read_csv interface. But really, it has very nice features and would
> definitely reduce the code I need to write. Thanks for pointing out
> `usecols`.
>
>
>
> On 17 Dec 2014, at 18:34, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
>
> What sizes of CSV files are we talking about? What is the nature of the
> data – numerical or categorical? Is this a one-time operation or a repeated
> one? If repeated, is it time sensitive or not?
>
> I would start with column-oriented processing unless you are really hitting
> performance issues with it. Then maybe move to batch-based processing of
> the file: read a few lines (you can mmap, as suggested by Tom) and do the
> per-column operation on each batch. Still textual, no conversion (I assume
> that, because you mentioned validation).
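
To make the batch idea above concrete, here is a minimal sketch using the
stdlib csv module, keeping everything as text (the file name and column index
are hypothetical):

    import csv
    from itertools import islice

    with open("data.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            batch = list(islice(reader, 10000))   # next batch of rows, still plain strings
            if not batch:
                break
            # per-column check on the batch, no type conversion
            amount_column = [row[2] for row in batch]            # column index 2 is hypothetical
            empty_values = sum(1 for value in amount_column if value == "")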
>
> For more sophisticated use I would recommend reaching for a data storage
> solution and making it take care of the data access. This is a task that
> should be delegated out of Python; use Python just to interact with the
> external tool and to glue the other processing pieces together.
>
> I'm coming from the categorical data space and relational databases. If your
> case is large CSV files that don't fit into memory, require validation
> and processing, and are mostly categorical in nature, then my easily
> accessible recommendation would be:
>
> 1. Delegate: use a relational database, for example PostgreSQL
> 2. Use "COPY FROM" from a file (if the server has access to it) or stdin (if
> streaming over the network) – a very fast way of loading data into a database
> table (don't have constraints or indexes if you are concerned about speed)
> 3. Perform all validation in Postgres, preferably by generating the queries in
> Python using SQLAlchemy. You can even have all validation done within a single
> SQL query, or have one query per column (not always necessary).
>
> I assume you are validating data that you have already described with
> metadata (table schema validation). From that metadata it might not be that
> difficult to generate a SQLAlchemy construct, throw it at a SQL database, and
> rely on the database's performance for query execution and disk-memory IO
> operations.
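
To sketch that metadata-driven idea (the schema dict, table and column names
below are invented, and I assume the data has already been loaded into a
staging table):

    from sqlalchemy import MetaData, Table, create_engine, select, func, case

    engine = create_engine("postgresql:///mydb")                 # hypothetical DSN
    table = Table("staging", MetaData(), autoload_with=engine)   # reflect the loaded table

    # hypothetical validation metadata: column name -> rules to check
    schema = {
        "amount":   {"required": True, "min": 0},
        "category": {"required": True},
    }

    # build one aggregate expression per rule, all evaluated in a single query
    checks = []
    for name, rules in schema.items():
        col = table.c[name]
        if rules.get("required"):
            checks.append(func.sum(case((col.is_(None), 1), else_=0)).label(name + "_nulls"))
        if "min" in rules:
            checks.append(func.sum(case((col < rules["min"], 1), else_=0)).label(name + "_below_min"))

    with engine.connect() as conn:
        print(conn.execute(select(*checks)).fetchone())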
>
> Python is great, but there are tools that have been processing data for
> much longer... In that case, use Python just as the glue.
>
> Cheers,
>
> Stefan
>
>
> On Wed, Dec 17, 2014 at 10:56 AM, Tom Morris <tfmorris at gmail.com> wrote:
>>
>> On Wed, Dec 17, 2014 at 9:52 AM, Edgar Zanella Alvarenga <e at vaz.io>
>> wrote:
>>>
>>> You can use read_csv from Pandas:
>>>
>>> http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html
>>>
>>> usecols : array-like
>>>
>>>     Return a subset of the columns. Results in much faster parsing time
>>> and lower memory usage.
>>>
>>> and pass the columns to the `usecols` argument. If you have a problem
>>> with the size of
>>> the csv file you can read it in chunks with:
>>>
>>> pandas.read_csv(filepath, sep=DELIMITER, skiprows=INITIAL_LINES_TO_SKIP,
>>> chunksize=10000)
>>>
>>> and change the value INITIAL_LINES_TO_SKIP in your iteration.
>>
>>
>> If you add iterator=True to that, it will return an iterator instead of a
>> DataFrame and you can dispense with the chunksize.  If it's not actually
>> doing incremental reading/parsing (I haven't looked at the implementation),
>> it should be straightforward to add it.
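
For what it's worth, a small sketch of that chunked, column-subset style (the
file name and column names are made up):

    import pandas as pd

    # stream the file in chunks; only the listed columns get parsed
    reader = pd.read_csv("data.csv", usecols=["amount", "category"], chunksize=10000)

    null_amounts = 0
    for chunk in reader:                                # each chunk is a DataFrame
        null_amounts += chunk["amount"].isnull().sum()  # per-column check on the batch

    print(null_amounts)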
>>
>> There's no way you're going to get away without reading the whole file.
>> The best you can do is economize on parsing time and memory usage.
>>
>> mmap is just a different (more efficient) way of reading the file.  It's
>> still all going to get paged in as you access it.
>>
>> Tom
>>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>
> --
> --
> http://stefan.agentfarms.net
>
> First they ignore you, then they laugh at you, then they fight you, then
> you win.
> - Mahatma Gandhi
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>


