[okfn-labs] Python Iterator over table (csv) *columns*

Stefan Urbanek stefan.urbanek at gmail.com
Thu Dec 18 06:34:53 UTC 2014


Stefan Urbanek
—
I brew data
Home: www.stiivi.com
Twitter: @Stiivi

> On 17 Dec 2014, at 19:40, Tom Morris <tfmorris at gmail.com> wrote:
> 
> On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com <mailto:paulywalsh at gmail.com>> wrote:
> 
> * Size can vary greatly, from say 100KB to 100MB. 
> * Data can be both numerical and categorical. Immediate concerns lean towards numerical.
> * In order to perform all potential validations, it will be required to run over the data multiple times
> * The operations are not particularly time sensitive.
> 
> I don't understand the reluctance to read such a minuscule data set into memory.  Machines have gigabytes of memory with these days.  I assumed you were talking about large data such as the 130 GB *compressed* set of CSV files that I'm currently processing with Python (using Hadoop streaming).
> 
> To give you a couple of Python data points to compare to the 4 minute SQL load time for a 200 MB file, I took snippets of two different CSV files to test:
> 

I apologise, I omitted couple facts: this was actually not a true CSV (just the command COPY treated it as a single-column CSV), 200mb compressed, 1GB uncompressed line-delimited JSON (one document per line) with ~1M lines. The steps in more detail (which I didn’t considered important before):

1. COPY dump FROM stdin
2. INSERT INTO … AS SELECT FROM dump OUTER LEFT JOIN other table
3. UPDATE … FROM dump INNER JOIN ...
4. INSERT INTO ... AS SELECT FROM dump INNER JOIN ...

While during 3. and 4. around 10 fields are extracted (postgresql JSON operator #>>) to a separate column and casted to column’s type and 2 fields are added using some date-time functions.

So I guess those python tests can’t be compared since they are just plain full scans with cheap aggregation of numerical data and counting over textual data. Not very expensive operations.

To compare my example to your example I would have consider only the "COPY FROM” portion which took up to 20 seconds, which is indeed longer than your CSV tests, since this was JSON which needed to be parsed into nested object structure, which is much more expensive than simple numeric array. I don’t think you can achieve this in Python.

So what my example shows is, that if there was validation (in my case it was other operation) which would involve other tables, it is still pretty doable even with the one of most expensive operations such as JOIN (3 times) and JSON parsing + extracting.

Not to mention that Postgres 9.3 has very slow JSON type – it is stored as string and it has to be parsed every time it is accessed. This loading would be much faster on 9.4 with JSONB type.

If you can beat this in Python on a common laptop (as the example above was executed on), I would love to see it.

s.


> #1
> 206 MB uncompressed
> 4 million rows
> 5 columns - 1 integer, 4 quoted strings
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 10.9 seconds
>   as above + iterate over 20M cells in memory and compute average size: 15.6 seconds
> Read text lines (no CSV parsing): 2.3 seconds
> Read lines & append to list: 3.0 seconds
> 
> #2
> 198 MB
> 1.6 million rows
> 27 columns, mostly numeric
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 11.6 seconds
>  as above + iterate over 41.6M cells in memory and compute average size: 19.9 seconds
> Read text lines: 1.1 second
> Read lines & append to list: 1.9 seconds
> 
> All times are the average of three runs (but you can assume that the files were resident in the operating system's I/O buffer cache so no physical I/O was done).
> 
> You could have you entire validation done before you even get the data loaded into a SQL database (not to mention the time you'd waste installing it first).
> 
> Tom
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20141218/3d7b3159/attachment-0004.html>


More information about the okfn-labs mailing list