[okfn-labs] Python Iterator over table (csv) *columns*

Tom Morris tfmorris at gmail.com
Thu Dec 18 00:40:41 UTC 2014


On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com> wrote:
>
>
> * Size can vary greatly, from say 100KB to 100MB.
> * Data can be both numerical and categorical. Immediate concerns lean
> towards numerical.
> * In order to perform all potential validations, it will be required to
> run over the data multiple times
> * The operations are not particularly time sensitive.
>

I don't understand the reluctance to read such a minuscule data set into
memory.  Machines have gigabytes of memory these days.  I assumed you
were talking about large data such as the 130 GB *compressed* set of CSV
files that I'm currently processing with Python (using Hadoop streaming).
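
For a file of this size, one way to get at the columns is to read every row
into a list and transpose it. This is only a minimal sketch, assuming Python 3,
a header row, and rows of equal length; the name iter_columns is made up for
illustration and is not from any library.

```python
import csv
import io


def iter_columns(csv_file):
    """Read all rows into memory, then yield (header, values) per column.

    Hypothetical helper: assumes the first row is a header and that all
    rows have the same number of cells (zip() silently truncates to the
    shortest row otherwise).
    """
    rows = list(csv.reader(csv_file))
    if not rows:
        return
    header, body = rows[0], rows[1:]
    # zip(*body) transposes the list of rows into one tuple per column.
    for name, values in zip(header, zip(*body)):
        yield name, values


# Small in-memory sample; for a real file, pass open(path, newline="").
sample = io.StringIO("id,name\n1,a\n2,b\n")
columns = dict(iter_columns(sample))
```

Because the rows stay in the list, you can run over them as many times as
the validations need without touching the file again.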

To give you a couple of Python data points to compare with the 4-minute SQL
load time for a 200 MB file, I took snippets of two different CSV files to
test:

#1
206 MB uncompressed
4 million rows
5 columns - 1 integer, 4 quoted strings
Read & parse CSV: 4.8 seconds
Parse CSV and append rows to list: 10.9 seconds
  as above + iterate over 20M cells in memory and compute average size: 15.6 seconds
Read text lines (no CSV parsing): 2.3 seconds
Read lines & append to list: 3.0 seconds

#2
198 MB
1.6 million rows
27 columns, mostly numeric
Read & parse CSV: 4.8 seconds
Parse CSV and append rows to list: 11.6 seconds
  as above + iterate over 41.6M cells in memory and compute average size: 19.9 seconds
Read text lines: 1.1 seconds
Read lines & append to list: 1.9 seconds

All times are the average of three runs (but you can assume that the files
were resident in the operating system's I/O buffer cache so no physical I/O
was done).
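
Roughly how measurements like the ones above can be taken is sketched below.
This is not the script I ran, just an illustration assuming Python 3; the
function names are made up, and "average size" is taken here to mean the mean
length of the cell strings, which is an assumption.

```python
import csv
import time


def time_it(func, runs=3):
    """Return the mean wall-clock time of func() over `runs` calls."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def parse_only(lines):
    """Parse every row and throw it away; return the number of rows."""
    count = 0
    for _ in csv.reader(lines):
        count += 1
    return count


def parse_and_store(lines):
    """Parse every row and keep all of them in a list."""
    return list(csv.reader(lines))


def average_cell_size(rows):
    """Mean length of all cell strings (assumed meaning of 'average size')."""
    total = count = 0
    for row in rows:
        for cell in row:
            total += len(cell)
            count += 1
    return total / count if count else 0.0


# Tiny in-memory sample; csv.reader accepts any iterable of lines, so for
# a real file you would pass open(path, newline="") instead.
sample_lines = ['1,"alpha"\n', '2,"be"\n']
rows = parse_and_store(sample_lines)
mean_parse = time_it(lambda: parse_only(sample_lines))
```

When timing a real file, open it inside the function being timed so that each
of the three runs reads it from the start.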

You could have your entire validation done before you even get the data
loaded into a SQL database (not to mention the time you'd waste installing
it first).

Tom