[okfn-labs] Python Iterator over table (csv) *columns*

Edgar Zanella Alvarenga e at vaz.io
Thu Dec 18 01:01:41 UTC 2014


And if you really need performance, I suggest you look at PyTables
or use the Pandas interface for HDF5 files. Some queries will be orders of
magnitude faster with it.

Edgar

On 17/12/2014 22:40, Tom Morris wrote:
> On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com [1]>
> wrote:
>
>> * Size can vary greatly, from say 100KB to 100MB. 
>> * Data can be both numerical and categorical. Immediate concerns
>> lean towards numerical.
>> * In order to perform all potential validations, it will be required
>> to run over the data multiple times
>> * The operations are not particularly time sensitive.
>
> I don't understand the reluctance to read such a minuscule data set
> into memory.  Machines have gigabytes of memory these days.  I
> assumed you were talking about large data such as the 130 GB
> *compressed* set of CSV files that I'm currently processing with
> Python (using Hadoop streaming).
>
> To give you a couple of Python data points to compare to the 4 minute
> SQL load time for a 200 MB file, I took snippets of two different CSV
> files to test:
>
> #1
> 206 MB uncompressed
> 4 million rows
> 5 columns - 1 integer, 4 quoted strings
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 10.9 seconds
>   as above + iterate over 20M cells in memory and compute average
> size: 15.6 seconds
> Read text lines (no CSV parsing): 2.3 seconds
> Read lines & append to list: 3.0 seconds
>
> #2
> 198 MB
> 1.6 million rows
> 27 columns, mostly numeric
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 11.6 seconds
>  as above + iterate over 41.6M cells in memory and compute average
> size: 19.9 seconds
> Read text lines: 1.1 seconds
> Read lines & append to list: 1.9 seconds
>
> All times are the average of three runs (but you can assume that the
> files were resident in the operating system's I/O buffer cache, so no
> physical I/O was done).
>
> You could have your entire validation done before you even get the
> data loaded into a SQL database (not to mention the time you'd waste
> installing it first).
>
> Tom
>
> Links:
> ------
> [1] mailto:paulywalsh at gmail.com
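
For anyone wanting to repeat the kind of measurements Tom lists above
(parse CSV, append rows to a list, then iterate over every cell and
compute the average size), here is a minimal sketch using only the
standard library. It is not Tom's actual script: the function names
(read_rows, average_cell_size, iter_columns, timed) and the tiny inline
sample are assumptions made for illustration, and timings will differ by
machine and file.

```python
import csv
import io
import time


def read_rows(source):
    """Parse CSV from a text file-like object and keep every row in a list."""
    return [row for row in csv.reader(source)]


def average_cell_size(rows):
    """Walk every cell of the in-memory table and return the mean length."""
    total_chars = 0
    total_cells = 0
    for row in rows:
        for cell in row:
            total_chars += len(cell)
            total_cells += 1
    return total_chars / total_cells if total_cells else 0.0


def iter_columns(rows):
    """Iterate over columns of in-memory rows.

    zip() stops at the shortest row, so ragged rows are truncated.
    """
    return zip(*rows)


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical sample data; replace with open("your.csv", newline="").
    sample = io.StringIO('id,name,city\n1,"Ann","Oslo"\n2,"Bob","Rome"\n')
    rows, parse_seconds = timed(read_rows, sample)
    avg, scan_seconds = timed(average_cell_size, rows)
    print("rows:", len(rows), "parse s:", parse_seconds)
    print("average cell size:", avg, "scan s:", scan_seconds)
    for column in iter_columns(rows):
        print(column)
```

Once the rows are in a list, running several validation passes over the
columns costs only the in-memory iteration, which is the point of
reading a file of this size into memory first.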



