[okfn-labs] Python Iterator over table (csv) *columns*
Edgar Zanella Alvarenga
e at vaz.io
Thu Dec 18 01:01:41 UTC 2014
And if you really need performance, I suggest you look at PyTables
or use the pandas interface for HDF5 files. Some queries will be orders
of magnitude faster with it.
Edgar
On 17/12/2014 22:40, Tom Morris wrote:
> On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com [1]>
> wrote:
>
>> * Size can vary greatly, from say 100KB to 100MB.
>> * Data can be both numerical and categorical. Immediate concerns
>> lean towards numerical.
>> * To perform all potential validations, it will be necessary
>> to run over the data multiple times.
>> * The operations are not particularly time sensitive.
>
> I don't understand the reluctance to read such a minuscule data set
> into memory. Machines have gigabytes of memory these days. I
> assumed you were talking about large data such as the 130 GB
> *compressed* set of CSV files that I'm currently processing with
> Python (using Hadoop streaming).
>
> To give you a couple of Python data points to compare to the 4 minute
> SQL load time for a 200 MB file, I took snippets of two different CSV
> files to test:
>
> #1
> 206 MB uncompressed
> 4 million rows
> 5 columns - 1 integer, 4 quoted strings
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 10.9 seconds
> as above + iterate over 20M cells in memory and compute average
> size: 15.6 seconds
> Read text lines (no CSV parsing): 2.3 seconds
> Read lines & append to list: 3.0 seconds
>
> #2
> 198 MB
> 1.6 million rows
> 27 columns, mostly numeric
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 11.6 seconds
> as above + iterate over 41.6M cells in memory and compute average
> size: 19.9 seconds
> Read text lines: 1.1 seconds
> Read lines & append to list: 1.9 seconds
>
> All times are the average of three runs (but you can assume that the
> files were resident in the operating system's I/O buffer cache, so no
> physical I/O was done).
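
For anyone who wants to collect comparable numbers on their own files, here is a minimal sketch using only the standard library. It is not the script used for the figures above; the function name time_csv_steps and the dictionary keys are made up for illustration, and it assumes Python 3 and that the CSV text is already in memory, so no disk I/O is measured. Unlike the cumulative "as above +" figures, each step here is timed separately.

```python
import csv
import io
import time


def time_csv_steps(text):
    """Time three steps on CSV text already held in memory.

    Returns (timings, average_cell_size). Hypothetical helper for
    illustration only, not the benchmark script from this thread.
    """
    timings = {}

    # Step 1: parse every row and discard it.
    start = time.perf_counter()
    for _ in csv.reader(io.StringIO(text)):
        pass
    timings["parse"] = time.perf_counter() - start

    # Step 2: parse every row and keep it in a list.
    start = time.perf_counter()
    rows = []
    for row in csv.reader(io.StringIO(text)):
        rows.append(row)
    timings["parse_and_append"] = time.perf_counter() - start

    # Step 3: walk every cell in memory and compute the average length.
    start = time.perf_counter()
    total_chars = 0
    cell_count = 0
    for row in rows:
        for cell in row:
            total_chars += len(cell)
            cell_count += 1
    timings["scan_cells"] = time.perf_counter() - start

    average_size = total_chars / cell_count if cell_count else 0.0
    return timings, average_size
```

Call it with the contents of a file, e.g. time_csv_steps(open(path, newline="").read()), and average several runs as described above.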
>
> You could have your entire validation done before you even get the
> data loaded into a SQL database (not to mention the time you'd waste
> installing it first).
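
To make the in-memory approach concrete for the column case this thread started with, here is a rough sketch: load the rows once, regroup them by column, then run as many validation passes as needed. The names read_columns and non_numeric_cells are hypothetical, and the sketch assumes Python 3, a header row, and that every row has at least as many cells as the header.

```python
import csv
import io


def read_columns(text):
    """Parse CSV text and return (header, columns).

    columns maps each header name to the list of cell strings in that
    column. Assumes the first row is a header and that no data row is
    shorter than the header (a short row raises IndexError).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    columns = {
        name: [row[index] for row in rows]
        for index, name in enumerate(header)
    }
    return header, columns


def non_numeric_cells(column):
    """Return (row_index, value) pairs for cells float() cannot parse."""
    bad = []
    for index, value in enumerate(column):
        try:
            float(value)
        except ValueError:
            bad.append((index, value))
    return bad
```

Once the columns are in memory, each extra validation is just another loop over a list, so running over the data multiple times costs no additional parsing.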
>
> Tom
>
> Links:
> ------
> [1] mailto:paulywalsh at gmail.com