[okfn-labs] Python Iterator over table (csv) *columns*

Paul Walsh paulywalsh at gmail.com
Thu Dec 18 04:49:09 UTC 2014


There will be a web service that could be handling any number of validation requests, as part of a larger application. Obviously jobs will run on a queue outside of the request cycle, but I have to consider memory use and workload in such a scenario.

> On 18 Dec 2014, at 2:40, Tom Morris <tfmorris at gmail.com> wrote:
> 
>> On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com> wrote:
>> 
>> * Size can vary greatly, from say 100KB to 100MB. 
>> * Data can be both numerical and categorical. Immediate concerns lean towards numerical.
>> * In order to perform all potential validations, it will be required to run over the data multiple times
>> * The operations are not particularly time sensitive.
> 
> I don't understand the reluctance to read such a minuscule data set into memory.  Machines have gigabytes of memory with these days.  I assumed you were talking about large data such as the 130 GB *compressed* set of CSV files that I'm currently processing with Python (using Hadoop streaming).
> 
> To give you a couple of Python data points to compare to the 4 minute SQL load time for a 200 MB file, I took snippets of two different CSV files to test:
> 
> #1
> 206 MB uncompressed
> 4 million rows
> 5 columns - 1 integer, 4 quoted strings
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 10.9 seconds
>   as above + iterate over 20M cells in memory and compute average size: 15.6 seconds
> Read text lines (no CSV parsing): 2.3 seconds
> Read lines & append to list: 3.0 seconds
> 
> #2
> 198 MB
> 1.6 million rows
> 27 columns, mostly numeric
> Read & parse CSV: 4.8 seconds
> Parse CSV and append rows to list: 11.6 seconds
>  as above + iterate over 41.6M cells in memory and compute average size: 19.9 seconds
> Read text lines: 1.1 second
> Read lines & append to list: 1.9 seconds
> 
> All times are the average of three runs (but you can assume that the files were resident in the operating system's I/O buffer cache so no physical I/O was done).
> 
> You could have you entire validation done before you even get the data loaded into a SQL database (not to mention the time you'd waste installing it first).
> 
> Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20141218/84ef10cd/attachment-0004.html>


More information about the okfn-labs mailing list