[okfn-labs] Python Iterator over table (csv) *columns*

Stefan Urbanek stefan.urbanek at gmail.com
Thu Dec 18 06:52:56 UTC 2014


> On 17 Dec 2014, at 23:49, Paul Walsh <paulywalsh at gmail.com> wrote:
> 
> There will be a web service that could be handling any number of validation requests, as part of a larger application. Obviously jobs will run on a queue outside of the request cycle, but I have to consider memory use and workload in such a scenario.
> 

If you consider going with the SQL option I would suggest to create a simple meta-data based management of tables. Since the jobs will run on a queue, you have full control of it then you don’t have to be worried about using temporary tables with "drop on commit” turned on and wrap it in one transaction per job. Do validation in the temporaries within the transaction, if you don’t need to keep the data. This also allows you to do more rich validation by keeping more data in other tables within the same database and just use relational features for certain to incorporate referential data/master data in the validation process if you need them. Relational databases are really good in set operations...

Don’t waste time on memory management if you don’t have to. It is good to be aware of it, but not too early. From your file sizes it does not look like you should be worried right now about it. Also, there are people who solved the memory problem for you already, therefore focus on your problem: validation.

Just my humble opinion,

Stefan


> On 18 Dec 2014, at 2:40, Tom Morris <tfmorris at gmail.com <mailto:tfmorris at gmail.com>> wrote:
> 
>> On Wed, Dec 17, 2014 at 4:04 PM, Paul Walsh <paulywalsh at gmail.com <mailto:paulywalsh at gmail.com>> wrote:
>> 
>> * Size can vary greatly, from say 100KB to 100MB. 
>> * Data can be both numerical and categorical. Immediate concerns lean towards numerical.
>> * In order to perform all potential validations, it will be required to run over the data multiple times
>> * The operations are not particularly time sensitive.
>> 
>> I don't understand the reluctance to read such a minuscule data set into memory.  Machines have gigabytes of memory with these days.  I assumed you were talking about large data such as the 130 GB *compressed* set of CSV files that I'm currently processing with Python (using Hadoop streaming).
>> 
>> To give you a couple of Python data points to compare to the 4 minute SQL load time for a 200 MB file, I took snippets of two different CSV files to test:
>> 
>> #1
>> 206 MB uncompressed
>> 4 million rows
>> 5 columns - 1 integer, 4 quoted strings
>> Read & parse CSV: 4.8 seconds
>> Parse CSV and append rows to list: 10.9 seconds
>>   as above + iterate over 20M cells in memory and compute average size: 15.6 seconds
>> Read text lines (no CSV parsing): 2.3 seconds
>> Read lines & append to list: 3.0 seconds
>> 
>> #2
>> 198 MB
>> 1.6 million rows
>> 27 columns, mostly numeric
>> Read & parse CSV: 4.8 seconds
>> Parse CSV and append rows to list: 11.6 seconds
>>  as above + iterate over 41.6M cells in memory and compute average size: 19.9 seconds
>> Read text lines: 1.1 second
>> Read lines & append to list: 1.9 seconds
>> 
>> All times are the average of three runs (but you can assume that the files were resident in the operating system's I/O buffer cache so no physical I/O was done).
>> 
>> You could have you entire validation done before you even get the data loaded into a SQL database (not to mention the time you'd waste installing it first).
>> 
>> Tom
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20141218/9db45d18/attachment-0004.html>


More information about the okfn-labs mailing list