[School-of-data] Pandas and R, was: Introduction -- Simon Cropper

Tom Roche Tom_Roche at pobox.com
Thu Jun 19 15:53:34 UTC 2014

Simon Cropper Thu, Jun 19, 2014 at 02:11:06PM +1000
>> I have the bulk of this information at my fingertips at the present
>> having done a comprehensive review of the resources available[, including
>> integrating] R with Pandas, and also how to do stuff in Pandas the same way as R.

Michael Bauer Thu, 19 Jun 2014 10:51:03 +0200
> I'd [publish] this on the School of Data blog[, targeting]
> people who know a bit how to program and want to get into data analysis more.

+1, also targeting those who want a more python-centric workflow. Why open environmental scientists (and journalists) might also want this:

For much of spatiotemporal environmental modeling, especially at big-data/global scale, the netCDF[1] data format rules. For it, the netCDF community has developed a DSL called NCL[2], which is highly usable for basic data manipulation (and, IMHO, more usable than its netCDF-targeting competitors, e.g., [4][5][6]) ... but which, being domain-specific, has nowhere near the analytic and graphical power of R. To integrate workflows, however, one needs a tool with systems capabilities that both NCL and R essentially lack. Unfortunately, when I started doing this:

- Enthought's licensing of NumPy/SciPy was inadequate for the administrators of the systems on which I run (and from which it's painful to move GB/TB-scale data in/out)

- R/Python interfaces were considered painful by the folks from whom I was learning how to do this

Fortunately, bash will integrate NCL and R without undue pain and is "available everywhere," but it lacks the test and documentation hooks that this modern coder wants. Hence I'd prefer to have python run the workflow, driving NCL and R only as needed.
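A minimal sketch of what "python runs the workflow, driving NCL and R only as needed" could look like, using only the standard library's subprocess module. It assumes the `ncl` and `Rscript` executables are on the PATH; the helper names (run_step, ncl_command, rscript_command, run_workflow) and the script names in the comments are hypothetical, chosen for illustration only.

```python
import subprocess
import sys


def run_step(cmd, cwd=None):
    """Run one external workflow step and return its stdout.

    check=True makes a non-zero exit status raise CalledProcessError,
    so a failing step stops the workflow instead of passing silently.
    """
    result = subprocess.run(
        cmd, cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def ncl_command(script, **variables):
    """Build an NCL command line.

    String variables are passed as name="value" arguments ahead of the
    script name, the same way one would pass them from a bash driver.
    """
    assignments = [
        '{}="{}"'.format(name, value)
        for name, value in sorted(variables.items())
    ]
    return ["ncl"] + assignments + [script]


def rscript_command(script, *args):
    """Build an Rscript command line with positional arguments."""
    return ["Rscript", script] + [str(arg) for arg in args]


def run_workflow(steps):
    """Run the given commands in order, collecting each step's stdout."""
    return [run_step(cmd) for cmd in steps]


# Hypothetical usage (regrid.ncl and plot.R are placeholder names):
#   run_workflow([
#       ncl_command("regrid.ncl", infile="in.nc", outfile="out.nc"),
#       rscript_command("plot.R", "out.nc"),
#   ])
```

Because each step is an ordinary function call, the command builders can be unit-tested and documented without NCL or R installed, which is the part bash makes awkward.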

Scientific python distributions are now ~= free and available on the systems I use. Both NCL and R have python bindings, so I could ("when I have time" :-( ) just start driving with python, but rpy/rpy2[3] in particular still looks a bit painful. Given that most of what I do with R seems "vanilla" (basic statistics, lattice graphics), I'm hoping one could do "that sort of thing" in Pandas, and more easily and with better performance than one could by driving R from python. Hence I'm definitely interested in "how to do stuff in Pandas the same way as R," and of course "how to do stuff in Pandas even more easily" :-)
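For a flavor of the "vanilla" basic statistics meant here, a small sketch of some Pandas calls with their rough R counterparts noted in comments. The data frame is made up purely for illustration, and the R equivalences are approximate (for example, describe() reports count and standard deviation, which R's summary() does not).

```python
import pandas as pd

# Made-up illustrative data, not taken from any real dataset.
df = pd.DataFrame({
    "site": ["a", "a", "b", "b", "b"],
    "ozone": [30.0, 34.0, 41.0, 39.0, 43.0],
})

# Roughly R's summary(df$ozone): count, mean, std, min, quartiles, max.
overall = df["ozone"].describe()

# Roughly R's tapply(df$ozone, df$site, mean).
by_site = df.groupby("site")["ozone"].mean()

# Roughly R's table(df$site).
counts = df["site"].value_counts()

print(overall)
print(by_site)
print(counts)
```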

The other open question for me in this space regards python/NCL/netCDF, i.e., whether to

* manipulate netCDF from python directly (with, e.g., module=netCDF4[4]). Unfortunately this seems to suffer the same netCDF usability delta as does the equivalent R package[5], while being not as usably packaged.

* just drive NCL from python, in a manner similar to how I drive NCL from bash.

but that's a question for another thread, and probably OT for this list. (If so, feel free to ping me directly regarding python/NCL/netCDF.)

TIA, Tom Roche <Tom_Roche at pobox.com>

[1]: http://en.wikipedia.org/wiki/NetCDF
[2]: https://en.wikipedia.org/wiki/NCAR_Command_Language
[3]: http://rpy.sourceforge.net/
[4]: http://unidata.github.io/netcdf4-python/
[5]: http://cran.r-project.org/web/packages/ncdf4/
[6]: http://nco.sourceforge.net/
