[School-of-data] Introduction -- Simon Cropper
Simon Cropper
simoncropper at fossworkflowguides.com
Thu Jun 19 04:29:58 UTC 2014
Hi Peter,
Thanks for the advice.
For the record, R and Pandas are not mutually exclusive. From what I
understand, Pandas does not do principal components analysis or a raft
of other types of statistical analysis, so anyone wishing to do these
types of analysis will need to export the data to R or an equivalent
package. From what I can work out, Pandas seems to be quicker at data
wrangling than R, but I have not seen any hard numbers to support this
opinion.
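As a rough illustration of the point about principal components: pandas
itself has no PCA routine, but the calculation can be sketched with NumPy
on top of a DataFrame. The function name pca_scores, the column names and
the numbers below are all made up for illustration; this is a minimal
sketch that assumes a rectangular, purely numeric table with no missing
values, not a substitute for a proper statistics package.

```python
import numpy as np
import pandas as pd


def pca_scores(df, n_components=2):
    """Project the rows of a numeric DataFrame onto its leading principal components.

    Hypothetical helper for illustration; assumes no missing values.
    """
    # Centre each column, since PCA is defined on mean-centred data.
    centred = (df - df.mean()).to_numpy(dtype=float)
    # Singular value decomposition: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    # Coordinates of every observation along the retained axes.
    scores = centred @ vt[:n_components].T
    # Share of the total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    columns = ["PC%d" % (i + 1) for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=columns), explained[:n_components]


# Invented data: the last row has a sugar value that looks like a units error.
data = pd.DataFrame({
    "sugar_g": [10.0, 12.0, 11.0, 13.0, 300.0],
    "salt_g": [1.0, 1.2, 0.9, 1.1, 1.0],
    "fat_g": [5.0, 6.0, 5.5, 6.5, 6.0],
})
scores, explained = pca_scores(data)

# The row furthest from the origin along PC1 is the first candidate to check.
suspect_row = scores["PC1"].abs().idxmax()
```

The columns here are left unscaled so that the suspicious row dominates the
first component. With real data in mixed units it is usually sensible to
standardise each column first, and a 2-D scatter plot of PC1 against PC2 is
the natural next step.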
The developers of Pandas have aimed to help people who may benefit from
using both tools by providing the following resources:
**Pandas Documentation -- Comparison with R / R libraries**
"Since pandas aims to provide a lot of the data manipulation and
analysis functionality that people use R for, this page was started to
provide a more detailed look at the R language and its many third party
libraries as they relate to pandas. In comparisons with R and CRAN
libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are
  preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be
  the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users
of these R packages."
http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html
**Pandas Documentation -- rpy2 / R interface**
"If your computer has R and rpy2 (> 2.2) installed (which will be left
to the reader), you will be able to leverage the below functionality. On
Windows, doing this is quite an ordeal at the moment, but users on
Unix-like systems should find it quite easy. rpy2 evolves in time, and
is currently reaching its release 2.3, while the current interface is
designed for the 2.2.x series. We recommend to use 2.2.x over other
series unless you are prepared to fix parts of the code, yet the
rpy2-2.3.0 introduces improvements such as a better R-Python bridge
memory management layer so it might be a good idea to bite the bullet
and submit patches for the few minor differences that need to be fixed."
http://pandas.pydata.org/pandas-docs/stable/r_interface.html
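Where installing rpy2 is impractical, a plain CSV file is a lower-tech way
to hand data from Pandas to R. The sketch below is only one possible
approach: the helper name to_r_csv and the sample values are invented for
illustration, and the only R-side assumption is that read.csv treats the
string "NA" as missing, which is its default.

```python
import io

import pandas as pd


def to_r_csv(df):
    """Serialise a DataFrame as CSV text that R's read.csv can load.

    Hypothetical helper for illustration.
    """
    # index=False avoids an unnamed leading column in R;
    # na_rep="NA" writes missing values the way R expects them.
    return df.to_csv(index=False, na_rep="NA")


# Invented sample data with one missing value.
frame = pd.DataFrame({"dish": ["curry", "cake"], "sugar_g": [4.0, None]})
csv_text = to_r_csv(frame)

# Reading the text back with pandas confirms the missing value survives.
round_trip = pd.read_csv(io.StringIO(csv_text))

# In R, after saving csv_text to a file: food <- read.csv("food.csv")
```

Writing csv_text to disk and loading it with read.csv keeps the two tools
decoupled, at the cost of losing type information such as categoricals and
dates that a direct bridge could preserve.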
In my opinion, data wranglers, data miners and data analysts should use
whatever tools meet their needs. The more you have in your arsenal, the
better the outcome is likely to be.
On 17/06/14 18:16, Peter Murray-Rust wrote:
> I would go with R as it has (I assume) a much larger user community and
> that's what really matters. I have copied in Karthik who runs workshops
> on R and who should be able to give better advice than me.
>
> There are probably > 1000 statistical techniques to choose from. Here
> the most important thing IMO is to get a feel for your data. Using
> automatic methods without knowing what you are doing is worse than useless.
>
> Assuming you have a rectangular data set (column headings are the type
> of observation and rows the observations) and no missing data then (IMO)
> the first thing is to do principal components. If you are fortunate this
> can be projected onto a 2-D plot which shows the spread of the data.
> This will immediately show outliers which are either interesting or
> (more likely) bad data - typos, wrong units, mismeasurement, etc. (e.g.
> using ounces instead of grams). When you are confident that you have
> removed the worst errors (and there *will* be errors) you can start to
> use a range of methods.
>
> If you think there are discrete clusters of observations, use
> classification or clustering methods. For example you might separate
> sweet foods from curries by the ingredients. But always keep your brain
> active. I suspect more garbage has been pumped into the scientific
> literature by bad data analysis than bad instruments.
>
> And NEVER confuse cause with correlation. It's easy to find false
> correlations - highway accidents in US correlate perfectly with the
> import of melons from Mexico. Chance. Or at best both are correlated
> with general socioeconomic trends.
>
>
> On Tue, Jun 17, 2014 at 8:38 AM, Michael Bauer <michael.bauer at okfn.org> wrote:
>
> Hi there,
>
> On Mon, Jun 16, 2014 at 08:32:42AM +0100, Peter Murray-Rust wrote:
> > > As I am exploring some new tools in Python, I have thought of doing this
> > > analysis using Pandas or something similar. The code would be integrated
> > > into iPython Notebooks so others could view the methodology and augment
> > > where necessary, and managed in a GitHub repository.
> > >
> > >
> > I'm not (yet?) an expert Pythonista but from the description of the problem
> > it sounds like you will need multivariate statistical methods. There are
> > lots of libraries - I would probably point you at R but Pandas points you
> > at http://statsmodels.sourceforge.net. I would probably start with a
> > Principal Components method to get an idea of the shape of the data - are
> > there serious outliers, etc. and then move to classification methods -
> > supervised and unsupervised, binary and multiple. You're almost certainly
> > going to have to deal with missing data.
>
> A while back we were thinking about introducing a more advanced framework
> for everyone who gets bored playing with spreadsheets ;) We were debating
> on R vs. Python (Although I'm a python programmer I did most of my data
> work in R (pandas didn't exist when I started out)). Would you want to
> write a short introduction on python/pandas. What you need to start out and
> where to find further resources?
>
> Michael
>
> --
> Data Diva | skype: mihi_tr | @mihi_tr
> Open Knowledge | School of Data
> http://okfn.org | http://schoolofdata.org
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
> _______________________________________________
> school-of-data mailing list
> school-of-data at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/school-of-data
> Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
>
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
>
--
Cheers Simon
Simon Cropper - Open Content Creator
Free and Open Source Software Workflow Guides
------------------------------------------------------------
Introduction http://www.fossworkflowguides.com
GIS Packages http://www.fossworkflowguides.com/gis
bash / Python http://www.fossworkflowguides.com/scripting