[School-of-data] Introduction -- Simon Cropper

Simon Cropper simoncropper at fossworkflowguides.com
Thu Jun 19 04:29:58 UTC 2014


Hi Peter,

Thanks for the advice.

For the record, R and Pandas are not mutually exclusive. From what I 
understand, Pandas does not do principal components analysis or a raft 
of other types of statistical analysis, so anyone wishing to do these 
types of analysis will need to export the data to R or an equivalent 
package. From what I can work out, Pandas seems to be quicker at data 
wrangling than R, but I have not seen any benchmarks to support this 
opinion.
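
For anyone who wants to stay in Python for a first look, here is a 
minimal sketch of principal components computed with NumPy's SVD on a 
Pandas DataFrame. It is an illustration only, not something Pandas 
provides: the function name pca_scores, the column names and the data 
are all made up, and it assumes a purely numeric table with no missing 
values.

```python
import numpy as np
import pandas as pd


def pca_scores(df, n_components=2):
    """Project the rows of a numeric, complete DataFrame onto its
    first principal components (hypothetical helper, not a pandas API)."""
    # PCA is defined on mean-centred data, so centre each column first.
    centred = df - df.mean()
    # SVD of the centred matrix; the singular values carry the variance.
    u, s, vt = np.linalg.svd(centred.to_numpy(dtype=float), full_matrices=False)
    # Coordinates of each row along the leading principal axes.
    scores = u[:, :n_components] * s[:n_components]
    # Fraction of the total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    cols = ["PC%d" % (i + 1) for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=cols), explained[:n_components]


# Made-up data: two correlated measurements plus one row in the wrong units.
rng = np.random.default_rng(0)
x = rng.normal(100.0, 10.0, size=50)
data = pd.DataFrame({"weight_g": x, "energy_kj": 4.0 * x + rng.normal(0.0, 5.0, size=50)})
data.loc[50] = [3.5, 400.0]  # e.g. ounces entered instead of grams

scores, explained = pca_scores(data)
```

Plotting PC1 against PC2 from scores is then one way to look for rows 
that sit well away from the rest. In practice the columns would 
probably need scaling first if they are in very different units.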

The developers of Pandas have aimed to help people who may benefit from 
using both tools by providing the following resources:

**Pandas Documentation -- Comparison with R / R libraries**

"Since pandas aims to provide a lot of the data manipulation and 
analysis functionality that people use R for, this page was started to 
provide a more detailed look at the R language and its many third party 
libraries as they relate to pandas. In comparisons with R and CRAN 
libraries, we care about the following things:

   Functionality / flexibility: what can/cannot be done with each tool

   Performance: how fast are operations. Hard numbers/benchmarks are
   preferable

   Ease-of-use: Is one tool easier/harder to use (you may have to be
   the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users 
of these R packages."

http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html
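
To give a flavour of what such a translation looks like, here is a 
small sketch pairing a few common R idioms (in the comments) with 
rough Pandas equivalents. The data frame and its column names are 
invented for illustration, and the R lines are only approximate 
counterparts.

```python
import pandas as pd

# Invented data standing in for an R data.frame.
df = pd.DataFrame({
    "dish": ["korma", "vindaloo", "trifle", "fudge"],
    "kind": ["curry", "curry", "sweet", "sweet"],
    "sugar_g": [4.0, 3.0, 30.0, 60.0],
})

# R: subset(df, sugar_g > 3.5)
high_sugar = df[df["sugar_g"] > 3.5]

# R: aggregate(sugar_g ~ kind, data = df, FUN = mean)
mean_by_kind = df.groupby("kind")["sugar_g"].mean()

# R: df[order(-df$sugar_g), ]
ranked = df.sort_values("sugar_g", ascending=False)
```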

**Pandas Documentation -- rpy2 / R interface**

"If your computer has R and rpy2 (> 2.2) installed (which will be left 
to the reader), you will be able to leverage the below functionality. On 
Windows, doing this is quite an ordeal at the moment, but users on 
Unix-like systems should find it quite easy. rpy2 evolves in time, and 
is currently reaching its release 2.3, while the current interface is 
designed for the 2.2.x series. We recommend to use 2.2.x over other 
series unless you are prepared to fix parts of the code, yet the 
rpy2-2.3.0 introduces improvements such as a better R-Python bridge 
memory management layer so it might be a good idea to bite the bullet 
and submit patches for the few minor differences that need to be fixed."

http://pandas.pydata.org/pandas-docs/stable/r_interface.html
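
Where installing rpy2 is not practical, a plainer way to hand data to 
R is a CSV file. The sketch below is only an illustration with made-up 
data. It writes to an in-memory buffer so it runs anywhere; passing a 
file name to to_csv instead would produce a file that R's read.csv 
can load.

```python
import io

import pandas as pd

# Made-up table to hand over to R.
df = pd.DataFrame({"sample": ["a", "b", "c"], "value": [1.5, 2.25, 3.0]})

# index=False leaves out the row labels, which R would otherwise
# read as an extra unnamed column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the text back is a quick check that the round trip kept the values.
round_trip = pd.read_csv(io.StringIO(csv_text))
```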

In my opinion, data wranglers, data miners and data analysts should use 
whatever tools meet their needs. The more tools in your arsenal, the 
better the outcome is likely to be.

On 17/06/14 18:16, Peter Murray-Rust wrote:
> I would go with R as it has (I assume) a much larger user community and
> that's what really matters. I have copied in Karthik who runs workshops
> on R and who should be able to give better advice than me.
>
> There are probably > 1000 statistical techniques to choose from. Here
> the most important thing IMO is to get a feel for your data. Using
> automatic methods without knowing what you are doing is worse than useless.
>
> Assuming you have a rectangular data set (column headings are the type
> of observation and rows the observations) and no missing data then (IMO)
> the first thing is to do principal components. If you are fortunate this
> can be projected onto a 2-D plot which shows the spread of the data.
> This will immediately show outliers which are either interesting or
> (more likely) bad data - typos, wrong units, mismeasurement, etc. (e.g.
> using ounces instead of grams).  When you are confident that you have
> removed the worst errors (and there *will* be errors) you can start to
> use a range of methods.
>
> If you think there are discrete clusters of observations, use
> classification or clustering methods. For example you might separate
> sweet foods from curries by the ingredients. But always keep your brain
> active. I suspect more garbage has been pumped into the scientific
> literature by bad data analysis than bad instruments.
>
> And NEVER confuse cause with correlation. It's easy to find false
> correlations - highway accidents in US correlate perfectly with the
> import of melons from Mexico. Chance. Or at best both are correlated
> with general socioeconomic trends.
>
>
> On Tue, Jun 17, 2014 at 8:38 AM, Michael Bauer
> <michael.bauer at okfn.org> wrote:
>
>     Hi there,
>
>     On Mon, Jun 16, 2014 at 08:32:42AM +0100, Peter Murray-Rust wrote:
>      > > As I am exploring some new tools in Python, I have thought of
>      > > doing this analysis using Pandas or something similar. The code
>      > > would be integrated into iPython Notebooks so others could view
>      > > the methodology and augment where necessary, and managed in a
>      > > GitHub repository.
>      > >
>      > >
>      > I'm not (yet?) an expert Pythonista but from the description of
>      > the problem it sounds like you will need multivariate statistical
>      > methods. There are lots of libraries - I would probably point you
>      > at R but Pandas points you at http://statsmodels.sourceforge.net.
>      > I would probably start with a Principal Components method to get
>      > an idea of the shape of the data - are there serious outliers,
>      > etc. and then move to classification methods - supervised and
>      > unsupervised, binary and multiple. You're almost certainly going
>      > to have to deal with missing data.
>
>     A while back we were thinking about introducing a more advanced
>     framework for everyone who gets bored playing with spreadsheets ;)
>     We were debating R vs. Python (although I'm a Python programmer, I
>     did most of my data work in R (pandas didn't exist when I started
>     out)). Would you want to write a short introduction on
>     python/pandas: what you need to start out and where to find
>     further resources?
>
>     Michael
>
>     --
>     Data Diva | skype: mihi_tr | @mihi_tr
>     Open Knowledge | School of Data
>     http://okfn.org | http://schoolofdata.org
>     GPG/PGP key: http://tentacleriot.eu/mihi.asc
>     _______________________________________________
>     school-of-data mailing list
>     school-of-data at lists.okfn.org
>     https://lists.okfn.org/mailman/listinfo/school-of-data
>     Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
>
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
>


-- 
Cheers Simon

    Simon Cropper - Open Content Creator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages           http://www.fossworkflowguides.com/gis
    bash / Python    http://www.fossworkflowguides.com/scripting


