[School-of-data] Introduction -- Simon Cropper

Peter Murray-Rust pm286 at cam.ac.uk
Tue Jun 17 08:16:47 UTC 2014

I would go with R as it has (I assume) a much larger user community and
that's what really matters. I have copied in Karthik who runs workshops on
R and who should be able to give better advice than me.

There are probably > 1000 statistical techniques to choose from. Here the
most important thing IMO is to get a feel for your data. Using automatic
methods without knowing what you are doing is worse than useless.

Assuming you have a rectangular data set (each column is a type of
observation and each row is one observation) and no missing data then (IMO)
the first thing to do is principal components analysis. If you are fortunate
the data can be projected onto a 2-D plot which shows their spread. This
will immediately show outliers which are either interesting or (more likely)
bad data - typos, wrong units (e.g. using ounces instead of grams),
mismeasurement, etc. When you are confident that you have removed the worst
errors (and there *will* be errors) you can start to use a range of methods.
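
A minimal sketch of that first step in Python, using only NumPy. The data
are made up, the function name pca_scores is hypothetical, and the outlier
rule (distance from the median point in the score plane) is just one crude
choice, not a standard. Columns are centred but not rescaled here; if your
columns are in very different units you would probably standardise them
first.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Centre X and project its rows onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # centre each column
    # SVD of the centred data: the rows of Vt are the principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T  # coordinates for the 2-D plot
    explained = s**2 / np.sum(s**2)    # fraction of variance per component
    return scores, explained[:n_components]

# Hypothetical rectangular data: 50 observations (rows), 4 measurements (columns)
rng = np.random.default_rng(0)
X = rng.normal(loc=[100, 50, 10, 5], scale=[10, 5, 1, 0.5], size=(50, 4))
X[7, 0] *= 28.35   # simulate one value entered in the wrong unit

scores, explained = pca_scores(X)

# Crude outlier flag (an assumption, not a standard rule): distance from the
# median point in the 2-D score plane, compared with the typical distance
dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
flagged = np.flatnonzero(dist > 10 * np.median(dist))
print("variance explained by PC1, PC2:", explained)
print("rows to inspect by hand:", flagged)
```

Plotting scores[:, 0] against scores[:, 1] gives the 2-D picture; the
flagged rows are only candidates to look at by hand, not rows to delete
automatically.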

If you think there are discrete clusters of observations, use
classification or clustering methods. For example you might separate sweet
foods from curries by the ingredients. But always keep your brain active. I
suspect more garbage has been pumped into the scientific literature by bad
data analysis than bad instruments.
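
As a sketch of what unsupervised clustering looks like, here is a small
k-means in NumPy applied to invented "sweet food" and "curry" recipes. The
two ingredient columns (sugar and chilli amounts), all the numbers and the
function name kmeans are assumptions for illustration, and the
farthest-first starting centres are a simple choice that is itself
sensitive to outliers, so clean the data first.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-first starting centres."""
    X = np.asarray(X, dtype=float)
    centres = [X[0]]
    for _ in range(1, k):
        # distance of each point to its nearest already-chosen centre
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[d.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        # assign every point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its points (keep it if it has none)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Made-up recipes: columns are (sugar, chilli) amounts in arbitrary units
rng = np.random.default_rng(0)
sweets = np.column_stack([rng.normal(40, 3, 20), rng.normal(1, 0.3, 20)])
curries = np.column_stack([rng.normal(5, 2, 20), rng.normal(8, 1.5, 20)])
X = np.clip(np.vstack([sweets, curries]), 0, None)  # amounts cannot be negative

labels, centres = kmeans(X, k=2)
print("cluster of each recipe:", labels)
print("cluster centres (sugar, chilli):", centres)
```

The method will always return k clusters whether or not the data really
have them, which is exactly why the result needs to be checked against what
you know about the recipes.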

And NEVER confuse correlation with causation. It's easy to find false
correlations - highway accidents in the US correlate perfectly with the
import of melons from Mexico. That is chance, or at best both are
correlated with general socioeconomic trends.
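
To see how a shared trend manufactures a correlation, here is a small
sketch with two invented series (not real accident or melon figures). Each
is the same background trend plus its own independent noise. The raw series
correlate strongly, while their period-to-period changes, which remove the
common trend, do not.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100)                       # 100 time periods
trend = 1.0 * t                          # shared background trend
a = trend + rng.normal(0, 5, size=100)   # made-up series A
b = trend + rng.normal(0, 5, size=100)   # made-up series B, independent noise

# Pearson correlation of the raw series and of their first differences
r_levels = np.corrcoef(a, b)[0, 1]
r_changes = np.corrcoef(np.diff(a), np.diff(b))[0, 1]
print(f"correlation of raw series: {r_levels:.2f}")
print(f"correlation of changes:    {r_changes:.2f}")
```

Differencing is only one simple way of taking out a common trend; the point
is that neither series tells you anything about the other.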

On Tue, Jun 17, 2014 at 8:38 AM, Michael Bauer <michael.bauer at okfn.org>
wrote:

> Hi there,
> On Mon, Jun 16, 2014 at 08:32:42AM +0100, Peter Murray-Rust wrote:
> > > As I am exploring some new tools in Python, I have thought of doing
> > > this analysis using Pandas or something similar. The code would be
> > > integrated into iPython Notebooks so others could view the methodology
> > > and augment where necessary, and managed in a GitHub repository.
> > >
> > >
> > I'm not (yet?) an expert Pythonista but from the description of the
> > problem it sounds like you will need multivariate statistical methods.
> > There are lots of libraries - I would probably point you at R but Pandas
> > points you at http://statsmodels.sourceforge.net. I would probably start
> > with a Principal Components method to get an idea of the shape of the
> > data - are there serious outliers, etc. and then move to classification
> > methods - supervised and unsupervised, binary and multiple. You're almost
> > certainly going to have to deal with missing data.
> A while back we were thinking about introducing a more advanced framework
> for everyone who gets bored playing with spreadsheets ;) We were debating
> R vs. Python (although I'm a Python programmer I did most of my data
> work in R, as pandas didn't exist when I started out). Would you want to
> write a short introduction on python/pandas: what you need to start out
> and where to find further resources?
> Michael
> --
> Data Diva | skype: mihi_tr | @mihi_tr
> Open Knowledge | School of Data
> http://okfn.org | http://schoolofdata.org
> GPG/PGP key: http://tentacleriot.eu/mihi.asc

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge