Python for Data Analysis: The Landscape of Tutorials

Python has long been one of the premier general-purpose scripting languages and a major web development language. Its numerical, data analysis and scientific programming capabilities developed through the packages Numpy and Scipy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to Matlab. Numpy provides array objects, cross-language integration, linear algebra and other functionality. Scipy builds on this with optimization, further linear algebra routines, statistics and basic image analysis capabilities. Matplotlib provides sophisticated 2-D and basic 3-D graphics capabilities with Matlab-like syntax.
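As a rough illustration of the array, linear algebra and optimization features mentioned above, here is a minimal sketch. The matrix, the vector and the quadratic function are made-up toy inputs chosen for this example; they do not come from any of the tutorials listed in this post.

```python
import numpy as np
from scipy import optimize

# Numpy: array objects and linear algebra.
# Solve the toy system 3x + y = 9, x + 2y = 8.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Scipy: optimization.
# Find the minimum of the toy function f(t) = (t - 2)^2.
def f(t):
    return (t - 2.0) ** 2

result = optimize.minimize_scalar(f)

print("solution of A x = b:", x)
print("minimizer of f:", result.x)
```

The same arrays can be passed straight to Matplotlib for plotting, which is a large part of why the three packages are usually learned together.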


Further recent development has resulted in a rather complete stack for data manipulation and analysis that includes Sympy for symbolic mathematics, pandas for data structures and analysis, and IPython as an enhanced console and HTML notebook that also facilitates parallel computation.
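To give a flavor of what "data structures and analysis" means in pandas, here is a small sketch. The column names and values are hypothetical toy data invented for this example, not real figures.

```python
import pandas as pd

# A DataFrame is a labeled, table-like data structure.
# The rows below are hypothetical toy data.
df = pd.DataFrame({
    "city": ["DC", "DC", "NYC"],
    "attendees": [40, 60, 120],
})

# Split the rows by city and sum the attendees within each group.
totals = df.groupby("city")["attendees"].sum()

print(totals)
```

Typing the same lines into an IPython console or notebook is a convenient way to explore the intermediate objects one step at a time.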

An even richer data analysis ecosystem is quickly evolving in Python, led by Enthought and Continuum Analytics and several other independent and associated efforts. We have described this ecosystem here.

This week, as part of the Data Community DC meetups, Peter Wang from Continuum Analytics is presenting on the PyData ecosystem at Statistical Programming DC, and Jonathan Street from NIH is presenting on Scientific Computing in Python at Data Science MD.

How do you get started?

This ecosystem exists today and is still evolving, but how do you get started using these tools? Fortunately, there are several tutorials available, both as videos and as presentations. Hopefully they will put you on the path. This listing is of course incomplete and may not include your favorite tool. Tell us about it in the comments!

PyData Workshop 2012

The PyData Workshop 2012 was organized in NYC last October to bring together data scientists, scientists and engineers. It focused on “techniques and tools for management, analytics, and visualization of data of different types and sizes with particular emphasis on big data”. It was primarily sponsored by Continuum Analytics. The videos for this workshop are aggregated here.

PyData Silicon Valley

The follow-up PyData workshop was held alongside PyCon 2013 in Santa Clara, CA. The videos for the presentations are available here. The topics at the workshop included tutorials for pandas, matplotlib, PySpark (for cluster computing), scikit-learn, Wise.io, Disco (a MapReduce implementation), Naive Bayes, Nodebox, machine learning in Python, and IPython.

PyData Workshop 2013

The next PyData Workshop will be in Cambridge, MA, on July 27-28, 2013.

Tutorials for Particular Tools

Python for Data Analysis

  1. Getting started from Kaggle.com.

IPython

IPython notebooks have become the de facto standard for presenting Python analyses, as evidenced by the recent Scipy conference. There are several tutorials for learning IPython.

  1. The IPython tutorial
  2. Fernando Perez’s talk on IPython (and video)
  3. PyCon 2012 tutorial
  4. Interesting IPython notebooks
  5. IPython notebook examples

Python Data Analysis Library (pandas)

  1. The 10-minute introduction to pandas
  2. The pandas cookbook
  3. 2012 PyData Workshop
  4. The pandas documentation
  5. Randal Olson’s tutorial
  6. Wes McKinney’s tutorials 1 and 2 on Kaggle
  7. Hernan Rojas’ tutorial
  8. Tutorials on financial data and time series using pandas

Scikit-learn

  1. 2012 PyData Workshop
  2. Official scikit-learn tutorial
  3. Jacob VanderPlas’ tutorial
  4. PyCon 2013 tutorial on advanced machine learning with scikit-learn
  5. More scikit-learn tutorials

Matplotlib

  1. Official tutorial
  2. N.P. Rougier’s tutorial from EuroSciPy 2012
  3. Jake VanderPlas’ tutorial from PyData NYC 2012
  4. John Hunter’s Advanced Matplotlib Tutorial from PyData 2012
  5. A tutorial from Scigraph

Sympy

  1. Official tutorial
  2. SciPy 2013 presentations

Numpy and Scipy

  1. The Guide to Numpy
  2. M. Scott Shell’s Introduction to Numpy and Scipy

Databases from Python

  1. SQLite
  2. MySQL
  3. PostgreSQL
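As a quick taste of what the database tutorials above cover, here is a minimal sketch using the sqlite3 module from the Python standard library. The table name and columns are hypothetical, and the rows simply reuse the speakers and topics mentioned earlier in this post. MySQL and PostgreSQL need third-party driver packages, which are not shown here.

```python
import sqlite3

# An in-memory database keeps the example self-contained (no file is created).
conn = sqlite3.connect(":memory:")

# Hypothetical table for this example.
conn.execute("CREATE TABLE talks (speaker TEXT, topic TEXT)")

# Use "?" placeholders so values are passed separately from the SQL text.
conn.executemany(
    "INSERT INTO talks VALUES (?, ?)",
    [
        ("Peter Wang", "PyData ecosystem"),
        ("Jonathan Street", "Scientific Computing in Python"),
    ],
)
conn.commit()

# Query the topic for one speaker.
rows = conn.execute(
    "SELECT topic FROM talks WHERE speaker = ?", ("Peter Wang",)
).fetchall()

print(rows)
conn.close()
```

Query results come back as a list of tuples, which can be handed directly to pandas or Numpy for further analysis.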

Books

  1. Python for Data Analysis by Wes McKinney
  2. Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant

If you have any additional suggestions, please leave them in the comments section below!

Editor's Note: The book images link out to Amazon, of which we are an affiliate. Thus, if you click the link and buy the book, we get a single-digit percentage cut of the purchase. So please, click and buy the books ;)

Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years' experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a Board Member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly R Users DC).