Getting Started with Python for Data Scientists

With the R Users DC Meetup broadening its topic base to include other statistical programming tools, it seemed only reasonable to write a meta post highlighting some of the best Python tutorials and resources available for data science and statistics. What you don’t know is often the hardest part of picking up a new skill, so hopefully these resources will help make learning Python a little easier. Prepare yourself for code indentation heaven.

Python is such an incredible language because it can do practically anything, from high performance scientific computing to web frameworks such as Django or Flask. Python is heavily used at Google, so the language must be doing something right. And, similar to R, Python has a fantastic community around it and, luckily for you, this community can write. Don’t just take my word for it; watch the following video to fully understand.

 

Distributions

Python is available for free from http://www.python.org/ and there are two popular versions, 2.7 and 3.x. Which should you choose? I would go with either whatever is currently installed on your system or 2.7. For a better discussion, check out this site.

Commercial distributions that bundle and test various useful packages are also available, such as the Enthought Python Distribution. In the vendor’s own words: “This distribution provides a comprehensive, cross-platform environment for scientific computing with the Python programming language. A single-click installer allows immediate access to over 100 libraries and tools. Our open source initiatives include SciPy, NumPy, and the Enthought Tool Suite.”

Python Developer Tools

Getting started with a new programming language often requires getting started with a new tool to use the language, unless you are a hardcore vi, Vim, or Emacs person. Python is no exception, and there are a great number of editors and full-blown IDEs to try out:

Sublime Text 2 - If you have never used it, you should try this editor. “Sublime Text is a sophisticated text editor for code, markup and prose. You’ll love the slick user interface, extraordinary features and amazing performance.”

IPython provides a rich architecture for interactive computing with:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

NINJA-IDE (free) (from the recursive acronym: “Ninja-IDE Is Not Just Another IDE”), “is a cross-platform integrated development environment (IDE). NINJA-IDE runs on Linux/X11, Mac OS X and Windows desktop operating systems, and allows developers to create applications for several purposes using all the tools and utilities of NINJA-IDE, making the task of writing software easier and more enjoyable.”

PyCharm by JetBrains (not free) – the folks at JetBrains make great tools and PyCharm is no exception.

 

Learning Python

Learn about Packages

Python is known for its “batteries included” philosophy and has a rich standard library. However, because Python is such a popular language, third-party packages far outnumber those in the standard library. So it eventually becomes necessary to discover how packages are used, found, and created in Python.
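
To make "used" and "found" a little more concrete, here is a minimal sketch. It assumes a Python 3 interpreter (importlib.util is not available in 2.7), and the helper name is_installed is my own invention for illustration, not part of any library.

```python
import importlib.util
import json  # part of the standard library, so there is nothing to install

# Using a package: import it, then call what it exposes.
encoded = json.dumps({"language": "Python", "batteries": True})

# Finding a package: ask the import system whether a top-level name
# could be imported in the current environment.
def is_installed(name):
    """Return True if the top-level package `name` is importable here."""
    return importlib.util.find_spec(name) is not None

has_json = is_installed("json")
has_fake = is_installed("surely_not_a_real_package_name")
```

Here has_json comes back True and has_fake comes back False, which is a quick way to check whether a third-party package still needs to be installed.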

 

Package Management and Installation

Once you know a bit about packages, you will start installing them. There is no better way to get this done than with either the EasyInstall or pip package managers. pip is the recommended choice, as it is newer and seems to have broader support.

For Windows users, it is sometimes helpful to use the pre-built binaries maintained here: http://www.lfd.uci.edu/~gohlke/pythonlibs/

You will notice that not all packages have been ported to 3.x. This is true of many popular libraries and it is why 2.6 or 2.7 is recommended.

Virtualenv – learn it early and use it

Package management can be a pain point when working across systems or when deploying larger applications in production environments. For this reason, it is HIGHLY RECOMMENDED that you get comfortable with the wonderful virtualenv package. Here is a good intro to virtualenv for Ubuntu (for the Windows users… well, just go install Ubuntu). The basic idea is that each of your projects gets a self-contained Python environment which can be shipped to a new machine and carry its Gordian knot of dependencies with it.
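
One small thing that helps while getting used to isolated environments is checking whether your interpreter is actually running inside one. The sketch below is only that, a sketch: the function name in_virtual_environment is hypothetical, and it relies on the sys attributes that Python 3 and older virtualenv releases are known to set.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter appears to run inside an isolated environment."""
    # Python 3 environments expose sys.base_prefix, which differs from
    # sys.prefix inside an environment. Older virtualenv releases set
    # sys.real_prefix instead, so both are checked.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix

active = in_virtual_environment()
```

If active is False when you expected True, you probably forgot to activate the environment before starting Python.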

Python Koans – the zen of python

This project is great for those who want to dive right in. It is based on a Ruby project which presents the language as a series of failing unit tests. You must edit the source until each unit test passes. It is wonderful, and it doubles as an introduction to TDD (Test Driven Development) while you learn Python.

https://github.com/gregmalcolm/python_koans/wiki

 

Learn Python the Hard Way

Yes, here is an entire book on Python for free online, or you can upgrade for even more content and videos. And yes, the book is pretty good.

Welcome to the 3rd Edition of Learn Python the Hard Way. You can visit the companion site to the book at http://learnpythonthehardway.org/ where you can purchase digital downloads and paper versions of the book. The free HTML version of the book is available at http://learnpythonthehardway.org/book/.

 

Python’s Execution Model

If you want to dive deeper into the underlying execution model of Python, there is no better place to start than this fantastic post:

Those new to Python are often surprised by the behavior of their own code. They expect A but, seemingly for no reason, B happens instead. The root cause of many of these “surprises” is confusion about the Python execution model. It’s the sort of thing that, if it’s explained to you once, a number of Python concepts that seemed hazy before become crystal clear. It’s also really difficult to just “figure out” on your own, as it requires a fundamental shift in thinking about core language concepts like variables, objects, and functions.

In this post, I’ll help you understand what’s happening behind the scenes when you do common things like creating a variable or calling a function. As a result, you’ll write cleaner, more comprehensible code. You’ll also become a better (and faster) code reader. All that’s necessary is to forget everything you know about programming…
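
A tiny example of the kind of "surprise" that post is talking about (my own illustration, not code from the post): names in Python are labels attached to objects, so two names can refer to the same list, and a function can change an object its caller still holds.

```python
# Two names, one object.
a = [1, 2, 3]
b = a              # b is a second name for the same list, not a copy
b.append(4)
print(a)           # [1, 2, 3, 4]: the change is visible through both names
print(a is b)      # True

def rebind(items):
    # Assignment points the local name at a brand new object;
    # the caller's list is left alone.
    items = [0]
    return items

def mutate(items):
    # A method call changes the very object the caller refers to.
    items.append(99)

original = [1, 2]
rebind(original)
print(original)    # [1, 2]
mutate(original)
print(original)    # [1, 2, 99]
```

If you need an independent copy of a list, make one explicitly, for example with list(a).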

Python for Numerical and Scientific Computing

NumPy, SciPy, and matplotlib form the basis for scientific computing in Python.

NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
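
Here is a short sketch of those features in action, assuming NumPy is installed. All of the numbers, city names, and field names are made up for illustration.

```python
import numpy as np

# An N-dimensional array: 3 samples by 2 features.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Broadcasting: the length-2 vector of column means is applied to every row.
centered = data - data.mean(axis=0)

# Linear algebra: solve the 2x2 system A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# A user-defined data type: each element is a (city, temperature) record.
record = np.dtype([("city", "U16"), ("temp_c", "f8")])
readings = np.array([("Washington", 21.5), ("Baltimore", 19.0)], dtype=record)
mean_temp = readings["temp_c"].mean()
```

The solution x works out to 2 and 3, and each column of centered averages to zero.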

 

SciPy

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world’s leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!
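
As a quick taste of the integration and optimization routines mentioned above, here is a minimal sketch assuming SciPy is installed. The functions being integrated and minimized are arbitrary examples of mine.

```python
import math
from scipy import integrate, optimize

# Numerical integration: the area under sin(x) between 0 and pi (exactly 2).
area, error_estimate = integrate.quad(math.sin, 0.0, math.pi)

# Optimization: minimize a simple parabola whose minimum sits at x = 2.
def parabola(x):
    return (x - 2.0) ** 2 + 1.0

result = optimize.minimize_scalar(parabola)
best_x = result.x
best_value = result.fun
```

quad returns both the estimate and an error bound, and minimize_scalar returns a result object whose x and fun attributes hold the location and value of the minimum.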

 

Matplotlib

matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

 

Python for Data

Pandas

Pandas is really the Python approximation to R, although most would argue that it isn’t yet as full featured as R. Or, in the words of the website, “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
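
For R users, the DataFrame will feel familiar. The sketch below assumes pandas is installed; the column names and values are made up for illustration.

```python
import pandas as pd

# A small data frame, much like an R data.frame.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 30.0],
})

# Split-apply-combine: the mean of "value" within each group.
means = df.groupby("group")["value"].mean()

# Boolean filtering, similar in spirit to subsetting rows in R.
large = df[df["value"] > 2.0]
```

Here means holds 2.0 for group a and 20.0 for group b, and large keeps the three rows whose value exceeds 2.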

 

Statsmodels

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.

 

Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A funded health care analytics firm, and the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000 member organization of data professionals. Find him on LinkedIn and Twitter.

  • http://twitter.com/richmanmax max richman

    Great post. Would also add PyScripter as a solid IDE for Windows users (by choice or necessity) as well as PortablePython for running Python scripts from a USB without needing administrative access. Can you tell that I work in and with Government?

  • http://twitter.com/webbedfeet Abhijit Dasgupta

    One more python package that is increasingly useful for data scientists is the actively developed scikit-learn (http://scikit-learn.org) package, which has put several popular ML methods under one roof. Not necessarily the fastest, but pretty darn good.

  • Sean Murphy

    With all of these great recommendations, we will have to do a follow up post. Keep ‘em coming.

  • http://twitter.com/webbedfeet Abhijit Dasgupta

    Actually one more…rpy2, which is a python to R interface that works pretty well. It’s developed in Linux and so works fine there. Windows was a bear to install, until I discovered someone with a solution (http://goo.gl/0WDEq). Get the best of both worlds here.

  • http://twitter.com/webbedfeet Abhijit Dasgupta

    An additional comment for Mac users. The pydata ecosystem outlined here can be “difficult” to install because of dependencies on Mac OSX. Chris Fonnesbeck has done a bang-up job putting them all together in a “Scipy Superpack”, downloadable from github. (http://fonnesbeck.github.com/ScipySuperpack/). Makes life immeasurably easier. Includes PyMC (for MCMC simulations), and includes cutting edge versions of the packages.

  • http://twitter.com/aDataHead A Data Head

    I would add ‘Think Python’ (intro to Python), ‘Think Complexity’ (data structures and algorithms in Python) and ‘Think Stats’ (intro to stats/Bayes’ stats in Python) all by Allen B Downey and available free at http://www.greenteapress.com

  • http://www.facebook.com/AjitJaokar Ajit Jaokar

    cool! thanks! love this

  • Hugo Shi

    Hi, you should really check out the Anaconda Python distribution, produced by Continuum Analytics (I work for them). Our distribution is free, well supported, and provides way more packages than EPD Free (things like scikits.learn).

    http://continuum.io/downloads.html

    Our ceo is the guy that wrote NumPy, so that gives you an idea of the perspective we come from.

    • Eli Bressert

      I second Hugo’s statement. If you go with the free version of Anaconda (AnacondaCE), it offers more functionality than the free version of EPD. Speed demons will be in for a treat with Anaconda as well. You will get Numba installed without all the hassles of source installation. Win-win in my opinion.

    • http://twitter.com/webbedfeet Abhijit Dasgupta

      I’ve actually been playing a bit with Anaconda, and like it quite a bit.

  • Majid al-Dosari

    http://www.pythonxy.com gives you a lot of this stuff in a distribution +GUI +database access +IDE

  • Matthew Batterton

    I use Spyder as my everyday IDE. It offers great auto completion, pylint checking, and optional pep8 checking.

    https://code.google.com/p/spyderlib/

  • Bill Bruns

    It is surprising that this article/metapost highlights the EPD distribution. Recently at the PyData conference, I had the opportunity to use all three free commercial distributions listed in these comments: Anaconda, Python(xy), and EPD. I agree with the other comments here: Anaconda and Python(xy) both include the numpy and scipy packages, but unfortunately EPD is lacking. It does not contain them, so for EPD they must be installed separately. Even the EPD download for the PyData conference does not contain them. Strange but true.

  • http://twitter.com/acannedham A Canned Ham

    No post about data analysis and python should go by without mentioning NLTK for text analytics.

  • buttonwoodth

    Cool!
