PyData and More Tools for Getting Started with Python for Data Scientists

It turns out that people are very interested in learning more about Python: our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So we wanted to share those recommendations, plus a few more, in a new post. As luck would have it, John Dennison, who co-authored this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned about at the two conferences.

PyData was a smaller conference that directly followed PyCon in Santa Clara. It was a great time, and it was wonderful to meet hackers from a range of disciplines who all share a love of the Python scientific computing stack. Plus, no one got fired. Given this reception, we definitely want to pass along these recommendations and some more of our own.

Also, Abhijit Dasgupta lent his incredible wealth of knowledge to putting this post together.

More Books

There are even more excellent books available than we originally mentioned. One reader reminded us about the “Think” series by Allen B. Downey: Think Python (an intro to the language), Think Complexity (a look at data structures and algorithms), and Think Stats (an intro to statistics). All three are available for free at www.greenteapress.com.

Travis Oliphant’s “Guide to NumPy” is also freely available at www.tramy.us. For data scientists, Wes “Pandas” McKinney’s “Python for Data Analysis” is probably a very good read. And, if you are interested in forking over some money, you might want to check out the following introductory Python/computer science text by John Zelle. (Note: DataCommunityDC is an Amazon Affiliate. Thus, if you click the image to the side and buy the book, we will make approximately $0.43 and retire to a small island.)

More IDEs

Spyder

Spyder, which was previously known as Pydee, is yet another IDE for Python, but with interactive testing, debugging, auto-completion, pylint checking, optional pep8 checking, and more. It looks pretty good and is available for numerous operating systems.

One reader, who might just work for the government, recommended the following two tools:

PyScripter

PyScripter is a free and open-source Python Integrated Development Environment (IDE) created with the ambition to become competitive in functionality with commercial Windows-based IDEs available for other languages. Being built in a compiled language, it is rather snappier than some of the other Python IDEs and provides an extensive blend of features that make it a productive Python development environment.

Portable Python

Ever needed a full programming language and environment on a USB Key?

Portable Python is a Python® programming language distribution preconfigured to run directly from any USB storage device, enabling you to have, at any time, a portable programming environment. Just download it, extract it to your portable storage device or hard drive, and in 10 minutes you are ready to create your next Python® application.

 

More Python Distributions

Who knew there were so many Python distributions out there? Apparently there are, and many are geared toward number crunching.

Python(x,y)

Majid mentioned this distribution, which is geared toward helping people make the switch from Matlab or compiled languages. If you are used to a full-blown IDE, this looks great. Installers are available for Windows and Linux.

Python(x,y) is a scientific-oriented Python distribution based on Qt and Spyder (see the Plugins page). Its purpose is to help scientific programmers used to interpreted languages (such as MATLAB or IDL) or compiled languages (C/C++ or Fortran) switch to Python. C/C++ or Fortran programmers should appreciate being able to reuse their code “as is” by wrapping it so it can be called directly from Python scripts.

Enthought Python Distribution (EPD)

EPD “provides scientists with a comprehensive set of tools to perform rigorous data analysis and visualization.” It is cross-platform and easy to install.

Anaconda

Continuum.io produces the cleverly named “Anaconda” distribution, which offers a “completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing.” But don’t just take their word for it. Abhijit and Eli both mentioned Anaconda, and one comment read:

“I second Hugo’s statement. If you go with the free version of Anaconda (AnacondaCE), it offers more functionality than the free version of EPD. Speed demons will be in for a treat with Anaconda as well. You will get Numba installed without all the hassles of source installation. Win-win in my opinion.”

Anaconda is cross-platform and supplies an installer program called “conda” (naturally), which parallels the functionality of pip or easy_install for a more limited set of data-science-related Python libraries. Anaconda is the brainchild of the same person who started NumPy, and who was part of Enthought for many years before starting Continuum, so there are parallels with EPD as well as advances.

 

Python and Macs

An additional comment for Mac users: the PyData ecosystem outlined here can be “difficult” to install on Mac OS X because of its dependencies. Chris Fonnesbeck has done a bang-up job putting them all together in a “Scipy Superpack”, downloadable from GitHub (http://fonnesbeck.github.com/S…). It makes life immeasurably easier, and it includes PyMC (for MCMC simulations) along with cutting-edge versions of the packages.

 

Python and Data

Scikit-Learn

Scikit-Learn is a wonderful package of machine learning algorithms. It has most of the big-name approaches and plenty that you have never heard of. The maintainers have released a great flow chart to guide you through your machine learning conundrum: http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html

One of the real selling points of scikit-learn is that the maintainers have done a wonderful job of keeping a consistent API across the different algorithms. In addition, one of the core committers, Olivier, gave a wonderful talk at PyData showing off scikit-learn’s parallel chops with the help of IPython’s parallel powers. When the video of the talk is released, we’ll be sure to post it.
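
To make the consistent-API point concrete, here is a minimal sketch, assuming scikit-learn is installed. The toy data and the choice of these two classifiers are ours for illustration, not taken from the talk or from the maintainers. Each estimator is trained with fit and queried with predict, so swapping algorithms comes down to changing one line.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up data: one feature, two well-separated classes.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

# Two unrelated algorithms that expose the same fit/predict interface.
models = [
    KNeighborsClassifier(n_neighbors=1),
    DecisionTreeClassifier(random_state=0),
]

predictions = {}
for model in models:
    model.fit(X, y)  # same training call for every estimator
    labels = model.predict([[1.5], [10.5]])  # same prediction call, too
    predictions[type(model).__name__] = labels.tolist()

for name, labels in predictions.items():
    print(name, labels)
```

Because the loop body never mentions a specific algorithm, adding a third estimator to the list should require no other changes.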

Rpy2

Rpy2 is a great library that opens up the immense R ecosystem to your Python scientific computing applications. By running an R interpreter alongside Python, you can access R objects, functions, and packages with native Python code. One word of warning: the data abstraction runs in one direction only, from R to Python. That is to say, if you pass large amounts of data from Python to R, it will result in a full copy. However, if you access R objects from Python, Rpy2 does a clever job of maintaining the memory locations of the R objects and exposing them to Python without duplicating the data.

I personally have found this library extremely helpful. For a heavy R user, it provides access to R libraries that might not have a direct counterpart in Python. Fellow DC2 board member Abhijit says that rpy2 “works pretty well. It’s developed in Linux and so works fine there. Windows was a bear to install, until I discovered someone with a solution (http://goo.gl/0WDEq). Get the best of both worlds here.”

 

Python and Giving Back Your Data

Flask

A key part of data science is giving the data and the results back to your audience in a meaningful way. While static documents such as PDFs can be useful, who wouldn’t rather send a URL to a full-blown web application built to demonstrate and explore the data? Luckily, Python has you covered, and I am not just talking about Django (“the web framework for perfectionists with deadlines”), which is relatively large and complex. I am talking about Flask, a microframework for Python that allows you to build web applications in a single source file. Want an example? Check out http://survey.datacommunitydc.org/ to see Flask in action. Also, if you want a fantastic and in-depth tutorial, check out this lengthy series of posts by Miguel Grinberg.
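
To give a feel for the single-source-file claim, here is a minimal sketch, assuming Flask is installed. The route name and the numbers are hypothetical, and this is not the code behind the survey site mentioned above.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical results that we want to hand back to an audience.
SCORES = [3, 5, 4, 5, 2]


@app.route("/summary")
def summary():
    # Serve a small JSON summary of the data at /summary.
    return jsonify(count=len(SCORES), mean=sum(SCORES) / len(SCORES))


# Exercise the route with Flask's built-in test client, so no server is needed.
# (Calling app.run() instead would start Flask's development server.)
with app.test_client() as client:
    response = client.get("/summary")
    payload = response.get_json()

print(payload)
```

The whole application, route included, fits in one file, which is the appeal for a quick data demo.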

 

Python and Insane Number Crunching Power

Numba

This is a project that John had not heard of until PyData, and he came away incredibly impressed. The basic idea is that, once you add type information to your Python code, Numba will translate your Python bytecode and feed it to the LLVM compiler to make mind-melting speed improvements. Fans of Julia know the impressive benchmarks that the LLVM stack can provide. At a minimum, the library requires a single function decorator:

from numba import autojit  # newer Numba releases fold this into a bare @jit

@autojit
def foo(bar):
    return bar * 2

With this “auto” version, the first time you call the function, Numba will inspect the type of ‘bar’, compile an LLVM version of the function, and link it to your Python interpreter. You do take a speed hit on that first call while the function compiles, but every later call uses the linked, compiled version. Obviously, this example is contrived, but the speed examples given were very impressive. A way to avoid this live-compile hit is to use the @jit decorator, which requires you to specify beforehand the types of the arguments that you will pass in. While the underlying engineering looks incredibly complex, at least to this humanities major turned hacker (John), the simple API promises to open up speed improvements previously only available to hand-tuned Cython code. Another Continuum contribution.
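
As a rough sketch of the @jit route, assuming a Numba release that accepts string signatures: the function and its signature below are made up for illustration. The try/except fallback is our addition so that the snippet still runs, uncompiled, on a machine without Numba.

```python
try:
    from numba import jit
except ImportError:
    # Assumption for portability: without Numba, fall back to a no-op
    # decorator so the function still runs as plain Python.
    def jit(*_args, **_kwargs):
        def decorator(func):
            return func
        return decorator


# Declaring the types up front ("returns float64, takes int64") lets Numba
# compile when the function is defined instead of on the first call.
@jit("float64(int64)", nopython=True)
def sum_of_squares(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(4)  # 0 + 1 + 4 + 9
print(result)
```

Tight numeric loops like this one are where compilation tends to pay off; whether it does for your code is worth measuring rather than assuming.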

PyCUDA

The GPGPU (general-purpose computing on graphics processing units) movement has been fascinating to watch ever since 2001. Dollar for dollar, GPUs offer an incredible amount of number-crunching capability for the right (read: easily parallelized) application. To put this in perspective, Nvidia’s new Titan GPU offers 4.5 teraflops of compute for $1,000 (2688 stream processors at 875 MHz and 6 GB of 6 GHz GDDR5)! And now you can access that number-crunching power from Python. PyCUDA gives you easy, Pythonic access to Nvidia’s CUDA parallel computation API.
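
As a back-of-the-envelope check on that figure, the sketch below multiplies out the numbers quoted above. The factor of two floating-point operations per stream processor per clock cycle (one fused multiply-add) is our assumption, not something stated in this post. The result lands in the same ballpark as the quoted 4.5 teraflops, a little above it; the quoted figure presumably assumes a somewhat lower clock.

```python
STREAM_PROCESSORS = 2688  # from the specs quoted in the post
CLOCK_HZ = 875e6          # 875 MHz, from the post
FLOPS_PER_CYCLE = 2       # assumption: one fused multiply-add per cycle
PRICE_USD = 1000          # from the post

# Peak throughput = processors x cycles per second x operations per cycle.
peak_flops = STREAM_PROCESSORS * CLOCK_HZ * FLOPS_PER_CYCLE
peak_tflops = peak_flops / 1e12
gflops_per_dollar = peak_flops / 1e9 / PRICE_USD

print(f"peak: {peak_tflops:.2f} TFLOPS")
print(f"value: {gflops_per_dollar:.2f} GFLOPS per dollar")
```

This is a theoretical peak; real applications only approach it when the workload parallelizes cleanly.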

Blaze

Blaze is the next generation of NumPy, being developed by Continuum. “Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays.” Continuum was recently awarded a $3M DARPA grant to develop Blaze, so I’m sure we’ll see many good things in short order.

 

Visualization

We had already mentioned matplotlib in the previous post. There are two other libraries you might want to look at.

Mayavi

Mayavi is a sophisticated open-source 3-D visualization tool for Python, produced by Enthought. It depends on some other Enthought products, which are part of the Anaconda CE distribution.

Bokeh

“Bokeh (pronounced boh-Kay) is an implementation of Grammar of Graphics for Python, that also supports the customized rendering flexibility of Protovis and d3. Although it is a Python library, its primary output backend is HTML5 Canvas.” Bokeh is trying to be ggplot for Python. It is in active development in the Continuum stable.

 

Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a series A funded health care analytics firm, and the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000 member organization of data professionals. Find him on LinkedIn and Twitter.