Python vs R vs SPSS … Can’t All Programmers Just Get Along?

Programmers have long been very proud of and loyal to their tools, and often very vocal about them. This has led to well-contested rivalries and “fights” about which tool is better:

  • emacs or vi;
  • Java or C++;
  • Perl or Python;
  • Django or Rails;
  • and, for data geeks, the SAS/SPSS/R/Matlab fight.


The truth is, very few of us data geeks (data scientists, data analysts, statisticians, or whatever we call ourselves [editor note: Data Practitioners]) use only a single tool for all of our work. We will often extract data from a SQL database, munge it using Perl or Python, and then do statistical analysis using R or SAS, reporting the results using Word or, increasingly, the web. Especially for data analysis, there is often no single tool that can do the end-to-end workflow well, however much we would like to believe that there is. Each tool has its strengths and weaknesses, and often a mixture works best. The trick is in finding the right “glue” that can string our workflow together.
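To make the extract, munge, analyze chain concrete, here is a minimal sketch using only Python's standard library. The table name `measurements`, its columns, and the sample values are invented for illustration, and the per-group mean is only a stand-in for the statistical step that would normally be handed to R or SAS.

```python
import sqlite3
import statistics
from collections import defaultdict


def extract(conn):
    """Extract step: pull raw rows out of a SQL database."""
    return conn.execute("SELECT site, reading FROM measurements").fetchall()


def munge(rows):
    """Munge step: normalize site labels and drop missing readings."""
    cleaned = defaultdict(list)
    for site, reading in rows:
        if reading is None:
            continue  # skip rows with no measurement
        cleaned[site.strip().lower()].append(float(reading))
    return dict(cleaned)


def summarize(groups):
    """Analysis step: a per-group mean stands in for the R or SAS stage."""
    return {site: statistics.mean(values) for site, values in groups.items()}


# Hypothetical in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("North ", 1.0), ("north", 3.0), ("South", 4.0), ("South", None)],
)

summary = summarize(munge(extract(conn)))
print(summary)
```

Each stage is a separate function on purpose: any one of them could be swapped for a different tool without touching the others, which is the sense of “glue” meant here.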

There are now several interface packages available to talk between open-source languages. I’ll speak to the interfaces with R, which I’m most familiar with, but I’m sure that the community will point out other useful interfaces. R is neither the fastest nor the most elegant of languages, but it has by far the richest ecosystem of cutting-edge data analysis packages. There are now ways to communicate with R from other general programming languages like Java (through the rJava package and JNI), Perl (Statistics::R, available on CPAN), and Python (rpy2 and PypeR, available on PyPI). Packages in R also allow communication out to general-purpose languages, like RSPython and RSPerl (both available at Omegahat) and rJava. Most commercial statistical packages, like SAS, SPSS and Statistica, allow you to write R code to send to R and then get back the results. An especially nice SAS macro to do this for those without the latest versions of SAS is %Proc_R, available here. One can also call R from Matlab. There are also many ways of interfacing with R using web-based tools like Rserve or, on Windows, the rcom interface to utilize COM and connect with, among other things, Word and Excel.
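The packages above each have their own APIs, which are not reproduced here. As a rough illustration of the underlying idea, handing a snippet of code to R from another language and reading back the result, here is a minimal Python sketch that uses only the standard library. It assumes the `Rscript` command-line tool that ships with R is installed and on the PATH; the helper names `build_rscript_command` and `run_r` are hypothetical and belong to none of the packages mentioned.

```python
import shutil
import subprocess


def build_rscript_command(r_code, executable="Rscript"):
    """Build the argument list that asks Rscript to evaluate one expression."""
    return [executable, "-e", r_code]


def run_r(r_code, executable="Rscript"):
    """Run R code in a child process and return whatever it printed.

    Returns None when the executable cannot be found, so the sketch
    degrades gracefully on machines without R installed.
    """
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        build_rscript_command(r_code, executable),
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr)
    return completed.stdout


# Ask R for a mean; the text R prints comes back as a Python string.
output = run_r("cat(mean(c(1, 2, 3, 4)))")
print(output if output is not None else "Rscript not found; skipping the R call")
```

Passing text through a child process is the crudest form of glue. Dedicated interfaces such as rpy2 exist precisely so that data structures, not just printed text, can move between the two languages.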

More recently I have been excited about platforms where code can be written in different languages and integrated using literate programming (i.e., the weaving of the results of code with text to create reports).

  • Babel is a part of org-mode in Emacs which allows different programming languages to be used in the same document to perform an analysis and report. There are several examples of how this is done.
  • The latest IPython distribution now allows you to integrate other languages using user-contributed magic functions. The initial languages available are R, Octave and, very recently, Julia. The first two are already integrated into IPython. Using these magic functions, you can use the power of R, Octave and Julia along with all the tools available in Python like Numpy, Scipy, matplotlib, pandas and the like on one platform. Literate programming is easily achieved through the excellent HTML notebook that is now part of IPython distributions.
    Update: A sql magic function was just added to the ecosystem.

The interfacing tools I’ve described now allow us to create a greater ecosystem where different tools can be integrated toward a common goal rather easily. Instead of fighting over which tool is better, we’re now heading to a place where that doesn’t matter; what matters is being able to use the right tools for each piece of the job and getting the tools working together to do the best job possible. We can, after all, all get along.

PS: For translating code between Matlab/Octave, Python and R, there is a great little site called Mathesaurus.

 


Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years’ experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a Board Member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly R Users DC).


  • buggyfunbunny

    I’d add, under the general heading of “where’s my data?”, that PL/R (and a few similar tools) allows one to run R from within Postgres (and its extended versions). The ability to embed within the database, where the data is stored, is a good thing. While the various versions of the SQL standard support elementary stat functions, nothing beats doing it right.

  • http://twitter.com/richierocks Richie Cotton

    Being able to write code in any language and have it all magically work together was, of course, a major aim of Microsoft’s .NET. Not that that helps data analysts much, since the stats and numerics capabilities built into the .NET Framework are fairly rudimentary.

    Of course, if Microsoft put a bit more effort into IronPython, or if they’d successfully bought The Mathworks (TMW turned them down a few years ago), it might be a different story.

  • Majid al-Dosari

    I was just thinking of this problem for my data viz project! Lots of tools that think they are the center of the universe. However, I’ve concluded that Python is the best glue language. Unfortunately there seems to be a split b/w the sci&eng community and the stat community. The stats people would like to have R as the center of their universe, while the sci&eng people would like to have Python as the center of their universe. As a scientist and engineer, I’m going to treat R as nothing more than a library.

    ..which brings me to this: Could someone please comment on a technical or syntactic reason why R is a better language for expressing statistics than Python?

    • Sean Murphy
    • http://twitter.com/webbedfeet Abhijit Dasgupta

      As Sean points out, until recently, Python didn’t have the data structures to efficiently handle data sets of multiple types (R’s data.frame object). Pandas has filled that void to a great extent, but the flexibility of R’s list structures and the apply family of functions make data manipulation and munging much easier, something that in my experience Python lacks. The engineering community leverages the Numpy/Scipy/matplotlib stack, which is purely a numeric stack. R covers more than numerical data. You can handle categorical and string data with the same felicity as numeric data, and actually use them in models pretty easily. Python doesn’t have that flexibility yet. Try importing a dataset which has text fields, numeric fields, and categorical objects into Python easily and storing it in a single object (without Pandas). With R, that is a one-liner.

      R can be used through Python, as mentioned in the article.

      The statistician’s view is R centric, merely because R has developed an ecosystem and toolset that makes the statistician’s job much easier, much more so than the current Python stack, for diverse types of data analysis. R-centric, not R as be all and end all. Many of us do use Perl, Python, Java, Javascript, and even C++ to develop our code and munge our data. R just provides the core ecosystem which gives us most of the tools we need.

      • Majid al-Dosari

        numpy is not purely numeric. it’s had the ability to handle strings. you’ve also missed numpy’s structured array where you can create an array of arbitrary data types. i can easily argue that this is more flexible than R b/c it’s arbitrary and I can implement any structure i want and that includes dimensionality. this was available before pandas.

  • Majid al-Dosari

    Here is a series of posts that are VERY relevant to this discussion http://slendrmeans.wordpress.com/will-it-python/ . He is translating exercises in R to Python and commenting on the strengths and weaknesses of each. What a clever idea!
