Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need

On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled “Stepping up to big data with R and Python,” was an experiment in collective learning as Marck and I guided a lively discussion of strategies to leverage the “traditional” analytics stack in R and Python to work with big data.


R and Python are two of the most popular open-source programming languages for data analysis. R was developed as a statistical programming language and has a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general-purpose programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Both Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn’t even try to cover everything, and we didn’t believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analytic tasks.

There are several basic tasks we covered in the discussions: data import, visualization, MapReduce, and parallel processing. We noted that, since R is becoming something of a lingua statistica, many commercial products by SAP, Oracle, Teradata, Netezza and the like have developed interfaces to allow R as an analytic backend. Python, on the other hand, has been used to develop integrated analysis platforms due to its strengths as a “glue language” and its robust general capabilities and web development packages.

Most data scientists have had experience with small to medium data. Big Data poses its own challenges in terms of its size. Marck made the great point that Big Data is almost never directly used, but is aggregated and summarized before being analyzed, and this summary data is often not very big. However, we do need to use available tools a bit differently to deal with large data sizes, based on the design choices R and Python developers have made. R has earned a reputation for not being able to handle datasets larger than memory, but users have developed useful packages like ff and bigmemory to handle this. In our experience, Python reads data much more efficiently (orders of magnitude) than R, so reading data with Python and piping it to R has often been a solution. Both R and Python have well-established means of communicating with Hadoop, mainly leveraging Hadoop Streaming. Both also have well-developed interfaces to connect with both SQL-based and NoSQL databases. There was a lively discussion of various issues regarding using Big Data within R and Python, specifically with regard to Hadoop.
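To make the "aggregate first, analyze the summary" idea concrete, here is a minimal sketch of a Hadoop Streaming-style mapper and reducer in plain Python. It was not shown at the meetup; the input format (one `category,amount` pair per line) and the function names `map_lines`, `reduce_lines` and `run_locally` are assumptions made for illustration. Hadoop Streaming passes text lines between stages and sorts the mapper output by key before the reducer sees it, which `run_locally` imitates with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: turn 'category,amount' lines into 'category<TAB>amount' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        category, amount = line.split(",")
        yield f"{category}\t{amount}"


def reduce_lines(sorted_lines):
    """Reducer: sum the amounts per category; input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Imitate the map -> sort -> reduce flow on a small in-memory sample."""
    return list(reduce_lines(sorted(map_lines(lines))))


# Hypothetical sample data: three records collapse to two summary rows.
summary = run_locally(["a,1", "b,2", "a,3"])
```

On a real cluster, each function would presumably live in its own script that reads `sys.stdin` and prints to standard output, with the scripts passed to Hadoop Streaming through its `-mapper` and `-reducer` options. The summary that comes out is usually small enough to load into R or Python for the actual analysis.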

There is a basic stack of packages in both R and Python for data analysis, and many more packages for other analytic tasks. Both software platforms have huge ecosystems; so, to try and get you started on discovering many of the tools available for different data scientific tasks, we have developed preliminary maps of each ecosystem (click for a larger view, outlines with links, and to download):

R for big data

Python for big data

In fact, R can be used from within Python using the rpy2 package by Laurent Gautier, which has been nicely wrapped in the rmagic magic function in IPython. This allows R to be used from within an IPython notebook. (PS: If you’re a Python user and are not using IPython and the IPython notebook, you really should look into it.) There are several ways of integrating R and Python into unified platforms, as I’ve described earlier.
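As a rough illustration of calling R from Python, the sketch below evaluates a one-line R expression through rpy2's `robjects` interface. It assumes that rpy2 and a working R installation are present; the helper names `r_vector_literal` and `r_mean` are made up for this example, and the function simply returns `None` when rpy2 cannot be loaded.

```python
def r_vector_literal(values):
    """Format Python numbers as an R vector literal, e.g. c(1.0, 2.5)."""
    return "c(" + ", ".join(repr(float(v)) for v in values) + ")"


def r_mean(values):
    """Compute a mean in R via rpy2; return None if rpy2 or R is unavailable."""
    try:
        import rpy2.robjects as ro
    except Exception:
        # rpy2 is not installed, or it could not locate an R installation.
        return None
    # ro.r() evaluates a string of R code and returns an R vector.
    result = ro.r("mean(" + r_vector_literal(values) + ")")
    return result[0]


# Builds the R code "mean(c(1.0, 2.0, 3.0))" and evaluates it if possible.
answer = r_mean([1, 2, 3])
```

Building R code as a string is only sensible for tiny examples like this one. For real data, rpy2's conversion facilities or the rmagic extension mentioned above are likely a better fit.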

Our meetup, and the maps above, are intended as a launching pad for your exploration of R and Python for your data analysis needs. We will have video from this meetup available soon (stay tuned). Resources for learning R are widely available on the web. We have described Python’s capabilities for data science and data analysis in earlier blog posts, and Ben Bengfort has a series of posts on using Python for Natural Language Processing, one of its analytic strengths. We hope that you will contribute to this discussion in the comments, and we will compile different tools and strategies that you suggest in a future post.

 

Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years’ experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a Board Member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly R Users DC).


  • disqus_0DJY9tUnaA

    May you consider data.table for big data? It’s fast and efficient.

    • Abhijit Dasgupta

      Good one. data.table should definitely be in the list

  • Patrick Durusau

    Great maps but the background reduces readability. Any thoughts on a version sans the background?

    Thanks!

    Patrick

  • Seth @ FBT

    This is a very informative introduction. Thanks a lot Abhijit. I’ll be looking forward to more such posts from you.

  • Pingback: Climate Analysis for Building Design | funature blog

  • Dre Peters

    I love the information contained here, especially the mind map picture that shows us things at a glance. I’ve been searching for something like it for long. It shows a full grasp of the languages by you.

    Thumbs up for this nice job. I love it. I’d like to ask: would you advise someone who can already use Python for data analysis to stick with it, or to learn R as well?

    • Sean Patrick Murphy

      The Python data ecosystem keeps growing by the day and is getting close to offering feature parity with R. Personally, if you have the spare time, I would skip R and play with something more interesting, like Julia.

      • Dre Peters

        OK, thanks. I’ll check it out.

  • Martijn Theuwissen

    Regarding tools to learn R: check also http://www.datacamp.com. Free interactive introduction to R course