High-Performance Computing in R Workshop

Data Community DC and District Data Labs are excited to be hosting a High-Performance Computing with R workshop on June 21st, 2014 taught by Yale professor and R package author Jay Emerson.

If you’re interested in learning about high-performance computing including concepts such as memory management, algorithmic efficiency, parallel programming, handling larger-than-RAM matrices, and using shared memory this is an awesome way to learn!

To reserve a spot, go to http://bit.ly/ddlhpcr.

Overview
This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (containing snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by packages foreach and the associated parallel backends. The R package management system including the C/C++ interface and use of package Rcpp will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

What will I learn?
Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

An intensive review of the core language syntax and data structures for working with and exploring data. Functions; conditionals arguments; loops; subsetting; manipulating and cleaning data; efficiency considerations and best practices, including loops and vector operations, memory overhead and optimizing performance.

Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming effiency and parallel programming.

Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.

Is this class right for me?
This class will be a good fit for you if you are comfortable working in R and are familiar with R’s core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R’s apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

After this workshop, what will I be able to do?
You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarassingly parallel algorithms in R when the need arises, and you’ll be ready to exploit R’s C/C++ interface in several ways. You’ll be in a position to author your own R package can include C/C++ code.

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

What do I need to bring?
You will need your laptop with the latest version of R. I recommend use of the R Studio IDE, but it is not necessary. A few add-on packages will be used in the workshop. Packages Rcpp and foreach will be used. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW(all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required. Rcpp and use of the C/C++ interface requires compilers and extra tools; the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly in Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.

Instructor
John W. Emerson (Jay) is Director of Graduate Studies in the Departmentof Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges.

He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.

You can reserve your spot by going to http://bit.ly/ddlhpcr.

The following two tabs change content below.

Tony Ojeda

Manager, Data Analysis and Strategic Solutions at Follett Higher Education Group
Tony Ojeda is an accomplished data scientist and entrepreneur with expertise in business process optimization and over a decade of experience creating and implementing innovative data products and solutions. He has a Masters in Finance from Florida International University and an MBA with concentrations in Strategy and Entrepreneurship from DePaul University. He is the founder of District Data Labs, a co-founder of Data Community DC, and is actively involved in promoting data science education through both organizations.
This entry was posted in Events, R, Resources, Tutorials and tagged , , , . Bookmark the permalink.

One Pingback/Trackback