“Ten Simple Rules for Reproducible Computational Research” – An Excellent Read for Data Scientists

Recently, the journal PLOS Computational Biology published an excellent article by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig entitled “Ten Simple Rules for Reproducible Computational Research.” The ten rules listed below resonate strongly with my experience in both computational biology and data science.

Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
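
To make a few of these rules concrete, here is a minimal sketch in Python of what Rules 1, 6, and 7 might look like in practice. It is my own illustration, not code from the article: the function name run_analysis and the fields of the record are hypothetical, and the "analysis" is just a toy average of pseudo-random numbers. The idea is simply that the seed, the command that was run, and the raw numbers travel together with the result.

```python
import json
import platform
import sys


import random


def run_analysis(seed, n=5):
    """Toy analysis (hypothetical): draw n pseudo-random numbers and average them."""
    # Use a dedicated generator seeded explicitly, so the run can be repeated (Rule 6).
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    return {
        "seed": seed,                                  # Rule 6: note the random seed
        "command": sys.argv,                           # Rule 1: how the result was produced
        "python_version": platform.python_version(),   # a small nod to Rule 3
        "samples": samples,                            # Rule 7: raw data behind any plot
        "mean": sum(samples) / n,
    }


if __name__ == "__main__":
    # Print the full record as JSON so it can be stored next to the result.
    print(json.dumps(run_analysis(seed=42), indent=2))
```

Running the script twice with the same seed yields the same samples, which is the whole point of writing the seed down.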

I highly recommend reading the full article in PLOS Computational Biology.

The rules highlighted in bold above seem to be part of an emerging movement to bring best practices from software engineering, and even devops, into the more exploratory work of computational and data science. This trend was evident at Strata + Hadoop World 2013 in NY during Wes McKinney’s excellent talk, “Building More Productive Data Science and Analytics Workflows.”

More on this trend in future blog posts but, for a quick tease, go and check out Drake, a “make for data.”
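
For readers who have not used make, the core idea behind a "make for data" tool can be sketched in a few lines of Python. This is not Drake's syntax or implementation, only a rough illustration under my own assumptions: the helper names is_stale and build are hypothetical, and staleness is judged purely by file modification times. A step is rerun only when its output is missing or older than one of its inputs, which removes a manual decision from the workflow (Rule 2).

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, step):
    """Run step(sources, target) only when the target is out of date.

    Returns True if the step ran, False if the target was already current.
    """
    if is_stale(target, sources):
        step(sources, target)
        return True
    return False
```

A step here is any function that reads the source files and writes the target, for example a cleaning script that turns a raw CSV into a tidy one.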




Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A-funded health care analytics firm and as the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000-member organization of data professionals. Find him on LinkedIn and Twitter.