Recently, the journal PLOS Computational Biology published an excellent article by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig entitled “Ten Simple Rules for Reproducible Computational Research.” The ten rules, listed below, resonate strongly with my experience in both computational biology and data science.
Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
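To make a few of these rules concrete, here is a minimal Python sketch of how Rules 1, 3, 6, and 7 might look in a small script. It is not taken from the article: the function name `run_analysis`, the file names `samples.csv` and `provenance.json`, and the toy "analysis" (a mean of random draws) are all hypothetical choices made for illustration.

```python
import hashlib
import json
import platform
import random
import sys
from pathlib import Path


def run_analysis(seed: int, out_dir: str) -> dict:
    """Run a toy analysis and record how its result was produced.

    Everything here is illustrative: the analysis, the file names,
    and the provenance fields are assumptions, not a standard.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Rule 6: use a dedicated, explicitly seeded generator so the
    # run can be repeated exactly.
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(100)]
    mean = sum(samples) / len(samples)

    # Rule 7: store the raw numbers that any later plot would be drawn from.
    raw_path = out / "samples.csv"
    raw_path.write_text("value\n" + "\n".join(repr(x) for x in samples) + "\n")

    # Rules 1 and 3: record the seed, the command line, the interpreter
    # version, and a checksum of the raw data next to the result itself.
    provenance = {
        "seed": seed,
        "command": sys.argv,
        "python_version": platform.python_version(),
        "raw_data_sha256": hashlib.sha256(raw_path.read_bytes()).hexdigest(),
        "mean": mean,
    }
    (out / "provenance.json").write_text(json.dumps(provenance, indent=2))
    return provenance
```

Calling `run_analysis(42, "results/run_001")` twice with the same seed should produce identical `samples.csv` files, which the recorded checksum makes easy to verify. Committing the script itself to version control would then cover Rule 4.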
I highly recommend reading the full article in PLOS Computational Biology here.
The rules highlighted in bold above are those that seem to be part of an emerging movement to take best practices from software engineering, and even DevOps, and bring them into the more exploratory worlds of computational science and data science. This trend was evident at Strata + Hadoop World 2013 in NY during Wes McKinney’s excellent talk, “Building More Productive Data Science and Analytics Workflows.”
More on this trend in future blog posts but, for a quick tease, go and check out Drake, a “make for data.”