MIDAS Reproducibility Showcase
I was recently on a small team of Glotzer Lab members (peeps, as we say) that competed in the MIDAS Reproducibility Challenge. The purpose of this challenge was to “highlight quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields.” We prepared and submitted a report highlighting our efforts in this arena. Our submission was selected for presentation at the Reproducibility Showcase, and we gave an approximately 45-minute talk on our group’s approach to reproducibility. While the group’s full software stack promotes reproducibility through our professional software engineering practices and integration with the scientific Python ecosystem, signac is the project that most directly addresses this issue.
Our talk was organized as follows: Joshua Anderson gave an introduction to our software stack and software development practices, then I gave a short introduction to signac, followed by two case studies of research projects that represent our group’s efforts towards reproducible science. My section was short, 7 minutes, so I basically had enough time to introduce the signac framework and discuss how signac-flow promotes reproducible computational research. This was my first time giving a talk on signac, so despite the brevity of my talk, I found it difficult to put together exactly what I wanted to say. But as we all know, insight generally lies behind difficulty, and that was certainly the case here.
I had two major revelations while putting together and giving this talk. And by revelations, I mean things that I’ve known at a surface level but now fully appreciate. Or, as they say, my knowledge grew into wisdom. First, I more fully appreciate the ingenuity of signac’s workspace organization on the file system. By hashing the state point into a unique id for each job, you can truly manage an extremely heterogeneous data space with no additional overhead from the complexity of that data space. In principle, you could manage an entire Ph.D.’s worth of data in a single signac project (just because you can does not mean you should, but you certainly can). As an added benefit of this organization, your project’s metadata is stored in a completely human-readable format: no more trying to parse directory paths for metadata.
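To make the hashing idea concrete, here is a minimal sketch using only the Python standard library. It illustrates the principle and is not signac's actual implementation: the helper names job_id and init_job, the choice of MD5, and the file name statepoint.json are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def job_id(statepoint):
    """Hash a state point (a JSON-serializable dict) into a hex id."""
    # Sorting keys makes the encoding, and hence the id, independent of key order.
    encoded = json.dumps(statepoint, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


def init_job(workspace, statepoint):
    """Create workspace/<id>/ and store the state point inside it as JSON."""
    job_dir = Path(workspace) / job_id(statepoint)
    job_dir.mkdir(parents=True, exist_ok=True)
    # "statepoint.json" is a hypothetical file name chosen for this sketch.
    metadata = json.dumps(statepoint, indent=2, sort_keys=True)
    (job_dir / "statepoint.json").write_text(metadata)
    return job_dir


workspace = tempfile.mkdtemp()
# Two jobs with entirely different parameter schemas share one flat workspace.
lj = init_job(workspace, {"potential": "lj", "epsilon": 1.0, "kT": 0.5})
hard = init_job(workspace, {"shape": "cube", "packing_fraction": 0.55})
```

Because every directory name has the same form regardless of which parameters a job has, adding a new parameter or a whole new kind of simulation never requires reorganizing the existing directories, and the metadata sits in a plain JSON file rather than being encoded in the path.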
My second realization involves TRUE molecular simulations. There is a push within the molecular simulation community for simulations that are TRUE, that is, Transparent, Reproducible, Usable by others, and Extensible. I realized as I was talking (literally as I was giving the talk) that not only does signac make TRUE simulations easier to achieve, it actually makes it difficult to run simulations that aren’t TRUE. Once a signac-managed computational research project is completed, the researcher can deposit the project and workspace in a data repository (e.g., U-M’s Deep Blue). The indexing, searching, and filtering capabilities of signac then make the data both transparent and usable by others. The computational workflow defined by the signac-flow project means the project is completely reproducible (given enough computational resources, if necessary). And finally, since signac fits nicely into the scientific Python ecosystem, it is straightforward to extend the project. Hence, by using signac and signac-flow (name a more iconic duo… I’ll wait), you essentially get TRUE simulations for free.
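As a rough illustration of why metadata stored next to the data makes a deposited project searchable, here is a small standard-library sketch of filtering jobs by their parameters. It is not signac's search interface, which is considerably richer. The function find_jobs, the directory names, and the file name statepoint.json are hypothetical and exist only for this example.

```python
import json
import tempfile
from pathlib import Path


def find_jobs(workspace, criteria):
    """Yield (directory, state point) for jobs matching every key/value in criteria."""
    for path in sorted(Path(workspace).glob("*/statepoint.json")):
        statepoint = json.loads(path.read_text())
        if all(k in statepoint and statepoint[k] == v for k, v in criteria.items()):
            yield path.parent, statepoint


# Build a tiny throwaway workspace; directory names here are placeholders.
workspace = Path(tempfile.mkdtemp())
statepoints = {
    "job_a": {"potential": "lj", "kT": 0.5},
    "job_b": {"potential": "lj", "kT": 1.0},
    "job_c": {"shape": "cube", "kT": 0.5},
}
for name, statepoint in statepoints.items():
    (workspace / name).mkdir()
    (workspace / name / "statepoint.json").write_text(json.dumps(statepoint))

# Anyone who downloads the workspace can select data without knowing how it was laid out.
matches = list(find_jobs(workspace, {"kT": 0.5}))
```

In this sketch, someone who has never seen the project before can select the jobs they care about by parameter value alone, which is the sense in which the data becomes transparent and usable by others.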
All in all, this was a good experience and I am glad I participated. Given the ongoing reproducibility crisis and the über-relevant role that scientists play in the public’s response to crises, it is imperative that all researchers strive for reproducibility. For computational work, signac minimizes the overhead of this challenge, and as a result I am proud to be a part of a research group that makes this pursuit a top priority.