One of my greatest pet peeves in scientific computing

In my experience as a PhD student (and an undergraduate before that), I noticed that one of the largest sources of errors and lost time was code reuse via copy and paste. I'd need to do something, and instead of using a premade library or writing it myself, I'd copy the bits of source code that I needed from my post-doc. Fifty percent of the time, it would work. The other half of the time, I would be hunting bugs for hours, days or even a week before it would compile or run correctly. Or even worse, I would realize something was wrong a month later when our physics results were obviously incorrect. And it’s not just me - now that I'm the senior student, I see all the younger students doing the exact same thing.

As I've gotten more experienced and spent more time learning "good" coding practices, I've realized a few things. Firstly, blindly copying code is a great way to introduce bugs in any scenario, but particularly in scientific analysis. Our code tends to have many more free/adjustable parameters per line than the more general libraries or frameworks you'd see in the larger programming world, and those parameters can quickly get out of your control. The second thing I've learned is that the power of good abstractions and a well-defined, sane user interface for a library is universally underestimated by novice coders, and probably by many more experienced coders as well.

We as experimental high energy/nuclear physicists use C++ almost exclusively (although we're dipping our toes into Python), for two main reasons. The first is performance: we deal with multi-terabyte to petabyte datasets, so we need all the speed we can get. The second is history: our foundational analysis tool as experimentalists is ROOT (https://root.cern.ch/), which is a C++ library. ROOT is a great set of libraries that provide mathematical tools for functions, histograms, fitting and statistical analyses, as well as high performance data structures, storage and I/O. One thing that we are missing that would be quite powerful, however, is a coherent and correct way to share parts of analyses so they can be reused.

I'm hoping to provide at least one option to help with this. I am writing a C++ library that I am calling scaffold (for reasons that should be clear in the next sentence). The goal of scaffold is to let you define an analysis in terms of unit operations that can be combined into a graph structure, which can then be executed on arbitrary input data. These unit operations or subgraphs, along with their relevant parameters, should then be easy to share correctly and painlessly.
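
To make the idea a bit more concrete, here is a minimal sketch of what "unit operations combined into a graph" could look like. This is not scaffold's actual API, which is still taking shape. Every name here (Node, Graph, add_input, add_op, run, scale_by, example_sum) is hypothetical, and the sketch makes two simplifying assumptions: each operation produces a single double, and nodes are added in dependency order so the graph can be executed front to back.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names come from scaffold.
using Op = std::function<double(const std::vector<double>&)>;

// One unit operation: reads the outputs of its upstream nodes, yields one value.
struct Node {
    std::string name;
    std::vector<std::size_t> inputs;  // indices of upstream nodes
    Op op;                            // empty for input (source) nodes
};

class Graph {
public:
    // Adds a source node whose value is supplied when the graph is run.
    std::size_t add_input(std::string name) {
        nodes_.push_back({std::move(name), {}, nullptr});
        return nodes_.size() - 1;
    }

    // Adds an operation. Its inputs must already exist, so insertion order
    // is always a valid execution (topological) order.
    std::size_t add_op(std::string name, std::vector<std::size_t> inputs, Op op) {
        for (std::size_t i : inputs) assert(i < nodes_.size());
        nodes_.push_back({std::move(name), std::move(inputs), std::move(op)});
        return nodes_.size() - 1;
    }

    // Executes every node once; input_values are consumed in the order the
    // input nodes were added. Returns the value computed at each node.
    std::vector<double> run(const std::vector<double>& input_values) const {
        std::vector<double> values(nodes_.size(), 0.0);
        std::size_t next_input = 0;
        for (std::size_t i = 0; i < nodes_.size(); ++i) {
            const Node& node = nodes_[i];
            if (!node.op) {
                assert(next_input < input_values.size());
                values[i] = input_values[next_input++];
            } else {
                std::vector<double> args;
                for (std::size_t j : node.inputs) args.push_back(values[j]);
                values[i] = node.op(args);
            }
        }
        return values;
    }

private:
    std::vector<Node> nodes_;
};

// A reusable unit operation whose parameter travels with it, instead of
// living in a copy-pasted snippet somewhere.
Op scale_by(double factor) {
    return [factor](const std::vector<double>& in) { return factor * in[0]; };
}

// Builds a tiny three-node graph and runs it on a single input value.
double example_sum() {
    Graph g;
    auto raw = g.add_input("raw");
    auto scaled = g.add_op("scaled", {raw}, scale_by(1.5));
    auto sum = g.add_op("sum", {raw, scaled},
                        [](const std::vector<double>& in) { return in[0] + in[1]; });
    return g.run({2.0})[sum];  // raw = 2.0, scaled = 3.0, sum = 5.0
}
```

The point of the sketch is that the adjustable parameter (the factor passed to scale_by) is bound to the operation itself, so sharing the operation shares the parameter with it. A real design would need typed inputs and outputs, error handling and some way to serialize subgraphs, none of which is attempted here.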

Scaffold is in the *very* early stages, so I am not sure if I have hit on a workable strategy yet. But I hope to have another post in a few weeks/months, where I can explain how the library is coming along, or where I am getting stuck :) Stay Tuned! 


Nick