One of my greatest pet peeves in scientific computing

In my experience as a PhD student (and an undergraduate before that), I noticed that one of the largest sources of errors and lost time was code reuse via copy and paste. I'd need to do something, and instead of using a premade library or writing it myself, I'd copy the bits of source code that I needed from my post-doc. Half the time, it would work. The other half, I would be hunting bugs for hours, days, or even a week before it would compile or run correctly. Or even worse, I would realize something was wrong a month later when our physics results were obviously incorrect. And it's not just me - now that I'm the senior student, I see all the younger students doing the exact same thing.

As I've gotten more experienced and spent more time learning "good" coding practices, I've realized a few things. First, blindly copying code is a great way to introduce bugs in any scenario, but particularly in scientific analysis. We tend to have many more free, adjustable parameters per line of code than a general-purpose library or framework from the wider programming world, and those parameters can quickly slip out of your control. Second, the power of good abstractions and a well-defined, sane user interface for a library is universally underestimated by novice coders, and probably by many experienced coders as well.
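To make that concrete, here is a contrived C++ sketch of the kind of snippet that gets copied between analyses (the function name, cut values, and units are all invented for illustration). Every literal in it is a free parameter that travels silently with the paste:

```cpp
#include <cmath>

// Contrived track-selection cut. Each hard-coded literal below is an
// adjustable analysis parameter; pasting this function into another
// analysis silently copies every one of those choices along with it.
bool selectTrack(double pt, double eta, int nHits) {
  if (pt < 0.2) return false;             // minimum pT [GeV/c]
  if (std::abs(eta) > 1.0) return false;  // pseudorapidity acceptance
  if (nHits < 20) return false;           // minimum tracking hits
  return true;
}
```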

As experimental high-energy/nuclear physicists, we tend to use C++ almost exclusively (although we're dipping our toes into Python) for a few reasons. We certainly need all the performance we can get - we deal with multi-terabyte to petabyte datasets - and history plays a part: our foundational analysis tool as experimentalists is ROOT (https://root.cern.ch/), a C++ library. ROOT is a great set of libraries providing mathematical tools for functions, histograms, fitting, and statistical analyses, as well as high-performance data structures, storage, and I/O. One thing we are missing that would be quite powerful, however, is a coherent and correct way to share parts of analyses for reuse.
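For readers who haven't used it, here is a minimal sketch of the histogram-and-fit workflow ROOT enables, runnable as a ROOT macro (the macro name and sample size here are arbitrary):

```cpp
// Minimal ROOT sketch: fill a histogram with Gaussian random numbers
// and fit it. Save as gausFit.C and run with: root gausFit.C
#include "TH1F.h"
#include "TRandom3.h"

void gausFit() {
  TH1F h("h", "Gaussian sample;x;counts", 100, -5.0, 5.0);
  TRandom3 rng(0);                 // seed 0 -> unique time-based seed
  for (int i = 0; i < 10000; ++i)
    h.Fill(rng.Gaus(0.0, 1.0));    // mean 0, sigma 1
  h.Fit("gaus");                   // fit ROOT's built-in Gaussian
}
```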

I'm hoping to provide at least one option to help with this. I am writing a C++ library that I am calling scaffold (for reasons that should be clear in the next sentence). The goal of scaffold is to let you define an analysis in terms of unit operations, combine them into a graph structure that can be executed on arbitrary input data, and share those unit operations or subgraphs, along with their relevant parameters, correctly and painlessly.
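Scaffold's interface is still taking shape, so to give a flavor of the idea, here is a hypothetical sketch in plain C++. None of these types or names are scaffold's actual API, and the "graph" here is reduced to a linear pipeline:

```cpp
// Hypothetical sketch of the scaffold idea; not scaffold's real API.
// A "unit operation" transforms an event, and operations chain into a
// simple linear pipeline (a degenerate graph).
#include <functional>
#include <vector>

struct Event { double pt = 0.0; bool accepted = true; };

using UnitOp = std::function<Event(Event)>;

struct Pipeline {
  std::vector<UnitOp> ops;
  Event run(Event e) const {
    for (const auto& op : ops) e = op(e);  // apply each step in order
    return e;
  }
};

int main() {
  Pipeline analysis;
  double ptCut = 0.5;  // the parameter is captured by the operation
  analysis.ops.push_back([ptCut](Event e) {
    if (e.pt < ptCut) e.accepted = false;  // selection step
    return e;
  });
  Event e;
  e.pt = 1.2;
  Event out = analysis.run(e);
  return out.accepted ? 0 : 1;
}
```

The point of a design like this is that the cut value lives with the operation, so sharing the operation shares its parameters too, instead of scattering them through pasted code.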

Scaffold is in the *very* early stages, so I am not sure whether I have hit on a workable strategy yet. But I hope to have another post in a few weeks/months, where I can explain how the library is coming along, or where I am getting stuck :) Stay tuned!


Nick

Let's Learn About A.I.

Machines are making more decisions for us every day. Telling me when to leave to get to work on time, detecting cancer, and improving Google search results are just a small sample of the places in industry and academia where these algorithms are being developed and deployed. A.I. shows up in the news with greater frequency each year, and large corporations are spending massive amounts on talent acquisition and research. And yet, it seems the public doesn't understand what these algorithms are. In fact, even scientists and engineers who don't work in the field don't fully understand them (myself included).

This is a podcast for people who want to understand. I will try to guide us through this vibrant and active field of research, starting with the basics and working our way up to the state of the art. So buckle in, and get ready to learn together.