12-04, 14:00–14:30 (UTC), General Track
“I like waiting for my build jobs,” said no one ever. CI is an essential part of ensuring quality, surfacing new issues before they can be merged into the main codebase. CI gives us confidence that proposed code changes don’t break things, at least as far as our tests cover. That confidence comes at the cost of time and compute resources.
The RAPIDS team at NVIDIA manages its own operations and compute resources. Those resources are limited, of course, so we wait our turn and put the toys back when we’re done. It is essential that we use our resources as efficiently as possible. This is the “Speed of Light” principle at NVIDIA: how close are you to a theoretical optimal limit? For CI, this involves several factors: startup wait time, Docker image setup time, cache utilization, build tool processes, and avoiding unnecessary rebuilds and test re-runs for things that haven’t changed. The RAPIDS team set out to add telemetry to all of our builds, so that we can quantify where we are spending our time and compute resources, and ensure that we are spending them wisely. We’ll demonstrate the telemetry tools that we’re using, and show how you can add them to your build jobs.
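For a flavor of what that can look like, here is a minimal sketch using OpenTelemetry’s Python SDK (one possible tooling choice; the span names and the worker-pool attribute are illustrative, not our actual schema):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Print spans to stdout; a real CI setup would export to a collector instead.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("ci.build")

    # Wrap each build phase in a span so its duration is recorded automatically.
    with tracer.start_as_current_span("build-job") as job:
        job.set_attribute("ci.worker_pool", "gpu-small")  # hypothetical attribute
        with tracer.start_as_current_span("docker-image-setup"):
            pass  # pull and prepare images here
        with tracer.start_as_current_span("compile-and-test"):
            pass  # run the actual build and tests here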
I’ll present an overview of the aggregated system of builds and resources that the RAPIDS team employs. The scale of this system has made it difficult to study holistically: we can reason about single jobs, but drawing conclusions about the larger system across repositories is hard. At the same time, we lacked timing information within build processes, such as conda-build, which could help us identify possible improvements to those build tools.
Our telemetry is designed to observe our builds at multiple levels of detail. At the broadest view, we’ll look at statistics for our worker pools, such as how long workers take to start jobs, and how well-matched our machine sizes are to their jobs. At a finer, per-project level, we’ll look at our caching performance, and whether particular configurations show aberrant behavior. At the finest level, we’ll use telemetry inside build tools to find ways to save time, such as pre-computing environments.
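As an example of the per-project view, the aggregation itself is straightforward once the telemetry exists; the records below are made up for illustration:

    import statistics

    # Hypothetical per-job records, as our telemetry might export them.
    jobs = [
        {"project": "cudf", "cache_hit": True,  "duration_s": 412.0},
        {"project": "cudf", "cache_hit": False, "duration_s": 1890.0},
        {"project": "cuml", "cache_hit": True,  "duration_s": 655.0},
    ]

    by_project = {}
    for job in jobs:
        by_project.setdefault(job["project"], []).append(job)

    for project, records in by_project.items():
        hit_rate = sum(r["cache_hit"] for r in records) / len(records)
        median = statistics.median(r["duration_s"] for r in records)
        print(f"{project}: cache hit rate {hit_rate:.0%}, median {median:.0f}s")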
Fundamentally, our “Speed of Light” is limited by Amdahl’s law, but the point of telemetry is to spot areas of our process where there may be “juice left to squeeze,” and to prioritize the biggest potential wins. We hope that sharing what we’ve learned helps others get the most out of the machines they use.
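To make that concrete, Amdahl’s law caps the overall speedup by the fraction of the job you actually improve; a quick back-of-the-envelope calculation (numbers are illustrative):

    def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
        """Overall speedup when only `fraction_improved` of the job gets faster."""
        return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)

    # If better caching makes 30% of a build 10x faster, the whole job is only
    # ~1.4x faster -- so telemetry should point us at the largest fractions first.
    print(round(amdahl_speedup(0.3, 10.0), 2))  # 1.37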
No previous knowledge expected
Michael started out as a chemist and electron microscopist, but went to the dark side of build infrastructure after experiencing the pain of trying to distribute the tools he wrote. He's been involved with Conda and Conda-forge for quite a while, and now works on the RAPIDS build infrastructure team at NVIDIA. He has a strange obsession with encoding compatibility into package management systems, and dreams of a world where no user ever has to wonder why they are missing symbols. Metadata is his love language.