12-05, 17:30–19:00 (UTC), Data / Data Science Track
The goal of this tutorial is to make Gaussian processes (GPs) useful. In most practicing data scientists' mental map of modeling and machine learning techniques, Gaussian processes are an advanced approach that sits alone on an island, perhaps with narrow use cases like Bayesian optimization. Most books and other material on GPs focus on theoretical aspects, and it can be hard to close the gap between that theory and putting the ideas into practice to solve real problems in a reasonable amount of time.
This tutorial is split into two parts. The first part introduces Bayesian modeling, focusing on hierarchical modeling and the concept of partial pooling. We’ll use the classic example of estimating the batting averages of a group of baseball players as motivation. Then we’ll introduce GPs as a useful generalization of hierarchical modeling for the common situation where our groups aren’t distinct categories. Instead of treating the baseball players as completely distinct, exchangeable entities, we can use a GP to partially pool information locally by also considering each player's age. Finally, we’ll close the first part by connecting back to the more common introduction to GPs as infinite-dimensional multivariate normals.
The second part of the tutorial gives an overview of practical tips and tricks for modeling with GPs using the open source Python package PyMC. Specifically, we’ll address the two big issues with using GPs in practice: scaling and identifiability. We’ll discuss useful approximations like the HSGP and when to apply them, give advice on when to use splines, and cover when you need to step outside a PPL like PyMC or Stan into a GP-specific library like GPFlow or GPyTorch. We’ll do so with a couple of motivating examples. The audience should have some familiarity with basic ML and statistics concepts, such as probability distributions, the normal and multivariate normal distributions, correlation and covariance, and linear regression, but the talk will aim to be non-technical, and the goal is to introduce GPs and give people the tools they need to use them effectively in practice.
We’ll spend the first 15 minutes laying the groundwork for the tutorial. First we’ll introduce Bayesian modeling, starting with conditional probability and Bayes’ theorem. We’ll also introduce the motivating example that we’ll return to throughout the tutorial: estimating the early-season batting average of each player on a baseball team. We’ll close the introduction by describing PyMC, which we’ll use to do our modeling, and its computational backend PyTensor. This is a good opportunity to note the advances made to PyTensor’s capabilities in recent years, such as strong GPU support via the JAX backend and compatibility with a number of newer samplers such as Nutpie and Blackjax.
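For a flavor of what the starting point looks like in PyMC, here is a minimal no-pooling sketch; the hit and at-bat counts are made up for illustration, and the exact models in the tutorial may differ:

```python
import numpy as np
import pymc as pm

# Hypothetical early-season data: hits and at-bats for five players
hits = np.array([12, 9, 15, 4, 11])
at_bats = np.array([45, 38, 50, 20, 41])

with pm.Model() as no_pooling:
    # An independent Beta prior on each player's batting average
    theta = pm.Beta("theta", alpha=1.0, beta=1.0, shape=len(hits))
    # Binomial likelihood: hits out of at-bats
    pm.Binomial("obs", n=at_bats, p=theta, observed=hits)
    idata = pm.sample()
```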
Then (15 minutes) we introduce Bayesian hierarchical modeling. We compare the initial baseball example, where every player's batting average is estimated individually (no pooling), to estimating a single aggregate batting average for the whole team (complete pooling). Bayesian hierarchical modeling gives us the best of both worlds, improving the estimate of each player's batting average via the regularizing effect of the estimates from the other players on the team (partial pooling).
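A partial-pooling version of the earlier sketch might look like the following (again with made-up data); each player's estimate is shrunk toward a shared team-level mean:

```python
import numpy as np
import pymc as pm

hits = np.array([12, 9, 15, 4, 11])
at_bats = np.array([45, 38, 50, 20, 41])

with pm.Model() as partial_pooling:
    # Team-level parameters shared by every player
    mu = pm.Normal("mu", mu=0.0, sigma=1.5)        # team mean on the logit scale
    sigma = pm.HalfNormal("sigma", sigma=1.0)      # spread of players around the team mean
    # Non-centered player-level effects, shrunk toward the team mean
    eta = pm.Normal("eta", mu=0.0, sigma=1.0, shape=len(hits))
    theta = pm.Deterministic("theta", pm.math.invlogit(mu + sigma * eta))
    pm.Binomial("obs", n=at_bats, p=theta, observed=hits)
    idata = pm.sample()
```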
We spend the next 15 minutes motivating Gaussian processes as an extension of hierarchical modeling. In order to use hierarchical modeling, we need to treat the players as exchangeable. However, baseball players tend to peak in performance around age 26 and then decline, so once we take age into account the players are no longer exchangeable. Instead of partially pooling information across all players equally, we’d like our model to partially pool batting averages in a way that takes age into account.
The next segment (15 minutes) describes the role the covariance function plays in GPs. We show that the no-pooling model can equivalently be represented as a GP with the identity covariance, and the complete-pooling model as a GP with a constant covariance. To partially pool by age, we need to design a covariance function that gives higher covariance to players with similar ages, which leads us to re-discover the exponentiated quadratic covariance function with a learnable parameter, the lengthscale, that controls the age range over which we partially pool.
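In PyMC this covariance function is built in; a small sketch, with illustrative (not prescriptive) prior choices:

```python
import pymc as pm

# Exponentiated quadratic covariance:
#   k(x, x') = eta^2 * exp(-(x - x')^2 / (2 * ell^2))
# The lengthscale `ell` controls the age range we partially pool over.
with pm.Model():
    ell = pm.InverseGamma("ell", alpha=3.0, beta=5.0)  # lengthscale in years (illustrative prior)
    eta = pm.HalfNormal("eta", sigma=1.0)              # marginal standard deviation
    cov = eta**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ell)
```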
Then (10 minutes), we connect this construction of a Gaussian process to its more common introduction as an infinite-dimensional prior probability distribution over unknown functions. To do so, we’ll recast our problem as estimating batting average as a function of age. We’ll spend a little time discussing how the two views are equivalent, and close the section by extending our GP to multi-dimensional inputs by incorporating height and weight.
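Putting the pieces together, the function-of-age view might be written as a latent GP in PyMC like this (hypothetical data; the multi-dimensional version swaps in a column each for age, height, and weight):

```python
import numpy as np
import pymc as pm

# Hypothetical data: each player's age, hits, and at-bats
ages = np.array([[23.0], [27.0], [31.0], [22.0], [29.0]])  # shape (n_players, 1)
hits = np.array([12, 9, 15, 4, 11])
at_bats = np.array([45, 38, 50, 20, 41])

with pm.Model() as gp_model:
    ell = pm.InverseGamma("ell", alpha=3.0, beta=5.0)
    eta = pm.HalfNormal("eta", sigma=1.0)
    cov = eta**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ell)
    # A GP prior over f(age); for age + height + weight use an
    # (n_players, 3) input matrix and input_dim=3 above
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=ages)
    theta = pm.Deterministic("theta", pm.math.invlogit(f))
    pm.Binomial("obs", n=at_bats, p=theta, observed=hits)
    idata = pm.sample()
```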
The final segment (20 minutes) covers advice for using GPs in practice, using the HSGP approximation throughout. The choice of the prior on the lengthscale and the form of the covariance function can make or break a model in practice; particularly in settings where there isn’t much data, GPs are very sensitive to the choice of prior. We describe why heavy-tailed priors like the inverse-gamma or log-normal work so well, and show how to implement the penalized complexity (PC) prior. We also discuss the importance of the choice of covariance function, including the Matern family. Then we describe identifiability issues and how to resolve them when there’s an intercept present in the model, and the connection to our no-pooling baseball example when formulated as a GP. Finally, we briefly discuss the scaling problem of GPs, both in dataset size and covariance function complexity, and how these two issues motivate the HSGP approximation and GP-specific libraries like GPFlow and GPyTorch.
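As a rough sketch of the kind of HSGP model this segment works with (the data, the Matern 5/2 choice, and the prior values here are illustrative assumptions, not recommendations):

```python
import numpy as np
import pymc as pm

ages = np.array([[23.0], [27.0], [31.0], [22.0], [29.0]])

with pm.Model() as hsgp_model:
    ell = pm.InverseGamma("ell", alpha=3.0, beta=5.0)  # heavy-tailed lengthscale prior
    eta = pm.HalfNormal("eta", sigma=1.0)
    cov = eta**2 * pm.gp.cov.Matern52(input_dim=1, ls=ell)
    # HSGP approximates the GP with m basis functions on a domain
    # expanded by a factor c beyond the range of X
    gp = pm.gp.HSGP(m=[30], c=1.5, cov_func=cov)
    f = gp.prior("f", X=ages)
```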
No previous knowledge expected
Bill Engels is a Principal Data Scientist with PyMC Labs, with 10 years of experience in industry and an MS in Statistics from Portland State University. He enjoys all phases of data analysis and is particularly interested in Bayesian modeling and Gaussian processes.