PyData Global 2024

Quan Nguyen

Quan is a Python programmer and machine learning enthusiast. He is interested in solving decision-making problems that involve uncertainty. Quan has authored several books on Python programming and scientific computing. He is currently working as a postdoctoral research associate at Princeton University, where he does research on machine learning methods for scientific discovery.

Sessions

12-03
16:30
30min
Cost-effective data annotation with Bayesian experimental design
Quan Nguyen

Unlike in the stylized machine learning examples of textbooks and lectures, data in real-world applications are often not readily available for training models and gaining insight; instead, practitioners must collect those data themselves.
However, data annotation can be expensive (in terms of time, money, or safety risk), which limits the amount of data we can obtain.
(Examples include eliciting an online shopper's preference with ads at the risk of being intrusive, or conducting an expensive survey to understand the market of a given product.)
Further, not all data are created equal: some are more informative than others.
For example, a data point that is similar to one already in our training set is unlikely to give us new information; conversely, a point that is different from the data we have thus far could yield novel insight.
These considerations motivate identifying the most informative data points to label, so that we gain knowledge while using our labeling budget as effectively as possible.
Bayesian experimental design (BED) formalizes this idea, using tools from Bayesian statistics and machine learning to answer the question: which data point would be the most valuable to label in order to improve our knowledge?

This talk serves as a friendly introduction to BED including its motivation as discussed above, how it works, and how to implement it in Python.
Along the way, we will show that binary search, a popular algorithm in computer science, is in fact a special case of BED.
Data scientists and ML practitioners who are interested in decision-making under uncertainty and probabilistic ML will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, and common probability distributions (normal, uniform, etc.).
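
To make the binary search connection concrete, here is a minimal sketch under a toy model that is our own assumption, not necessarily the one used in the talk: a hidden threshold sits at one of n positions, labeling position q reveals without noise whether q is at or past the threshold, and we start from a uniform prior. The helper names (entropy, expected_information_gain, bed_search) are hypothetical. At each step the sketch labels the position with the largest expected information gain, which under these assumptions is the midpoint of the remaining candidates.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def expected_information_gain(posterior, q):
    """Expected entropy reduction about the threshold from labeling position q.

    Assumed toy model: label(q) = 1 if q >= threshold, else 0, with no noise.
    """
    idx = np.arange(len(posterior))
    gain = entropy(posterior)
    # Label 1 keeps thresholds <= q; label 0 keeps thresholds > q.
    for mask in (idx <= q, idx > q):
        p_label = posterior[mask].sum()
        if p_label > 0:
            gain -= p_label * entropy(posterior[mask] / p_label)
    return gain


def bed_search(n, true_threshold):
    """Greedy BED loop: always label the most informative position."""
    posterior = np.full(n, 1.0 / n)  # uniform prior over threshold locations
    idx = np.arange(n)
    queries = []
    while posterior.max() < 1.0:
        gains = [expected_information_gain(posterior, q) for q in range(n)]
        q = int(np.argmax(gains))
        queries.append(q)
        label = q >= true_threshold  # the (noiseless) annotation we pay for
        keep = (idx <= q) if label else (idx > q)
        posterior = np.where(keep, posterior, 0.0)
        posterior /= posterior.sum()  # Bayesian update by elimination
    return int(np.argmax(posterior)), queries


found, queries = bed_search(16, 11)
print(found, queries)
```

With 16 candidate positions, the loop halves the set of plausible thresholds on every query and stops after four labels, the same behavior as binary search. With noisy labels or a non-uniform prior, the most informative query would generally no longer be the midpoint.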

Data/ Data Science Track