PyData Global 2024

Rapid deduplication and fuzzy matching of large datasets using Splink
12-04, 16:30–17:00 (UTC), Data / Data Science Track

Data deduplication is a ubiquitous data quality problem that most data practitioners will encounter at some point in their careers. It arises whenever multiple records are collected about the same person or other entity without a unique identifier that ties these records together.

This talk provides beginners with everything they need to start linking and deduping large datasets using Splink, a free Python library.


Until recently, no free tools existed that could link and deduplicate datasets of many millions of records quickly and accurately. Splink is designed to solve this problem, enabling datasets to be deduplicated and enriched by linking to other data sources.

This talk will focus on the key concepts and techniques beginners need to start solving practical data linkage problems, with illustrations using the Splink library.

Attendees will learn how to train linkage models effectively, and how large linkages can be achieved on a laptop with surprisingly few lines of code using the DuckDB backend. The talk assumes basic familiarity with tabular data in Python.

The high-level structure of the talk is as follows:

  • Why is data duplication a common problem?
  • Why is probabilistic linkage more powerful than other approaches like fuzzy matching?
  • How linkage models can be trained using unsupervised learning
  • Best practices and pitfalls to optimise speed and accuracy
  • Visualising results and diagnostics
  • Performance benchmarking results
  • Examples of real-life uses of Splink
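As a rough illustration of the probabilistic linkage idea in the outline above, the sketch below scores a pair of records in the Fellegi-Sunter style that Splink builds on: each column contributes evidence for or against a match, rather than a single fuzzy-similarity cutoff deciding the outcome. It is plain Python and does not use the Splink API. The column names, the m and u probabilities, and the prior are made-up values chosen for illustration; in practice these parameters are estimated from the data.

```python
import math

# Illustrative (made-up) parameters for three comparison columns.
# m = P(values agree | the two records are a true match)
# u = P(values agree | the two records are not a match)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.9, "u": 0.005},
    "city": {"m": 0.8, "u": 0.1},
}

# Assumed probability that two records drawn at random are a match.
PRIOR = 0.001


def match_weight(agreements):
    """Return the log2 odds of a match, given per-column agreement flags."""
    # Start from the prior odds, then add one log2 Bayes factor per column.
    weight = math.log2(PRIOR / (1 - PRIOR))
    for column, agrees in agreements.items():
        m = PARAMS[column]["m"]
        u = PARAMS[column]["u"]
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(bayes_factor)
    return weight


def match_probability(agreements):
    """Convert the match weight (log2 odds) into a probability."""
    odds = 2 ** match_weight(agreements)
    return odds / (1 + odds)


# Two hypothetical record pairs, described by which columns agree.
all_agree = {"first_name": True, "surname": True, "city": True}
only_city = {"first_name": False, "surname": False, "city": True}

print(f"all columns agree: {match_probability(all_agree):.4f}")
print(f"only city agrees:  {match_probability(only_city):.6f}")
```

Agreement on a column with a small u (a value unlikely to coincide by chance, such as surname) adds far more evidence than agreement on a column with a large u (such as city), which is the kind of distinction a single similarity threshold cannot express.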

Splink is developed by the UK Ministry of Justice and has been used widely by governments, academics, and companies around the world.


Prior Knowledge Expected

No previous knowledge expected

Robin Linacre is a data scientist at the UK Ministry of Justice and the lead author of Splink, a Python library for record linkage and deduplication at scale.