Rapid deduplication and fuzzy matching of large datasets using Splink PyData Global 2024

12-04, 16:30–17:00 (UTC), Data/ Data Science Track

Data deduplication is a ubiquitous data quality problem that most data practitioners will encounter at some point in their careers. It arises whenever multiple records are collected about the same person or other entity without a unique identifier that ties these records together.

This talk provides beginners with everything they need to start linking and deduping large datasets using Splink, a free Python library.

Until recently, no free tools existed that could link and deduplicate datasets of many millions of records quickly and accurately. Splink is designed to solve this problem, enabling datasets to be deduplicated and enriched by linking to other data sources.

This talk will focus on the key concepts and techniques beginners need to start solving practical data linkage problems, with illustrations using the Splink library.

Attendees will learn how to train linkage models effectively, and how large linkages can be achieved on a laptop with surprisingly few lines of code using the DuckDB backend. The talk assumes basic familiarity with tabular data in Python.

The high-level structure of the talk is as follows:

Why is data duplication a common problem?
Why is probabilistic linkage more powerful than other approaches like fuzzy matching?
How linkage models can be trained using unsupervised learning
Best practices and pitfalls to optimise speed and accuracy
Visualising results and diagnostics
Performance benchmarking results
Examples of real life uses of Splink
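
To give a flavour of the probabilistic linkage idea in the outline above, here is a minimal sketch in plain Python of Fellegi-Sunter style scoring, the model family that Splink builds on. It is not Splink's API. The field names, the m and u probabilities, and the prior are all hypothetical values chosen for illustration; in practice such parameters would be estimated from the data rather than written by hand.

```python
import math

# Hypothetical parameters for illustration only (not taken from the talk).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.9, "u": 0.005},
    "city": {"m": 0.8, "u": 0.2},
}

# Assumed prior probability that two records picked at random match.
PRIOR = 1 / 1000


def match_weight(agreements, params=PARAMS, prior=PRIOR):
    """Return the log2 odds of a match given per-field agreement flags."""
    # Start from the prior odds, then add the evidence from each field.
    weight = math.log2(prior / (1 - prior))
    for field, agrees in agreements.items():
        m, u = params[field]["m"], params[field]["u"]
        # Agreement contributes m/u; disagreement contributes (1-m)/(1-u).
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(bayes_factor)
    return weight


def match_probability(weight):
    """Convert a match weight (log2 odds) into a probability."""
    odds = 2 ** weight
    return odds / (1 + odds)


# Two candidate pairs that a simple "count the agreeing fields" rule
# would rank very differently from a probabilistic model.
names_agree = {"first_name": True, "surname": True, "city": False}
only_city_agrees = {"first_name": False, "surname": False, "city": True}

for label, pair in [("names agree", names_agree), ("only city agrees", only_city_agrees)]:
    w = match_weight(pair)
    print(f"{label}: weight={w:.2f}, probability={match_probability(w):.4f}")
```

The point of the sketch is that each field contributes evidence in proportion to how informative it is: agreement on a rare value such as a surname moves the score far more than agreement on a common value such as a city, and disagreement counts as evidence against a match.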

Splink is developed by the UK Ministry of Justice and has been used widely by governments, academics, and companies around the world.

Prior Knowledge Expected –

No previous knowledge expected

Robin Linacre

Robin Linacre is a data scientist at the UK Ministry of Justice and the lead author of Splink, a Python library for record linkage and deduplication at scale.
