12-05, 12:00–12:30 (UTC), Data/ Data Science Track
Changing data is hard: The computer may crash, scripts could fail, and data structures could be changing. Relational data management systems provide transactional (“ACID”) guarantees that can be immensely useful for data analysis. DuckDB provides all-or-nothing semantics for changes to datasets and is robust against failures of any kind. In this talk, we will illustrate the usefulness of DuckDB’s transactional facilities in bringing sanity to changes in data analysis workflows in Python.
“Everything changes and nothing stays the same”. Yet somehow, when dealing with datasets, we often consider change as merely an afterthought. But very quickly, the world moves on, and the dataset needs to catch up to remain useful. Rows have to be inserted, deleted or updated. Often, changes are interconnected: A row in a table that maps orders to customers may not be very useful without the corresponding entry in the orders table. Schemas change, too: columns are added or removed, and data types change. In a data management environment, managing change is thus not optional. However, managing changes correctly is difficult. The Python data stack so far has no good solution for changing data. All too common are the wild collections of CSV and Parquet files that are somehow derived from each other and that get overwritten because something failed halfway through the ingestion of next week’s data. We can do better.
DuckDB is a novel open-source data management system specifically designed to run analytical SQL queries. DuckDB is special because it is deployed in-process, meaning that the entire data management system runs within the “host” process. DuckDB is deeply integrated with Python: DuckDB can read and write Pandas or Arrow data structures, and it can execute Python functions as User-Defined Functions. DuckDB also provides full transactional (“ACID”) guarantees by default without additional configuration. This means that datasets can be changed with all-or-nothing semantics, while keeping constraints intact, and by multiple concurrent users at the same time. DuckDB’s transactional machinery is uniquely designed to efficiently support large changes to already large datasets.
In my hands-on and hopefully informative talk, I will show the benefits of transactional semantics for data analysis workflows and how DuckDB is uniquely positioned to solve those common problems. To do so, I will walk through a series of real-world examples of changes to data and show how using formal transactions leads to a superior experience. The audience will hopefully come to appreciate safe yet efficient changes to data and be able to apply this in their daily lives. The talk is aimed at general data science and data engineering practitioners; no prior knowledge is required.
No previous knowledge expected
Prof. Dr. Hannes Mühleisen is a creator of the DuckDB database management system and Co-founder and CEO of DuckDB Labs. He is a senior researcher at the Centrum Wiskunde & Informatica (CWI) in Amsterdam. He is also Professor of Data Engineering at Radboud University Nijmegen.