12-03, 20:00–20:30 (UTC), Data / Data Science Track
Data scientists in the real world have to manage messy datasets that evolve over time. New data must be added, old data must be removed, and changes to columns must be handled gracefully. Furthermore, many real-world datasets grow from a size that works on a laptop to a size that must run on a server. This talk will show that in Python we can meet all of these challenges in a simple and scalable way, using the delta-rs package to manage data storage and Polars to read and write the dataset.
In the talk we begin by introducing two key concepts: first, the Parquet file format, and why its columnar layout makes it so well-suited for data analytics; second, the lakehouse, a way of managing datasets as tables of Parquet files regardless of whether the data is stored on your laptop or in the cloud.
We introduce Polars and see why it enables scalable data analysis in a way that Pandas does not. We then introduce the delta-rs lakehouse library and see how it implements the lakehouse concept using just Parquet and JSON files. We also see how delta-rs takes care of real-world data management challenges such as appending new data to a dataset, handling schema changes to columns, and time-travelling back to earlier versions of the dataset.
We then show how well these two libraries work together by building a simple machine learning pipeline. We see how we can write simple queries and let the Polars query optimiser greatly reduce the time required to read from even a large dataset.
No previous knowledge expected
Liam is Lead Data Scientist at Joulen, where he builds time series forecasting pipelines for renewable energy management. He communicates about cutting-edge data science with over 10,000 followers on social media. Liam has been a Polars contributor, focused on accessibility and documentation for new users. He also created the world's first online course in Polars, has taught over 3,000 learners to date on Udemy, and is the Polars instructor on the O'Reilly platform.