PyData Global 2024

Scaling Outside the Warehouse Using DuckDB and Python
12-05, 19:30–20:00 (UTC), Data/ Data Science Track

DuckDB is revolutionizing data processing by enabling in-process OLAP SQL operations with a lightweight, dependency-free architecture. This talk explores how DuckDB can be used for large-scale, massively parallel data processing, ranging from hundreds of gigabytes to terabytes, outside traditional SQL and Spark warehouse systems. We will go over its integration with the Python ecosystem and demonstrate its scaling potential using cloud compute.


In today’s data landscape, scaling efficiently outside traditional SQL and Spark warehouse systems can significantly reduce compute costs. DuckDB, an in-process OLAP SQL engine, offers a lightweight, high-performance solution with no external dependencies. In this talk, I'll cover:

1.  Overview of DuckDB
2.  DuckDB in the Python Ecosystem: integration with pandas/Arrow and organized SQL pipelines
3.  Scaling with DuckDB + AWS Lambda
4.  Scaling with DuckDB + Coiled.io: Coiled.io streamlines the management of EC2 clusters.

This session is designed for data engineers, data scientists, and analytics professionals looking to enhance their data processing capabilities without relying on traditional warehouse systems.
Familiarity with Python, SQL, and cloud services like EC2 and Lambda will make the content easier to follow.


Prior Knowledge Expected

Previous knowledge expected

Adarsh is a data professional with 8+ years of experience. He is currently a Sr Data Scientist at A+E Networks.