Hendrik Makait PyData Global 2024

Hendrik Makait
.ical

Hendrik Makait is a data and software engineer building systems at the intersection of large-scale data management and machine learning. Currently, he works as an Open Source Engineer at Coiled, maintaining and improving Dask and its distributed execution engine. His focus areas include P2P shuffling, which allows shuffling large data at constant memory, and observability with Dask to help users understand and optimize their workloads.

Sessions

12-05

16:00

30min

Dask ❤️ Xarray: Geoscience at Massive Scale

Hendrik Makait

Doing geoscience is hard. It’s even harder if you have to figure out how to handle large amounts of data!

Xarray is an open-source Python library designed to simplify the handling of labeled multi-dimensional arrays, like raster geospatial data, making it a favorite among geoscientists. It allows these scientists to easily express their computations, and is backed by Dask, a Python library for parallel and distributed computing, to scale computations to entire clusters of machines.

People love using Xarray on Dask for geospatial workloads, but only up to about the terabyte scale. At this point, the stack can struggle, requiring expertise to work well and frustrating users and developers alike.

To address this and enable the Dask ❤️ Xarray stack to scale to hundreds of terabytes, we have recently designed a suite of large-scale geospatial benchmarks. With the help of these benchmarks, we are able to understand what limits performance within Dask and Xarray, and to address these issues.
In this talk, we will explore how Dask integrates with libraries like Xarray and Zarr to scale geospatial workloads and other multi-dimensional array computations.

We will also dive deeper into some of the bottlenecks in the Dask ❤️ Xarray stack that our benchmarks revealed, as well as some of the recent improvements we have made in these areas. With the help of our benchmark suite, we then assess the impact of these changes.

Join us to discover how Dask helps you scale geoscience workloads from your laptop to the cloud.

Data/ Data Science Track

Hendrik Makait .ical

Sessions

Hendrik Makait
.ical