PyData Global 2024

Mastering Large NDArray Handling with Blosc2 and Caterva2
12-03, 11:30–13:00 (UTC), Data / Data Science Track

As data grows larger and more complex, efficient storage and processing become critical to achieving scalable and high-performance computing. Blosc2 (https://www.blosc.org), a powerful meta-compressor library, addresses these challenges by enabling rapid compression and decompression of large, multidimensional arrays (NDArrays). This tutorial will introduce the core concepts of working with Blosc2, focusing on how it can be leveraged to optimize both storage and computational performance in Python.

Attendees will learn how to:

  1. Efficiently create and manage large NDArrays, including options for persistence.
  2. Select the best codecs and filters for specific data types and workflows to achieve optimal compression ratios and performance.
  3. Perform computations directly on compressed data to save memory and speed up processing.
  4. Seamlessly share NDArrays using Caterva2, a versatile library designed to enable remote sharing and serving of multidimensional datasets.

This tutorial is ideal for Python developers working with large-scale data in scientific computing, machine learning, and other data-intensive fields.


In this hands-on tutorial, participants will explore advanced data handling and processing techniques using the Python wrappers of Blosc2 (https://www.blosc.org/python-blosc2/index.html), a high-performance compressor designed for large multidimensional arrays (NDArrays). With a focus on both efficiency and scalability, we will cover essential topics such as creating large, potentially persistent NDArrays, selecting suitable codecs and filters, and performing on-the-fly computations to minimize memory usage.

The tutorial will start by demonstrating how to create large NDArrays that can be stored either in memory or on disk for persistent use, taking advantage of Blosc2’s enhanced compression capabilities. Participants will learn how to select the appropriate codecs (e.g., ZSTD, LZ4, BLOSCLZ) and filters (e.g., ByteDelta, Shuffle, BitShuffle) depending on their data characteristics, ensuring the best compression/performance trade-offs.

Next, we will delve into how Blosc2 allows for efficient computations on compressed arrays, showing how to manipulate and analyze data without fully decompressing it, which can significantly reduce the memory footprint and increase speed.

Finally, the tutorial will introduce Caterva2 (https://ironarray.io/caterva2-doc/index.html), a free software library built on top of Blosc2 that allows for easy sharing and serving of NDArrays over a network via a Python API. Participants will learn how to use Caterva2 to create efficient workflows for collaborating on large datasets and serving them to other applications or services.

By the end of the session, attendees will have a practical understanding of how to harness the full potential of Blosc2 and Caterva2 to optimize large data workflows and improve both storage and computation performance.


Prior Knowledge Expected

No previous knowledge expected

I am a curious person who studied Physics and Math when I was young. Through the years, I developed a passion for handling large datasets and using compression to enable their analysis using regular hardware that is accessible to everyone.

I am the CEO of ironArray SLU and also lead the Blosc Development Team. I am currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. This way, users can choose whether they prefer a higher compression ratio, faster compression speed, or a balance between both.

As an Open Source believer, I started the PyTables project more than 20 years ago. Now, after 25 years in this business, I am the proud owner of two prizes that mean a lot to me.

You can learn more about what I am working on by reading my latest blog posts.