12-03, 16:00–16:30 (UTC), Data / Data Science Track
We present “akimbo”, a library bringing a numpy-like API and vector-speed processing to dataframes on the CPU or GPU. When your data is more complex than simple one-dimensional columns, this is the most natural way to perform selection, mapping and aggregation without iterating over Python objects, saving large factors in both memory and processing time.
The Arrow columnar data model is much more general than a collection of 1D dataframe arrays/series: it can represent “deep” data with nested records and variable-length lists. The Arrow in-memory container model is now used by several dataframe libraries, such as pandas, polars and cuDF.
Awkward-array has an excellent API over data with nested records and variable-length lists, on the CPU or GPU. You can easily do numpy-style slicing, selection, ufunc mapping and aggregation, all in a way that is familiar to numerical Python practitioners.
Akimbo provides an accessor to bring this awkward-array functionality to dataframes, and has exactly the same API across pandas, polars, cuDF and dask-dataframe.
Akimbo offers integration with .str and .dt functions, which can be applied at any level of a nested schema without unbundling the structure or iterating.
Akimbo exposes Numba integration: automatic JIT compilation of Python functions containing iterative and accumulative algorithms over the nested/jagged structures within a dataframe. This allows vector-speed compute, on the CPU or GPU, even for algorithms that are hard to express with numpy idioms, while avoiding temporary arrays.
We support loading deep data from parquet, root, avro and json, efficiently skipping parts of the data tree that are not required. However, data that is not purely columnar is very common elsewhere: you may find it in (text, xml, …) log files, relational databases, and scientific/industrial measurements and time-series. We do not want to normalise or “explode” this data just to analyse it, wasting CPU and memory resources.
We also present akimbo-ip, a proof-of-concept project showing how to add type-specific functionality: vectorised operations on IP address and network types (v4 and v6) nested in deep structures. It serves as a model for adding functionality to akimbo via sub-accessors, in the same way as the .str and .dt methods. Novel type definitions and associated behaviours can also be applied at run-time and potentially stored in parquet metadata.
We’ll demonstrate runtimes and API use for a few interesting cases, including combining dataframe operations such as groupby/window with vectorised operations on the sub-selections of the data.
No previous knowledge expected