Bridging Big Data and AI: Empowering PySpark with Lance Format for Multi-Modal AI Data Pipelines | PyData Global 2024


12-04, 14:00–14:30 (UTC), AI/ML Track

By unifying PySpark's robust big data processing and analytics capabilities with Lance's multimodal AI data lake, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

PySpark has long been a cornerstone of big data processing, excelling at data preparation, analytics, and machine learning tasks within traditional data lake ecosystems. However, the rise of multimodal AI and vector search introduces new challenges that go beyond PySpark's native capabilities. Spark’s new Python data source API opens the door to integration with emerging AI data lakes built on the multimodal Lance format. Lance offers zero-copy schema evolution and robust support for large records (e.g., images, tensors, and embeddings), which simplifies multimodal data storage. Its advanced indexing for semantic and full-text search, combined with fast random access, enables high-performance AI data analytics at the SQL level. This combination bridges the gap between traditional big data processing and the demands of modern AI workloads, offering a streamlined approach to handling complex, multimodal datasets. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
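To make the integration point concrete, the following is a minimal sketch of what a reader built on Spark's Python data source API (the `pyspark.sql.datasource` module in PySpark 4.0+) can look like. It is not the speakers' implementation and does not call Lance: the names `ToyLanceDataSource`, `ToyLanceReader`, and `FAKE_FRAGMENTS` are hypothetical, and the in-memory dictionary only stands in for the fragments of a real Lance dataset. The fallback stub classes are an assumption added so the reader logic can be exercised without Spark installed.

```python
try:
    # Python data source API, available in PySpark 4.0+.
    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
except ImportError:
    # Minimal stand-ins (assumption) so the sketch runs without Spark.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


# Hypothetical in-memory "fragments" standing in for a Lance dataset.
FAKE_FRAGMENTS = {
    0: [(1, "cat.png"), (2, "dog.png")],
    1: [(3, "bird.png")],
}


class ToyLanceReader(DataSourceReader):
    def partitions(self):
        # One Spark input partition per fragment, so fragments can be read in parallel.
        return [InputPartition(fragment_id) for fragment_id in sorted(FAKE_FRAGMENTS)]

    def read(self, partition):
        # Yield one tuple per row; tuple order matches the declared schema.
        for row in FAKE_FRAGMENTS[partition.value]:
            yield row


class ToyLanceDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...).
        return "toy_lance"

    def schema(self):
        # Schema given as a DDL string.
        return "id INT, image_uri STRING"

    def reader(self, schema):
        return ToyLanceReader()
```

With a running Spark 4.0+ session, a source like this would typically be registered via `spark.dataSource.register(ToyLanceDataSource)` and then loaded with `spark.read.format("toy_lance").load()`. A real connector would replace the dictionary lookup with reads against the Lance dataset.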

Prior Knowledge Expected: No previous knowledge expected

Lu Qiu

Lu is a database engineer at LanceDB, where she builds distributed vector databases and integrates Lance with the big data ecosystem (Spark, Trino). She has also developed the distributed system Alluxio, serving as a PMC maintainer. She is a Data on Kubernetes Ambassador and Kubernetes community evangelist, bridging AI data infrastructure with cloud-native technologies.

Allison Wang

Allison Wang is a Software Engineer at Databricks and an Apache Spark Committer, specializing in Spark SQL and PySpark. She’s passionate about bridging Python with the big data ecosystem. Allison holds a bachelor’s degree in Computer Science from Carnegie Mellon University.
