PyData Global 2024

New Features in Apache Spark 4.0
12-04, 20:30–21:00 (UTC), Data/ Data Science Track

The upcoming Apache Spark 4.0 release delivers substantial new functionality and improves the developer experience of the Spark unified analytics engine.

Attendees will learn how to use Apache Spark 4.0's advancements for optimized data processing and analytics.


This presentation will highlight:

  • Python custom data sources.
  • Spark Connect for enhanced usability and debuggability.
  • Structured Logging for better error analysis and streamlined debugging.
  • Significant PySpark updates, including Python data source APIs, Arrow-optimized UDFs, polymorphic Python UDTFs, improved UDF profiling, and alignment with pandas 2.x for complex data workflows.
  • Expanded SQL capabilities through ANSI SQL compliance and new SQL Cache V2, UDF, and collation support.
  • Enhanced connectivity with a new native XML connector.
  • Improvements in real-time data processing with the Arbitrary State API v2 and the State Data Source reader for Structured Streaming.

Prior Knowledge Expected

No previous knowledge expected

Matthew Powers is a Developer Advocate.

He focuses on blogging, social media, coding, and community development for DataFusion, Polars, Spark, Delta Lake, and other related Data Engineering technologies.

He tries to teach in an easily digestible manner and to focus on core concepts.

He likes separating usage guides from theory, so learners who just want to get their job done are not bogged down with the theory.