PyData Global 2024

Building a Real-Time Data Pipeline with Flink, Druid, and Python
12-05, 10:30–12:00 (UTC), Data / Data Science Track

Drowning in data? Struggling to make real-time decisions as information flows in faster than ever? This talk reveals how Python developers can harness the combined power of Apache Flink and Druid to conquer the challenges of real-time data processing and analysis.

Today's businesses demand immediate insights from ever-growing data streams. Apache Flink rises to this challenge with low-latency processing and sophisticated handling of out-of-order events, ensuring accuracy with exactly-once semantics. We'll explore Flink's Python API, focusing on its time and windowing capabilities that guarantee reliable data processing even in complex scenarios.
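To ground the windowing discussion, here is a minimal PyFlink sketch of an event-time tumbling window over a keyed stream, using a bounded-out-of-orderness watermark so late events still land in the right window. The in-memory source, record layout, and five-second lateness bound are illustrative assumptions; a real job would read from a connector such as Kafka.

```python
from pyflink.common import Types
from pyflink.common.time import Duration, Time
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows


class EventTimestampAssigner(TimestampAssigner):
    """Pulls the event-time timestamp (epoch millis) out of each record."""

    def extract_timestamp(self, value, record_timestamp):
        return value[2]


env = StreamExecutionEnvironment.get_execution_environment()
# Checkpointing underpins the exactly-once state guarantees mentioned above.
env.enable_checkpointing(10_000)  # checkpoint every 10 s

# Toy in-memory source standing in for a real stream (e.g. a Kafka topic):
# (sensor_id, reading, event_time_millis)
events = env.from_collection(
    [("sensor-1", 3, 1_000), ("sensor-2", 7, 1_500), ("sensor-1", 5, 2_500)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT(), Types.LONG()]),
)

# Tolerate events that arrive up to 5 seconds out of order.
watermarks = WatermarkStrategy.for_bounded_out_of_orderness(
    Duration.of_seconds(5)
).with_timestamp_assigner(EventTimestampAssigner())

# Per-sensor sum over 10-second event-time tumbling windows.
windowed_sums = (
    events.assign_timestamps_and_watermarks(watermarks)
    .key_by(lambda e: e[0])
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .reduce(lambda a, b: (a[0], a[1] + b[1], max(a[2], b[2])))
)

windowed_sums.print()
env.execute("event_time_windowed_sum")
```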

But Flink is more than just a pipeline. We'll showcase how it surpasses traditional solutions like Kafka, especially for complex event processing and dynamic windowing. Then, we'll introduce Apache Druid, a high-performance analytical database built for rapid queries on massive datasets. See how Flink efficiently feeds pre-processed data into Druid, transforming it into your real-time analytical engine, seamlessly integrated with your Python workflows. Dive in and discover the future of data-driven decision-making.
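One common way to wire the two systems together, though not the only one, is for Flink to write its pre-processed results to a Kafka topic that Druid then consumes with its streaming Kafka ingestion. The sketch below uses the PyFlink Table API; the topic names, broker address, and schema are assumptions, and it presumes the Flink SQL Kafka connector JAR is on the job's classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw transactions arriving on Kafka (assumed topic, broker, and schema).
t_env.execute_sql("""
    CREATE TABLE transactions (
        account_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-preprocessor',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Sink: the pre-aggregated topic that Druid ingests via its Kafka indexing service.
t_env.execute_sql("""
    CREATE TABLE enriched_transactions (
        account_id STRING,
        window_end TIMESTAMP(3),
        total_amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'enriched-transactions',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# One-minute tumbling-window aggregation keeps the data Druid sees compact and query-ready.
t_env.execute_sql("""
    INSERT INTO enriched_transactions
    SELECT account_id,
           TUMBLE_END(ts, INTERVAL '1' MINUTE) AS window_end,
           SUM(amount) AS total_amount
    FROM transactions
    GROUP BY account_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()
```

On the Druid side, a Kafka ingestion spec pointing at the enriched-transactions topic completes the path from raw stream to queryable datasource.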


Are you ready to unlock the true potential of real-time data? In this talk, we'll equip Python developers with the tools and knowledge to build high-performance, real-time applications using Apache Flink and Druid.

Here's what you'll learn:

Mastering Real-Time Data with Flink:
- Deep dive into Flink's architecture and its advantages for low-latency stream processing.
- Understand how Flink handles out-of-order events and guarantees exactly-once semantics for accurate data processing.
- Explore Flink's Python API, with a focus on its powerful time and windowing capabilities.

Beyond Kafka: Why Flink Excels:
- Compare Flink with Kafka, highlighting Flink's strengths in complex event processing and dynamic windowing.
- Discover how Flink simplifies and accelerates ETL pipelines for efficient data transformation.

Integrating with Druid for Real-Time Analytics:
- Introduction to Apache Druid, a high-performance analytical database for querying massive datasets.
- Learn how to seamlessly integrate Flink with Druid: Flink for pre-processing and feeding data, Druid for real-time analysis and exploration (a query sketch follows the use cases below).
- See how this combination empowers you to build truly insightful data-driven applications in Python.

Real-World Use Cases:
- Fraud Detection in Financial Services: Detect fraudulent transactions in real time with Flink and Druid, minimizing financial losses and enhancing security.
- Dynamic Route Optimization: Optimize routes for ride-sharing services or logistics companies on the fly, adapting to traffic conditions and demand changes in real time.
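Once Druid is ingesting the pre-processed topic, Python code can query it through Druid's SQL HTTP endpoint (/druid/v2/sql on the Broker or Router). Below is a small sketch using requests; the Broker URL, datasource name, and columns are assumptions carried over from the earlier pipeline sketch.

```python
import requests

# Assumed Broker address; Druid's SQL API accepts a JSON body with a "query" field.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"

query = """
    SELECT account_id, SUM(total_amount) AS spend
    FROM "enriched-transactions"
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY account_id
    ORDER BY spend DESC
    LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=10)
response.raise_for_status()

# Default result format is a JSON array with one object per row.
for row in response.json():
    print(row["account_id"], row["spend"])
```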
Join us to uncover how Apache Flink is redefining real-time stream processing and why it's a crucial tool for modern data-driven solutions.


Prior Knowledge Expected: Previous knowledge expected

Shekhar is deeply passionate about open source software and actively contributes to a range of projects, including SymPy, the Ruby gems daru and daru-view (which he authored), Bundler, NumPy/SciPy, and Apache projects such as Druid and Kafka.
He completed Google Summer of Code in 2016 and 2017 and has served as an admin for SciRuby and a mentor for multiple organizations.
Shekhar has spoken at prominent conferences such as RubyConf 2018, PyCon 2017, ApacheCon 2020, and Community Over Code 2024, as well as at numerous regional meetups. He currently works at Apple as a Software Development Engineer.