PyData Global 2024

It's in the Air Tonight. Sensor Data in RAG
12-05, 18:30–20:00 (UTC), General Track

Today we will learn how to build an application around sensor data, REST feeds, weather data, traffic cameras, and vector data. We will write a simple Python application to collect structured, semi-structured, and unstructured data, then process, enrich, augment, and vectorize it before inserting it into a vector database for semantic hybrid search and filtering. We will then build a Jupyter notebook to analyze, query, and return this data.

Along the way we will learn the basics of vector databases and Milvus. While building the application we will see the practical reasons behind our choices: which indexes make sense, what to vectorize, how to query multiple vectors even when one is an image and one is text, and why scalar filtering matters. We will then use our vector database of air quality readings to feed our LLM and get grounded answers to air quality questions. I will show you all the steps to build a RAG application with Milvus, LangChain, Ollama, Python, and air quality reports. Finally, after the demos I will answer questions and provide the source code and additional resources, including articles.
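
As a preview, here is a minimal sketch of the retrieval-and-answer step of such a RAG application. It assumes a local Milvus instance with an already populated collection of air quality reports (named "airquality" here for illustration), Ollama serving the nomic-embed-text and llama3.2 models, and the langchain-milvus and langchain-ollama packages; the collection, model, and prompt names are assumptions, not the exact code from the talk.

```python
from langchain_core.prompts import PromptTemplate
from langchain_milvus import Milvus
from langchain_ollama import OllamaEmbeddings, OllamaLLM

# Embed questions the same way the air quality reports were embedded at ingest time.
embeddings = OllamaEmbeddings(model="nomic-embed-text")

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "http://localhost:19530"},  # default Milvus standalone endpoint
    collection_name="airquality",                       # hypothetical collection of air quality reports
)

llm = OllamaLLM(model="llama3.2")
prompt = PromptTemplate.from_template(
    "Answer the question using only this air quality context:\n{context}\n\nQuestion: {question}"
)

def ask(question: str) -> str:
    # Retrieve the closest readings, then ground the model's answer in them.
    docs = vector_store.similarity_search(question, k=5)
    context = "\n".join(doc.page_content for doc in docs)
    return llm.invoke(prompt.format(context=context, question=question))

print(ask("Is the air quality unhealthy for sensitive groups right now?"))
```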


Goal of this Application
In this application, we will build an advanced data model and use it for ingestion and various search options. For the notebook portion, we will:

1️⃣ Ingest Data Fields, Enrich Data With Lookups, and Format:
Learn to ingest data from sources including JSON and images, then format and transform it to optimize hybrid searches. This is done inside the streetcams.py application.

2️⃣ Store Data into Milvus:
Learn to store data in Milvus, an efficient vector database designed for high-speed similarity searches and AI applications. In this step we optimize the data model with scalar fields and multiple vector fields -- one for the text and one for the camera image. We do this in the streetcams.py application; a sketch of the collection follows this list.

3️⃣ Use Open-Source Models for Data Queries in a Hybrid Multi-Modal, Multi-Vector Search:
Discover how to use scalar fields and multiple vectors to query data stored in Milvus and re-rank the final results in this notebook; a query sketch also follows this list.

4️⃣ Display Resulting Text and Images:
Build a quick output for validation and checking in this notebook.
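
To make steps 1️⃣ and 2️⃣ concrete, here is a minimal sketch of what the collection built by streetcams.py could look like: scalar fields for filtering plus two vector fields, one for the report text and one for the camera image. The field names, vector dimensions, and the embed_text/embed_image helpers are illustrative assumptions, not the talk's exact schema.

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")  # default Milvus standalone endpoint

# Scalar fields for filtering plus one vector per modality (text and camera image).
schema = client.create_schema(auto_id=True, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="location", datatype=DataType.VARCHAR, max_length=256)
schema.add_field(field_name="aqi", datatype=DataType.INT32)
schema.add_field(field_name="details", datatype=DataType.VARCHAR, max_length=4096)
schema.add_field(field_name="details_vector", datatype=DataType.FLOAT_VECTOR, dim=384)  # text embedding (assumed dim)
schema.add_field(field_name="image_vector", datatype=DataType.FLOAT_VECTOR, dim=512)    # image embedding (assumed dim)

index_params = client.prepare_index_params()
index_params.add_index(field_name="details_vector", index_type="AUTOINDEX", metric_type="COSINE")
index_params.add_index(field_name="image_vector", index_type="AUTOINDEX", metric_type="COSINE")

client.create_collection(collection_name="streetcams", schema=schema, index_params=index_params)

# One enriched reading: scalar metadata plus both vectors.
client.insert(collection_name="streetcams", data=[{
    "location": "New York, NY",
    "aqi": 42,
    "details": "Moderate air quality near a busy intersection.",
    "details_vector": embed_text("Moderate air quality near a busy intersection."),  # hypothetical text-embedding helper
    "image_vector": embed_image("camera_frame.jpg"),                                 # hypothetical image-embedding helper
}])
```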
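
And for steps 3️⃣ and 4️⃣, a minimal sketch of the hybrid, multi-vector query with re-ranking and a quick display of the results, again assuming the collection above and the same hypothetical embedding helpers.

```python
from pymilvus import AnnSearchRequest, MilvusClient, RRFRanker

client = MilvusClient(uri="http://localhost:19530")

# One ANN request per vector field: a text query and an image query.
text_req = AnnSearchRequest(
    data=[embed_text("poor air quality near heavy traffic")],  # hypothetical text-embedding helper
    anns_field="details_vector",
    param={"metric_type": "COSINE"},
    limit=10,
    expr="aqi > 50",  # scalar filtering applied alongside the vector search
)
image_req = AnnSearchRequest(
    data=[embed_image("query_frame.jpg")],  # hypothetical image-embedding helper
    anns_field="image_vector",
    param={"metric_type": "COSINE"},
    limit=10,
)

# Reciprocal Rank Fusion merges the two result lists into one re-ranked list.
results = client.hybrid_search(
    collection_name="streetcams",
    reqs=[text_req, image_req],
    ranker=RRFRanker(),
    limit=5,
    output_fields=["location", "aqi", "details"],
)

# Quick output for validation: distance plus the scalar fields we asked for.
for hit in results[0]:
    print(hit["distance"], hit["entity"]["location"], hit["entity"]["aqi"], hit["entity"]["details"])
```

RRF is a reasonable default here because the two vector fields come from different embedding models, so combining ranks is safer than combining raw scores; a weighted ranker could be swapped in if one modality should dominate.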


Prior Knowledge Expected

No previous knowledge expected

Tim Spann is a Principal who works with Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Milvus, Generative AI, HuggingFace, Python, Java, Apache NiFi, Apache Spark, Big Data, IoT, Cloud, AI/DL, Machine Learning, and Deep Learning. Tim has over ten years of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal Developer Advocate at Zilliz, Principal Developer Advocate at Cloudera, Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal, and a Senior Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit, and many more. He holds a BS and an MS in Computer Science.