PyData Global 2024
As data grows larger and more complex, efficient storage and processing become critical to achieving scalable and high-performance computing. Blosc2 (https://www.blosc.org), a powerful meta-compressor library, addresses these challenges by enabling rapid compression and decompression of large, multidimensional arrays (NDArrays). This tutorial will introduce the core concepts of working with Blosc2, focusing on how it can be leveraged to optimize both storage and computational performance in Python.
Attendees will learn how to:
- Efficiently create and manage large NDArrays, including options for persistence.
- Select the best codecs and filters for specific data types and workflows to achieve optimal compression ratios and performance.
- Perform computations directly on compressed data to save memory and speed up processing.
- Seamlessly share NDArrays using Caterva2, a versatile library designed to enable remote sharing and serving of multidimensional datasets.
This tutorial is ideal for Python developers working with large-scale data in scientific computing, machine learning, and other data-intensive fields.
Hi! Have you ever wished your pure Python libraries were faster? Or wanted to fundamentally improve a Python library by rewriting everything in a faster language like C or Rust? Well, wish no more... NetworkX's backend dispatching mechanism redirects your plain old NetworkX function calls to a FASTER implementation present in a separate backend package by leveraging Python's entry_point specification!
NetworkX is a popular, pure Python library used for graph (aka network) analysis. But when the graph size increases (like a network of everyone in the world), NetworkX algorithms can take days to solve a simple graph analysis problem. To address these performance issues, a backend dispatching mechanism was recently developed. In this talk, we will unveil this dispatching mechanism and its implementation details, and show how to use it just by specifying a backend kwarg like this:
>>> nx.betweenness_centrality(G, backend="parallel")
or by passing the backend graph object (type-based dispatching):
>>> H = nxp.ParallelGraph(G)
>>> nx.betweenness_centrality(H)
We'll also go over the limitations of this dispatch mechanism. Then we’ll use the nx-parallel backend as a guide to understand various NetworkX backend and backend configuration features.
Finally, I'll conclude with a brief overview of how this API dispatch mechanism could be integrated into non-graph-related Python libraries, such as array-based or data-centric libraries, along with the potential challenges that may arise during integration. This will be followed by an interactive Q&A session.
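To make the two dispatch styles from the NetworkX abstract concrete, here is a minimal pure-Python sketch of a dispatching decorator. It is not NetworkX's implementation: the registry `_BACKENDS`, the `dispatchable` decorator, the `degree_sum` function, and the `FastGraph` class are all hypothetical names, and a hand-filled dictionary stands in for backend discovery through entry points.

```python
import functools

# Hypothetical registry: backend name -> {function name: implementation}.
# A real system would discover backends via entry points instead.
_BACKENDS = {}


def register_backend(name, implementations):
    _BACKENDS[name] = implementations


def dispatchable(func):
    """Route a call to a backend chosen by keyword or by argument type."""

    @functools.wraps(func)
    def wrapper(graph, *args, backend=None, **kwargs):
        # Type-based dispatch: a backend graph object names its own backend.
        name = backend or getattr(graph, "__backend__", None)
        if name is None:
            return func(graph, *args, **kwargs)  # default pure-Python path
        impl = _BACKENDS[name].get(func.__name__)
        if impl is None:
            raise NotImplementedError(f"{name!r} lacks {func.__name__}")
        return impl(graph, *args, **kwargs)

    return wrapper


@dispatchable
def degree_sum(graph):
    """Default implementation: graph is a dict of node -> set of neighbours."""
    return sum(len(neighbours) for neighbours in graph.values())


class FastGraph(dict):
    """Hypothetical backend graph type used for type-based dispatch."""

    __backend__ = "fast"


def _fast_degree_sum(graph):
    # Stand-in for an accelerated implementation; the tag shows which path ran.
    return ("fast", sum(map(len, graph.values())))


register_backend("fast", {"degree_sum": _fast_degree_sum})

G = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_sum(G))                  # default implementation
print(degree_sum(G, backend="fast"))  # keyword-based dispatch
print(degree_sum(FastGraph(G)))       # type-based dispatch
```

The user-facing call stays the same in all three cases; only the keyword or the graph type decides which implementation runs.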
As organizations increasingly adopt AI and machine learning internally, maintaining separate pipelines for ML-powered systems and conventional software becomes a burden for DevOps teams. This talk explores a unified approach to DevOps and MLOps, demonstrating how existing DevOps pipelines can be transformed into efficient MLOps pipelines using ModelKits with KitOps.
We'll begin by examining the reasons behind the traditional separation of DevOps and MLOps pipelines, including differences in project nature, required expertise, and the size and complexity of artifacts. We'll then delve into the challenges posed by separate pipelines, such as increased costs, coordination difficulties, and accumulating technical debt. Attendees will then learn how to leverage open source tooling like KitOps to create a unified pipeline that accommodates both traditional software and ML-powered projects, ultimately leading to more efficient and cost-effective operations.
skchange is a Python framework library for detecting anomalies and changepoints in time series, and for time series segmentation.
skchange is based on and extends sktime, the most widely used scikit-learn compatible framework library for learning with time series. Both packages are maintained under a permissive license, easily extensible by anyone, and interoperable with the Python data science stack.
This workshop gives a hands-on introduction to the new joint detection interface developed in skchange and sktime, for detecting point anomalies, changepoints, and segment anomalies, in unsupervised, semi-supervised, and supervised settings.
Non-Intrusive Load Monitoring (NILM) is a key technique in data-driven energy management and home automation, aimed at disaggregating energy consumption to identify active appliances in households and quantify their energy usage. This presentation:
- Provides an overview of NILM, highlighting its advantages and reviewing state-of-the-art deep learning algorithms developed for this purpose.
- Examines smart meters and IoT devices in energy systems, with a focus on the Chain2 protocol used in Italian energy systems. This event-based protocol generates low-volume data, enabling real-time energy monitoring and alerting.
- Presents examples of deep learning models trained on real-world IoT sensor data from energy meters, demonstrating their application in energy disaggregation.
This session offers an insightful overview of real-world deep learning applications in energy systems. While tailored for data scientists and data engineers interested in these fields, no prior knowledge is required. Join to explore how these technologies are driving energy optimization, cost reduction, and enhancing personal energy consumption awareness.
Large Language Models are great at writing and chatting, but are they also able to talk like a human? Today, modern LLM-based voice bots can listen to users, talk back to them with a realistic voice, handle interruptions and improvise, while sticking to the goal they're given by their builders. And this is not only true for the latest, eye-wateringly expensive OpenAI models! In this session we will learn how modern voice bots are made and which open source tools are available to build them, and we will see in practice how to build one. At the end of the session, the demo's full source code will be shared with the audience.
Streamlining clinical trial output workflows is a key challenge in clinical studies. To deliver reports to health authorities, clinical trial statisticians need to create several scripts to produce deliverables such as output datasets, tables, figures, and listings. Statisticians must also handle specific execution orders to respect dependencies between the generated datasets.
Our project leverages Python programming to automatically generate orchestration workflows from clinical trial project metadata using the Snakemake framework. Snakemake supports the execution of multiple jobs using Docker containers, facilitating multilingual orchestration. This enables our users to run end-to-end (E2E) data engineering workflows using their preferred programming languages, primarily SAS and R. Moreover, Snakemake allows parallel runs for efficient workflow management.
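The ordering problem described above (scripts that must run only after the datasets they depend on exist) can be illustrated with the standard library's `graphlib`. This is only a sketch of the dependency idea, not a Snakemake workflow, and the script names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical deliverables: each script maps to the scripts it depends on.
dependencies = {
    "adsl.sas": set(),
    "adae.sas": {"adsl.sas"},
    "table_ae.R": {"adae.sas"},
    "figure_km.R": {"adsl.sas"},
    "listing_ae.R": {"adae.sas"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    batches.append(ready)               # jobs within a batch could run in parallel
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Each batch contains only jobs whose inputs are already available, which is what makes parallel execution within a batch safe.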
Discover why the Unix command line remains a powerful and relevant tool for data scientists, even in a Python-dominated landscape. This talk will demonstrate how embracing the command line and leveraging its many tools can significantly enhance your productivity, streamline data workflows, and complement your Python skills.
This talk will cover how to use pre-trained HuggingFace models, specifically wav2vec 2.0 and WavLM, to detect audio deepfakes. These deepfakes, made possible by advanced voice cloning tools like ElevenLabs and Respeecher, present risks in areas like misinformation, fraud, and privacy violations. The session will introduce deepfake audio, discuss current trends in voice cloning, and provide a hands-on tutorial for using these transformer-based models to identify synthetic voices by spotting subtle anomalies. Participants will learn how to set up these models, analyze deepfake audio datasets, and assess detection performance, bridging the gap between speech generation and detection technologies.
We present “akimbo”, a library bringing a NumPy-like API and vector-speed processing to dataframes on the CPU or GPU. When your data is more complex than simple one-dimensional columns, this is the most natural way to perform selection, mapping and aggregations without iterating over Python objects, saving a large factor in memory and processing time.
Unlike stylized machine learning examples in textbooks and lectures, data are often not readily available to be used to train models and gain insight in real-world applications; instead, practitioners are required to collect those data themselves.
However, data annotation can be expensive (in terms of time, money, or some safety-critical conditions), thus limiting the amount of data we can possibly obtain.
(Examples include eliciting an online shopper's preference with ads at the risk of being intrusive, or conducting an expensive survey to understand the market of a given product.)
Further, not all data are created equal: some are more informative than others.
For example, a data point that is similar to one already in our training set is unlikely to give us new information; conversely, a point that is different from the data we have thus far could yield novel insight.
These considerations motivate a way for us to identify the most informative data points to label and gain knowledge in a way that makes use of our labeling budget as effectively as possible.
Bayesian experimental design (BED) formalizes this framework, leveraging the tools from Bayesian statistics and machine learning to answer the question: which data point is the most valuable that should be labeled to improve our knowledge?
This talk serves as a friendly introduction to BED including its motivation as discussed above, how it works, and how to implement it in Python.
During our discussions, we will show that interestingly, binary search, a popular algorithm in computer science, is a special case of BED.
Data scientists and ML practitioners who are interested in decision-making under uncertainty and probabilistic ML will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, and common probability distributions (normal, uniform, etc.).
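The claim that binary search is a special case of BED can be sketched in a few lines. The setup below is an assumption made for illustration: the unknown target is one of a sorted list of candidates with a uniform prior, and each query "is the target <= x?" is answered without noise. Under those assumptions the expected information gain of a query equals the entropy of its yes/no answer, so the best query splits the candidates in half. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a yes/no outcome that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def best_query(candidates):
    """Index of the query with the highest expected information gain."""
    n = len(candidates)
    # Asking about the last candidate is uninformative (the answer is always yes).
    gains = [binary_entropy((i + 1) / n) for i in range(n - 1)]
    return max(range(n - 1), key=gains.__getitem__)


def locate(target, candidates):
    """Find target in a sorted sequence by repeatedly asking the best query."""
    candidates = list(candidates)
    queries = 0
    while len(candidates) > 1:
        i = best_query(candidates)
        queries += 1
        if target <= candidates[i]:           # observe the answer
            candidates = candidates[: i + 1]  # posterior: uniform on the left part
        else:
            candidates = candidates[i + 1:]   # posterior: uniform on the right part
    return candidates[0], queries


found, n_queries = locate(37, range(64))
print(found, n_queries)
```

With noisy answers or a non-uniform prior the same selection rule no longer reduces to halving, which is where the general BED machinery becomes useful.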
Providing timely maternal healthcare in developing countries is a critical challenge. This talk demonstrates how data-driven solutions can bridge healthcare gaps and improve access to vital healthcare information for pregnant women, with user privacy in mind. To do so, we fine-tuned the Gemma-2 2 billion parameter instruction model on a synthetic dataset in order to detect whether user messages pertain to urgent or non-urgent maternal healthcare issues. By quickly identifying and prioritizing user inquiries, the model can aid help desks by ensuring urgent messages are promptly forwarded to the appropriate healthcare professionals for immediate intervention.
Enjoy some data-driven laughs with Evan Wimpey, a data and analytics comedian (and we're not just talking about his coding skills). No data topic is off-limits, so come enjoy some of the funniest jokes ever told at a data conference.*
*Note the baseline
Rapid adoption of generative AI requires ensuring your application is trustworthy. Careful experimentation and measurement are necessary for this new era of non-deterministic software. In this talk, we will take learnings from 100s of conversations across enterprise AI teams, and discuss how developers can mitigate hallucinations, better inspect their AI systems, and productionize applications with effective guardrails and evaluation checks in place.
Geospatial data is more important than ever for tackling real-world challenges like urban planning and climate change. This tutorial teaches you how to use tools like CesiumJS and Python to turn raw data into interactive 3D visuals. It’s a hands-on way to bring data to life and try to make an impact.
In the rapidly evolving field of natural language processing, the evaluation of large language models (LLMs) is crucial for understanding their performance and guiding their development. This talk delves into the two primary evaluation methodologies: reference-based and reference-less techniques.
Asynchronous programming can be intimidating for many due to its unique syntax, paradigm, and different behavior in environments like IPython and Jupyter notebooks.
But it’s not that complicated—and I'll prove it. In this talk, I will demystify the basics, along with some advanced concepts, from a practical perspective. By the end, you'll be ready to get started and implement significant performance improvements in your network or I/O-bound code.
Attend this talk if you’ve been intimidated by async and await for a while and are ready to change that.
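As a small taste of the kind of gain the abstract refers to, the sketch below runs three simulated I/O waits concurrently with `asyncio.gather`. The `fetch` coroutine is a hypothetical stand-in for a network call; `asyncio.sleep` plays the role of waiting on a socket.

```python
import asyncio
import time


async def fetch(name, delay):
    """Stand-in for a network call: sleeping yields control to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    start = time.perf_counter()
    # The three waits overlap, so the total time is close to the longest
    # delay rather than the sum of all delays.
    results = await asyncio.gather(
        fetch("a", 0.05),
        fetch("b", 0.10),
        fetch("c", 0.05),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

In IPython or a Jupyter notebook an event loop is already running, so `asyncio.run(main())` raises an error there; write `await main()` in the cell instead.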
In this 90-minute workshop, machine learning engineers and data scientists will learn practical techniques for identifying and mitigating age bias in AI-driven hiring systems. We’ll explore fairness metrics like statistical parity, counterfactual fairness, and equalized odds, and demonstrate how tools such as Fairlearn, Aequitas, and IBM Fairness 360 can be used to monitor and improve model fairness. Through hands-on exercises, participants will walk away with the skills to evaluate and de-bias models in high-risk areas like recruitment.
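Two of the metrics named above can be computed by hand before reaching for a library. The sketch below uses made-up screening outcomes and hypothetical helper names; it does not use the Fairlearn, Aequitas, or IBM Fairness 360 APIs.

```python
def selection_rate(y_pred, groups, group):
    """Share of positive predictions within one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)


def true_positive_rate(y_true, y_pred, groups, group):
    """Share of truly qualified candidates in a group who are selected."""
    hits = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1]
    return sum(hits) / len(hits)


# Hypothetical screening outcomes: 1 = advance to interview, 0 = reject.
groups = ["under_40"] * 5 + ["over_40"] * 5
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]

# Statistical parity difference: gap in selection rates between the groups.
parity_gap = (
    selection_rate(y_pred, groups, "under_40")
    - selection_rate(y_pred, groups, "over_40")
)

# One half of equalized odds: gap in true positive rates between the groups.
# (The other half compares false positive rates in the same way.)
tpr_gap = (
    true_positive_rate(y_true, y_pred, groups, "under_40")
    - true_positive_rate(y_true, y_pred, groups, "over_40")
)

print(f"statistical parity difference: {parity_gap:.2f}")
print(f"true positive rate gap: {tpr_gap:.2f}")
```

A value near zero on both gaps is the goal; here both groups have identical qualification rates, yet the older group is selected far less often.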
Plotnine is a Python library that implements the Grammar of Graphics, enabling users to create complex, layered plots. This talk covers techniques for customising your plots, using time series data as an example, and highlights how plotnine integrates with matplotlib, allowing you to enhance your data visualisations for better storytelling.
Polars boasts 18 different data types, not including variants of numerical types.
Do we really need such a vast collection of data types?
What is the use case for each type?
What is the difference between List and Array? Or between Categorical and Enum? And why on Earth would I ever need a Struct?
This talk will clear up all of these questions and more, as we go through the data types that Polars provides and understand why we need each one of them.
Data scientists in the real world have to manage messy datasets that evolve over time. New data must be added, old data must be removed and changes to columns must be handled gracefully. Furthermore, many real world datasets grow from a size that works on a laptop to a size that must run on a server. This talk will show that in Python we can meet all these challenges in a simple and scalable way using the delta-rs package to manage the data storage and Polars to read and write the dataset.
This tutorial introduces Pixeltable, which provides data-centric AI infrastructure with a declarative, incremental approach for multimodal workloads. Participants will learn to manage multimodal data (text, images, video) using Pixeltable's declarative interface. We'll cover data versioning, indexing, and orchestration through computed columns and iterators. Attendees will gain practical experience with Pixeltable's integration capabilities and custom UDFs.
Requirements: Python knowledge, basic ML concepts. Materials will be available via a GitHub repository and Google Colab notebooks.
How do you know when a user experience isn’t hitting the mark? Do you wait for it to show up in qualitative feedback? Do you have a long list of different metrics that you have to keep track of that could potentially signal a problem? When evaluating user experiences, how can you quantify if it’s a good experience or not? Additionally, how do you know if your good or bad experience is impacting other areas of the business?
These are common problems for product managers and the data scientists and analysts who support them. To solve them, I propose creating an aggregate metric that represents the effort or friction experienced by your users - a User Effort Index.
Beneath the buzz of AI breakthroughs, a quiet revolution is unfolding in the world of forecasting: foundational time series models. These models promise to change the game for operational forecasting, but don’t expect magic. You won’t suddenly become a stock market oracle just by throwing data at them.
In this talk, we’ll peel back the layers of these new time series models, starting with how they work and how they evolved from transformers. We’ll tackle the big problems of limited data and overhyped algorithms, and explore the real-world challenges that make or break forecasts (hint: human input matters).
To identify a production-ready, open-source OCR model capable of handling sensitive, non-English content with highly technical language, we evaluated the performance of available open-source OCR models in terms of accuracy, memory efficiency, and processing speed. This presentation will share our findings and key insights gained from this research.
MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction sets with controlled coverage rates for regression and classification tasks.
But it doesn't stop there - MAPIE can also be used to handle more complex tasks like time series analysis, multi-label classification, computer vision and natural language processing, ensuring probabilistic guarantees on crucial metrics.
Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.
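For readers new to conformal prediction, the core of the split conformal recipe for regression fits in a few lines of plain Python. This is a sketch of the general technique, not MAPIE's API: the `model` function, the calibration data, and the helper name `conformal_quantile` are hypothetical. The coverage guarantee assumes that calibration and new points are exchangeable.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected quantile used in split conformal prediction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in the sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def model(x):
    """Hypothetical pre-fitted point predictor."""
    return 2.0 * x


# Hypothetical held-out calibration set (not used to fit the model).
calib_x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
calib_y = [2.5, 3.0, 6.5, 8.0, 9.0, 12.5, 14.0, 17.0, 17.5]

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - model(x)) for x, y in zip(calib_x, calib_y)]
q = conformal_quantile(scores, alpha=0.2)  # target coverage: at least 80%

x_new = 10
interval = (model(x_new) - q, model(x_new) + q)
print(q, interval)
```

The interval width comes entirely from held-out residuals, which is why the approach works with any underlying model.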
This proposal aims to develop a Python curriculum for data science in multidisciplinary university education. Data science is now a trending topic in areas such as social science, finance, natural science, and many others. As a result, university students are keen to learn data science with programming languages rather than with SPSS or other traditional data analysis tools, especially for research. The proposal therefore develops a new curriculum that lets students from any discipline in higher education learn data science using current techniques and tools. Python is the core programming language because it is very widely used in the data science field. It also has many advantages: it is easy to learn and use, platform independent, and backed by a large and active community. Using Bloom’s Taxonomy as the guiding framework, we have developed a new curriculum for four-year degree programs that prepares students to succeed in a data-driven world through a multidisciplinary approach. In this curriculum, students start from basic Python programming concepts, progress to advanced analysis techniques using libraries like Pandas, NumPy, and Seaborn and platforms such as Anaconda and Google Colab, and finally build their own projects in their own discipline. Ultimately, this curriculum will support success in domain-specific applications in a data-centric society.
Keywords: Bloom’s, curriculum, multidisciplinary, python, science, taxonomy
This talk explores how to align large language models (LLMs) with human values via preference learning (PL) in the presence of challenges such as incomplete and corrupted data in preference datasets. We propose a novel method for recalibrating values to tackle these issues, enhancing LLM resilience by improving the robustness of existing models. The session highlights real-world experiments that show how the method addresses adversarial noise and unobserved comparisons, making it essential for building more reliable, ethically aligned AI systems.
To apply or not to apply, that is the question.
Causal reasoning elevates predictive outcomes by shifting from “what happened” to “what would happen if”. Yet, implementing causality can be challenging or even infeasible in some contexts. This talk explores how the very act of assessing its applicability can add value to your projects. Through a gentle introduction to causal inference tools and practical use cases, you will learn how to bring greater scientific rigour to real-world problems.
Target audience: Practicing and aspiring data scientists, machine learning engineers, and analysts looking to improve their decision-making with causal inference.
No prior knowledge is assumed.
For the seasoned practitioners I hope to shine light on aspects that may not have been considered. 💡
Can't make the talk? Read all about it in my new TDS article: 🧠🧹 Causality — Mental Hygiene for Data Science
Debugging software itself is a hard task, but debugging GPU software environments can be even more challenging. Understanding the intricate interactions between hardware, drivers, CUDA, C++ dependencies, and Python libraries can be far more complex.
In this talk we will dig into how these different layers interact and how you can address some of the common pitfalls that folks run into when configuring GPU Python environments. We will also introduce a new tool, RAPIDS Doctor, that aims to take the challenge out of ensuring your software environments are in good shape. RAPIDS Doctor checks and diagnoses environmental health issues straight from the command line, ensuring that your setup is fully functional and optimized for performance.
Learn to build powerful sensors running on low-cost microcontrollers, all in Python!
Did you know that (Micro)Python can scale all the way down to microcontrollers that have less than 1 MB of RAM and program memory? Such devices can cost just a few dollars, and are widely used to measure, log, analyze and react to physical phenomena. This enables a wide range of useful and fun applications - be it for a smart home, wearables, scientific measurements, consumer products or industrial solutions.
In this talk, we will demonstrate how to get started with MicroPython on an ESP32 microcontroller.
We will first show how to create a basic Internet-connected sensor node using simple analog/digital sensors. And then we will show how to create advanced sensors that use Digital Signal Processing and Machine Learning to analyze microphone, accelerometer or camera data.
Data quality is a crucial factor that significantly impacts the performance of machine learning models. However, many data scientists often overlook or underestimate the hidden costs associated with poor data quality. This talk will highlight common data challenges, and discuss their implications for model accuracy and reliability. Attendees will learn practical strategies to identify, assess, and improve data quality, ensuring their machine learning projects yield better results.
Traditional document processing for Retrieval-Augmented Generation (RAG) often involves cumbersome, error-prone extraction pipelines, hampering AI's ability to retrieve high-quality information from complex formats like PDFs and PowerPoint decks. ColPali disrupts this process by embedding entire pages—text, visuals, and layout—into rich, multi-vector representations using Vision Language Models (VLMs). This talk explores how ColPali, paired with multimodal models like the Llama 3.2 Vision series, enables RAG systems to “see” and reason over documents, dramatically improving retrieval performance. Attendees will learn to implement ColPali for enhanced, scalable, and robust enterprise knowledge retrieval.
Colab Notebook Link: https://colab.research.google.com/drive/1faxDHE3LdAwH7MORdnJei87Q0WF1BhS0?usp=sharing
Make a copy to your local drive to start working on this notebook.
Ever wondered how groundbreaking language models like ChatGPT and Llama were built? The answer lies in the transformer, a powerful neural network architecture. In this workshop, we'll dive deep into the inner workings of transformers, with a specific focus on the self-attention mechanism. We will guide you through the process of building one from scratch. Whether you're a beginner or an experienced practitioner, this workshop is designed to cater to all levels of expertise.
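As a preview of the self-attention mechanism mentioned above, here is a minimal single-head scaled dot-product attention written with plain lists. It is a sketch under simplifying assumptions: the token vectors are made up, and the learned query, key, and value projections of a real transformer are replaced by the identity.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, using plain lists."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token embeddings; Q, K and V are all the raw embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to one and shows how much one token attends to every token in the sequence, itself included.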
Abstract:
As the climate changes, farmers in Africa are facing enormous challenges, from unpredictable rainfall to shifting growing seasons. In this session, I will share how we can use machine learning (ML) models, built on open-source platforms like TensorFlow and Google Earth Engine, to predict crop yields for key staples such as maize and cassava. By looking at case studies from Kenya, Ghana, and Malawi, I'll show how ML is helping farmers decide when to plant, manage resources more efficiently, and reduce climate risks. I’ll also talk about practical tools—like community hubs, radio broadcasts, and SMS alerts—that ensure even non-literate farmers can use these insights. Expect to walk away with actionable ideas on how to implement these techniques in your own work on food security.
data.table is an R package with C code that is one of the most efficient open-source in-memory data manipulation packages available today. First released to CRAN by Matt Dowle in 2006, it continues to grow in popularity, and now over 1500 other CRAN packages depend on data.table. This talk will start with data reading from CSV, discuss basic and advanced data manipulation topics, and finally will end with a discussion about how you can contribute to data.table.
By unifying PySpark's robust big data processing/analyzing capability with Lance's multimodal AI data lake, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
“I like waiting for my build jobs,” said no one ever. CI is an essential part of ensuring quality, helping to highlight new issues before they might be merged into the main codebase. CI gives us confidence that the code changes being proposed don’t break things, at least as far as our tests cover. That confidence comes at the cost of time and compute resources.
The RAPIDS team at NVIDIA manages its own operations and compute resources. Those resources are limited, of course, so we wait our turn and put the toys back when we’re done. It is essential to us that we use our resources as efficiently as possible. This is the “Speed of Light” principle at NVIDIA: how close are you to a theoretical optimal limit? For CI, this involves several factors: startup wait time, Docker image setup time, cache utilization, build tool processes, and avoiding unnecessary rebuilds and retests of things that haven’t changed. The RAPIDS team set out to add telemetry to all of our builds, so that we can quantify where we are spending our time and compute resources, and ensure that we are spending them wisely. We’ll demonstrate the telemetry tools that we’re using, and show how you can add them to your build jobs.
The R Development Guide (R Dev Guide) serves as a resource for onboarding new contributors to the R project. Initially drafted in 2021 and then expanded during the Google Season of Docs 2022, the guide has evolved to make contributing more accessible, especially for newcomers. This talk will explore the latest developments in the guide, its impact on the R community, and how it fosters inclusivity within the project by simplifying the contribution process.
Feature selection is an essential process in machine learning, especially when dealing with high-dimensional datasets. It helps reduce the complexity of machine learning models, improve performance, mitigate overfitting, and decrease computation time. This talk will present a novel open source feature selection framework, shap-select.
Shap-select is noteworthy because of its simplicity - it requires only one fit of the model for which one does feature selection, and yet performs comparably to much heavier methods. It conducts a linear or logistic regression of the target on the Shapley values of the features, on the validation set, and uses the signs and significance levels of the regression coefficients to implement an efficient heuristic for feature selection in tabular regression and classification tasks.
We compare this to several other methods, showing that shap-select combines interpretability, computational efficiency, and performance, offering a robust solution for feature selection, especially for real-world cases where model fitting is computationally expensive.
Join us for an exciting keynote from Peter Wang.
Learn how to get started on your online ML journey with River, an open source Python ML library. The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements due to being able to incrementally learn from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data.
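The incremental learning idea can be shown without any library. The class below is a hypothetical stand-in, not River's implementation; its `predict_one` and `learn_one` method names simply follow the one-sample-at-a-time convention. The data stream is synthetic and noiseless.

```python
class OnlineLinearRegression:
    """Hypothetical single-feature model trained one observation at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0

    def predict_one(self, x):
        return self.weight * x + self.bias

    def learn_one(self, x, y):
        # One stochastic gradient step on the squared error for this sample.
        error = self.predict_one(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def stream(n):
    """Hypothetical data stream following y = 3x + 1, with x cycling over [0, 0.9]."""
    for i in range(n):
        x = (i % 10) / 10
        yield x, 3 * x + 1


model = OnlineLinearRegression()
abs_errors = []
for x, y in stream(2000):
    abs_errors.append(abs(model.predict_one(x) - y))  # predict first ...
    model.learn_one(x, y)                              # ... then learn

early = sum(abs_errors[:100]) / 100
late = sum(abs_errors[-100:]) / 100
print(f"mean error, first 100 samples: {early:.3f}")
print(f"mean error, last 100 samples:  {late:.3f}")
```

The model never holds more than one observation in memory, and scoring each sample before learning from it gives an honest running estimate of performance.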
In this talk, I will offer my perspective on the modern data tools landscape and in particular user-facing tools for interactive data science and data exploration. The latest trends of composable data systems and embeddable query engines like DuckDB and DataFusion create both challenges and opportunities to create a more coherent and productive stack of tools for both end user data scientists and developers building data systems.
Having worked on Kaggle's LLM-based ARC AGI program-writing challenge for 6 months using Llama3, I'll give reflections on the lessons learned making an automatic program generator, evaluating it, coming up with strong representations for the challenge, chain-of-thought and program-of-thought styles and some multi-stage critical thinking approaches. You'll get tips for tuning your own prompts and shortcuts to help you evaluate your own LLM usage with greater assurance in the face of non-deterministic outcomes.
This talk will go over an application scenario that brings together the benefits of vector search with graph traversal. Knowledge graphs (or more generally, graphs), have long been used to model structured data that capture the connection between entities in the real world. Recently, there has been a lot of interest in the topic of Graph RAG, which aims to use graphs as part of the retrieval process in RAG, to enhance the outcomes. The talk will cover a practical example to showcase how Python developers can leverage the PyData ecosystem alongside two open source, embedded databases: Kùzu for the graph component, and LanceDB for the vector component of the retrieval.
Component-based modeling systems such as Simulink and Dymola allow for building scientific models in a way that can be composed. For example, Bob can build a model of an engine, and Alice can build a model of a drive shaft, and you can then connect the two models and have a model of a car. These kinds of tools are used all throughout industrial modeling and simulation in order to allow for "separation of concerns", allowing experts to engineer their domain and compose the final digital twins with reusable scientific modules. But what about open source? In this talk we will introduce ModelingToolkit, an open source component-based modeling framework that allows for composing pre-built models and scales to large high-fidelity digital twins.
Data deduplication is a ubiquitous data quality problem that most data people will encounter at some point in their career. It happens whenever multiple records are collected about the same person or other entity without a unique identifier that ties these records together.
This talk provides beginners with everything they need to start linking and deduping large datasets using Splink, a free Python library.
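The two ingredients of most record linkage pipelines, blocking and pairwise scoring, can be sketched with the standard library alone. This is not Splink's API: the records, field weights, and threshold below are arbitrary assumptions chosen for illustration, whereas a probabilistic linkage tool estimates match weights from the data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records with no shared unique identifier.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Leeds", "dob": "1984-03-12"},
    {"id": 2, "name": "Jon Smith", "city": "Leeds", "dob": "1984-03-12"},
    {"id": 3, "name": "Joanna Smythe", "city": "Leeds", "dob": "1990-07-01"},
    {"id": 4, "name": "Jonathan Smith", "city": "York", "dob": "1971-11-30"},
]


def similarity(a, b):
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(left, right):
    """Illustrative weighted score; the 50/50 weights are an assumption."""
    name_score = similarity(left["name"], right["name"])
    dob_score = 1.0 if left["dob"] == right["dob"] else 0.0
    return 0.5 * name_score + 0.5 * dob_score


# Blocking: only compare records that share a city, to avoid all-pairs cost.
blocks = {}
for record in records:
    blocks.setdefault(record["city"], []).append(record)

matches = []
for block in blocks.values():
    for left, right in combinations(block, 2):
        if match_score(left, right) >= 0.8:
            matches.append((left["id"], right["id"]))

print(matches)
```

Blocking trades recall for speed: records 1 and 4 share a name but are never compared because they fall into different blocks.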
"What training data do you need, don't you just train on the whole internet?"
"Doesn't data production rely heavily on outsourcing to cheap labour markets in the Global South?"
"Isn't all training data just synthetic nonsense generated by LLMs nowadays, how can you expect a model to learn anything worthwhile?"
These are all questions that I regularly get when I tell people I work on building foundational LLMs. As often as we use LLMs in our daily lives nowadays, people generally know very little about the data that went into training them.
In this talk, I'll address these questions and hope to build an understanding of what it takes to build an LLM from scratch, from a data perspective.
Generative AI is revolutionizing industries by enhancing efficiency, personalization, and insight. This talk explores how a robust Python ecosystem, including Streamlit, various libraries, and APIs, is harnessed to build powerful generative AI applications. Attendees will gain insights into the practical implementation of these technologies and their transformative impact on business operations.
Anomaly detection is hardly a new problem, nor is the progress in it as rapid as the LLM blast we’re witnessing today. But it is pressing.
In this talk, we’ll talk about a realtime anomaly detection pipeline on time series data and discuss the nitty-gritties of the algorithm knobs that help us build an unbiased and reliable system, which includes 1) using NeuralProphet, an open source framework, to forecast for time series data and 2) using robust techniques to detect true anomalies using forecasting errors.
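The second step, flagging anomalies from forecasting errors, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' pipeline: the forecast values are hard-coded stand-ins for a model's output (NeuralProphet is not called), the median/MAD score is one common choice of "robust technique", and the function name `robust_anomalies` and the threshold of 3.5 are hypothetical.

```python
from statistics import median


def robust_anomalies(actual, forecast, threshold=3.5):
    """Return indices whose forecast error is extreme under a median/MAD score."""
    residuals = [a - f for a, f in zip(actual, forecast)]
    center = median(residuals)
    # Median absolute deviation: a scale estimate that outliers barely influence.
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        # MAD is zero when at least half the residuals equal the median,
        # so there is no scale to compare against.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the residuals are roughly normal.
    scores = [0.6745 * (r - center) / mad for r in residuals]
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical observations and forecasts; index 4 is a spike.
actual = [10.0, 10.5, 9.8, 10.2, 25.0, 10.1, 9.9]
forecast = [10.1, 10.3, 10.0, 10.0, 10.2, 10.0, 10.0]
anomalies = robust_anomalies(actual, forecast)
print(anomalies)
```

Using the median and MAD rather than the mean and standard deviation keeps a single large error from inflating the scale and masking itself.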
Due to its high-level syntax and powerful interactive prompt, Julia is typically used as a computational front-end language. However, there is growing interest in using Julia to develop statically-compiled libraries to be called from other languages (Python, C++, etc.). I will present recent and ongoing work happening in the Julia community to enable this use case, including building smaller binaries and static analysis tooling.
An introduction to solving combinatorial optimization and constraint satisfaction problems in Python. I will review the most popular libraries for SAT/CSP. We will then dive into a crash course on using Google's award-winning OR-tools library to efficiently solve some non-trivial real-world constrained combinatorial optimization problems.
Data Prep Kit is a new open source Python library to help you wrangle and clean your data for generative AI applications (de-duping, detecting language, removing PII, detecting malware, creating embeddings, etc.).
Learn how we built a lightning-fast search engine using Python, balancing speed, relevance, and scalability. In this session, we’ll explore our hybrid approach, blending vector search with traditional keyword indexing to deliver high quality, accurate results. Discover how we harness a high-performance NoSQL database for efficient data management and fine-tune our results with a re-ranking algorithm for top-notch accuracy.
We’ll dive into the hurdles we overcame, like ensuring data consistency in a NoSQL setup, balancing search precision and performance, and designing a scalable architecture. By the end, you’ll understand how this Python-powered engine works, its real-world applications, and the innovative solutions that set it apart.
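The abstract does not say how the keyword and vector results are blended. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the speakers' method; the document ids, the function name, and the constant `k=60` are illustrative, and the re-ranking step is not shown.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked highly
            # in several lists accumulate the largest totals.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF uses only rank positions, so it avoids having to put keyword scores and vector similarities on a common scale.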
This talk is an introduction to automatic differentiation with a focus on the Python and Julia ecosystems. We will first explain what autodiff is and how it works, then describe its various implementations in both languages. Our goal is to give everyone a good understanding of how computer code can be differentiated, while also discussing the trade-offs this differentiability entails.
Transformers are everywhere: NLP, Computer Vision, sound generation and even protein-folding. Why not in forecasting? After all, what ChatGPT does is predict the next word. Why isn't this architecture state-of-the-art in the time series domain?
In this talk, you will understand how Amazon Chronos and Salesforce's Moirai transformer-based forecasting models work, the datasets used to train them and how to evaluate them to see if they are a good fit for your use-case.
Data rules the world and data-scientists / MLEs across academia and industry are creating new and innovative ways to glean insights which have changed our lives through easy to understand and intuitive interfaces. At the heart of the AI / ML revolution ( genAI, LLMs, bioinformatics, climate science etc ) is the availability and elasticity of state of the art hardware which enables processing large swaths of data ( TBs ) that could not run on local laptops for want of compute/memory. Cloud providers have commoditized these powerful machines to the extent that they are now available to every person with a few clicks.
Cloud computing allows us to trade off upfront hardware costs for granular operational expenses such as renting GPUs by the second. Prima facie this might seem like a winning formula, but a key downside is that these costs often add up uncontrollably. Attributing the usage of such hardware to Data/AI/ML jobs across dimensions like cloud accounts, instances, and workloads, down to the lowest level of granularity, can help provide transparency not only into cost but into resource management as well.
Through our work with open-source Metaflow, which started at Netflix in 2017, we have had an opportunity to help customers place their cloud spend in the context of value produced by individual projects combined with more granular resource management to limit spend.
In this talk, we will provide an overview of the lessons we have learnt in our quest to get a better handle on costs by using Metaflow. We will share best practices to consider when writing AI/ML workloads and how constructs in the Metaflow framework can be used to answer questions Data Scientists/MLEs ask themselves, such as:
How do my cloud costs break down over time and what workloads/cloud instances are driving these costs?
Are the workloads executing tuned to allow maximum usage of these expensive resources?
How can I refactor my workloads such that the expensive resources are used to their optimal capacity?
In particular, we'll focus on best practices to follow when working with large datasets in distributed multi-cloud / cluster environments, and how Metaflow constructs can help achieve that in a human-friendly manner, with very few lines of code.
The audience will be empowered to build and deploy production-grade Data/AI/ML pipelines while learning strategies on how to optimize workloads to keep expensive ML/AI operations under control. Finally, the audience will have the tools to answer questions like “Am I using my resources to their fullest extent? If not, what are the opportunities for tuning my AI/ML jobs' resource requirements to bin-pack hardware and subsequently reduce overall costs?”
Faustream is an open-source tool I developed that bridges the gap between streaming data and real-time predictive analytics. This talk explores how Faustream leverages Python, Kafka, and Faust to handle high-velocity data streams while applying machine learning models in real-time. We'll dive into its architecture, key features, and applications, demonstrating how it can revolutionize data processing across industries.
What if deploying a Python app was as simple as a single click, and came at zero cost? With PyCafe, you can offer users live, interactive examples of your libraries or have them submit reproducible examples when reporting issues.
Built on top of Pyodide, PyCafe runs countless web frameworks (e.g. streamlit, dash, panel, gradio) directly in the browser. By making apps easy to create, share, and edit, PyCafe opens up new workflows, including possibilities we may not have even imagined yet.
As data scientists and machine learning engineers, it is crucial that we can reproduce results and seamlessly share projects across teams and stakeholders. However, differing operating systems, Python environments, package versions, and package managers often hinder reproducibility across different machines. This talk will explore how Nix can be leveraged to create reproducible work environments and how it can be a convenient tool for any Data Scientist or ML Engineer.
Julia is a high-performance language for technical computing that offers advantages like type stability, just-in-time compilation, and extensive parallel computing support. Its Machine Learning ecosystem, although having fewer options, is functional and includes packages like DataFrames.jl, Flux.jl, MLJ.jl, and SciML for various ML tasks. Additional tools cover data visualization, R compatibility, and specific ML applications. The ecosystem is comprehensive and can meet many ML researcher/professional needs. This talk provides an overview of the ecosystem, discussing both its strengths and potential areas for improvement.
Writing GPU code in Python is easier today than ever, and in this tutorial, we will cover how you can get started with accelerating your code.
You don't need to learn C++ and you don't need new development tools.
Attendees will be expected to have a general knowledge of Python and programming concepts, but no GPU experience will be necessary. Our key takeaway for attendees will be the knowledge that they don’t have to do much differently to get their code running on a GPU.
This tutorial empowers deep learning practitioners to master the entire PyTorch workflow, from efficient model creation to advanced tracking and optimization techniques. We'll begin by exploring a practical PyTorch workflow, then delve into integrating popular experiment tracking tools like MLFlow and Weights & Biases. You'll learn to log custom metrics, artifacts, and interactive visualizations, enhancing your model development process. Finally, we'll tackle hyperparameter optimization using Optuna's Bayesian search, all while maintaining meticulous experiment tracking for easy comparison and reproducibility.
By the end of the session, you'll have constructed a robust, modular pipeline for managing experiments and optimizing model performance. Whether you're new to PyTorch or an experienced data scientist looking to improve your workflow, this hands-on tutorial offers immediately applicable insights and techniques to enhance your deep learning projects across diverse domains.
Learn how to write a native Python application in the browser using WebAssembly enabled by PyScript.
The upcoming release of Apache Spark 4.0 delivers substantial enhancements that refine the functionality and augment the developer experience with the Spark unified analytics engine.
Attendees will learn how to use Apache Spark 4.0's advancements for optimized data processing and analytics.
Shiny for Python is an efficient and reactive application framework that will be able to grow with your application needs.
As your Shiny application grows, you may find yourself needing more custom behaviors and potentially reusing and sharing your custom behaviors with others.
You may also find your existing applications to be overly complex, making it hard to see the overall structure of the application.
Here are some tips on writing better Shiny Applications and leveling up your code.
What if designing data workflows felt like snapping together LEGO blocks? In this talk, we’ll explore how open-source tools enable flexible, modular PyData workflows. We’ll discuss why open source is essential for avoiding vendor lock-in and how to integrate libraries and frameworks within the Python ecosystem, alongside tools like GitHub Actions. Plus, I’ll introduce DataJourney, an open-source toolkit I developed that makes designing workflows as fun and creative as building with LEGO.
Let’s dive in!
Drowning in data? Struggling to make real-time decisions as information flows in faster than ever? This talk reveals how Python developers can harness the combined power of Apache Flink and Druid to conquer the challenges of real-time data processing and analysis.
Today's businesses demand immediate insights from ever-growing data streams. Apache Flink rises to this challenge with low-latency processing and sophisticated handling of out-of-order events, ensuring accuracy with exactly-once semantics. We'll explore Flink's Python API, focusing on its time and windowing capabilities that guarantee reliable data processing even in complex scenarios.
But Flink is more than just a pipeline. We'll showcase how it surpasses traditional solutions like Kafka, especially for complex event processing and dynamic windowing. Then, we'll introduce Apache Druid, a high-performance analytical database built for rapid queries on massive datasets. See how Flink efficiently feeds pre-processed data into Druid, transforming it into your real-time analytical engine, seamlessly integrated with your Python workflows. Dive in and discover the future of data-driven decision-making.
Changing data is hard: The computer may crash, scripts could fail, and data structures could be changing. Relational data management systems provide transactional (“ACID”) guarantees that can be immensely useful for data analysis. DuckDB provides all-or-nothing semantics for changes to datasets and is robust against failures of any kind. In this talk, we will illustrate the usefulness of DuckDB’s transactional facilities to bring sanity to changes to data analysis workflows in Python.
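The all-or-nothing pattern can be sketched as follows. To stay dependency-free, this illustration uses the standard library's `sqlite3` module as a stand-in, not DuckDB's Python client; the same explicit BEGIN / COMMIT / ROLLBACK shape is assumed to carry over. The table, the `apply_batch` helper, and the data are hypothetical.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so transactions
# are opened and closed only by the explicit statements below.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL NOT NULL)"
)
con.execute("INSERT INTO measurements VALUES (1, 0.5)")


def apply_batch(con, rows):
    """Insert every row in `rows`, or none of them if any row fails."""
    con.execute("BEGIN")
    try:
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        con.execute("COMMIT")
        return True
    except sqlite3.Error:
        # Undo the rows already inserted by this batch.
        con.execute("ROLLBACK")
        return False


def row_count(con):
    return con.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]


# The second row violates NOT NULL, so the whole batch is rolled back.
bad_ok = apply_batch(con, [(2, 1.5), (3, None)])
count_after_bad = row_count(con)

# A clean batch commits as a unit.
good_ok = apply_batch(con, [(2, 1.5), (3, 2.5)])
count_after_good = row_count(con)
print(bad_ok, count_after_bad, good_ok, count_after_good)
```

A failed script therefore leaves the table exactly as it was before the batch started, which is the property the abstract highlights for analysis workflows.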
This session is a GenAI talk, where you will learn how Knowledge Graphs, Vectors and Retrieval Augmented Generation (RAG) can support your projects.
LLMs offer powerful capabilities, but deploying them effectively in production remains a challenge for conversational AI and chatbot applications, especially when it comes to minimizing hallucinations and ensuring accurate responses. In this 90-minute hands-on tutorial, we’ll explore building conversational AI systems using CALM and Rasa. CALM (Conversational AI with Language Models) combines traditional conversational AI techniques with LLMs, separating conversational ability from business logic execution to deliver reliable, cost-efficient, and scalable solutions. Unlike approaches in which LLMs handle both sides of the conversation, CALM focuses on user understanding with predefined business logic. This approach not only accelerates development but also enhances cost efficiency, scalability and reliability. By focusing on predefined business logic with CALM, you’ll gain the ability to build sophisticated, scalable systems faster. You’ll also learn how to use fine-tuned, open-weight models, such as Llama 8B, to power your AI assistant.
Participants will learn how to use CALM for business logic and Rasa for dialogue management, with practical insights, code examples, and best practices. Materials will be provided via a GitHub repository with a GitHub Codespace for easy access and execution.
Time series analysis provides essential tools for modeling and predicting time-dependent data, especially data exhibiting seasonal patterns or serial correlation. This tutorial covers tools in the StatsModels library including seasonal decomposition and ARIMA. We'll develop the ARIMA model bottom-up, implementing it one piece at a time, and then using StatsModels. As examples, we'll look at weather data and electricity generation from renewable sources in the United States since 2004 -- but the methods we'll cover apply to many kinds of real-world time series data.
As large language models (LLMs) become increasingly integrated into industries like finance, healthcare, and law, ensuring their responsible deployment is critical—particularly in highly regulated environments. These industries face unique challenges, including data privacy, compliance with strict regulations, and minimizing the risks of biased or untrustworthy outputs.
This session will explore the complexities of using LLMs in regulated industries and present a governance framework to address these challenges. We'll cover practical solutions for deploying LLMs while adhering to industry-specific regulations, ensuring transparency, reducing bias, and maintaining data privacy. Attendees will learn how to implement governance best practices at various stages of the LLM lifecycle—from model training and validation to deployment and ongoing monitoring.
Drawing on real-world examples and lessons learned, this talk will equip data scientists, machine learning engineers, and AI leaders with actionable strategies for navigating regulatory compliance and minimizing risks, while still harnessing the full potential of LLMs to drive innovation.
Retrieval-Augmented Generation (RAG), despite being a superstar of GenAI over the last year, comes with a plethora of challenges and is prone to errors. Open Source Python libraries like RAGAS and TruLens provide frameworks for evaluating RAG systems, using various metrics that leverage LLMs to assess performance. But when using an LLM in a RAG system is in itself a source of errors, it remains to be seen how reliable it would be to use another LLM, albeit a more powerful one, as a judge of the RAG performance. This study explores various RAG evaluation metrics, as well as the choice of evaluator LLM, to examine the reliability and consistency of LLM-based evaluations. The aim is to provide practical insights and guidance for interpreting these evaluations effectively, and help users make informed decisions when applying them in diverse contexts.
A beginner level hands-on introduction to BigQuery DataFrames. Please bring your laptop! There is nothing to install in advance.
The paid search landscape is undergoing a remarkable transformation, evolving from traditional keyword-centric strategies to a more nuanced approach that prioritizes audience targeting. This shift is not just a trend; it’s a response to the ever-increasing demand for precision and effectiveness in reaching potential customers in a crowded digital marketplace.
At the forefront of this evolution is our innovative automated system designed to identify high-intent users through sophisticated batch processing of their website behaviour. By harnessing the power of machine learning, we create a dynamic layer that curates smarter audiences: those that closely resemble our most valuable converted customers. This enables us to execute precise retargeting campaigns that not only drive meaningful engagement but also optimize marketing budgets, resulting in enhanced audience selection and significantly higher conversion rates.
Knowledge graphs are excellent at representing and storing heterogeneous and interconnected information in a structured manner, effectively capturing complex relationships and attributes across different data types.
Structured text generation allows for building knowledge graphs by providing neatly structured outputs, making it an ideal method for extracting structured information.
Similarly, structured text generation enables the creation of agents by defining which tools are allowed and what action inputs are permitted.
In this talk, we first build a graph database from unstructured data and then we create an agent to query the graph database. We will show these capabilities with a demo.
GraphRAG is a popular way to use KGs to ground AI apps. Most GraphRAG tutorials use LLMs to build graphs automatically from unstructured data. However, what if you're working on use cases such as investigative journalism and sanctions compliance -- "catching bad guys" -- where transparency for decisions and evidence is required?
This talk explores how to leverage open data, open models, and open source to build investigative graphs which are accountable, exploring otherwise hidden relations in the data that indicate fraud or corruption. This illustrates techniques used in production use cases for anti-money laundering (AML), ultimate beneficial owner (UBO), rapid movement of funds (RMF), and other areas of sanctions compliance in general.
This approach uses Python open source libraries, e.g., the KùzuDB graph database and LanceDB vector database. For each NLP task we use state-of-the-art open models (mostly not LLMs) emphasizing how to tune for a domain context: named entity recognition, relation extraction, textgraph, entity linking, as well as entity resolution to merge structured data and produce a semantic overlay that organizes the graph.
The pursuit of Artificial General Intelligence (AGI) is significantly constrained by computing resources, especially during the pretraining stages of LLMs. One emerging approach to reduce such reliance, and democratise access to AI development, is a paradigm shift from “Model over Data” to "Model over Model" (MoM).
The Birds of a Feather (BoF) format fosters open discussions around topics proposed by participants.
Doing geoscience is hard. It’s even harder if you have to figure out how to handle large amounts of data!
Xarray is an open-source Python library designed to simplify the handling of labeled multi-dimensional arrays, like raster geospatial data, making it a favorite among geoscientists. It allows these scientists to easily express their computations, and is backed by Dask, a Python library for parallel and distributed computing, to scale computations to entire clusters of machines.
People love using Xarray on Dask for geospatial workloads, but only up to about the terabyte scale. At this point, the stack can struggle, requiring expertise to work well and frustrating users and developers alike.
To address this and enable the Dask ❤️ Xarray stack to scale to hundreds of terabytes, we have recently designed a suite of large-scale geospatial benchmarks. With the help of these benchmarks, we are able to understand what limits performance within Dask and Xarray, and to address these issues.
In this talk, we will explore how Dask integrates with libraries like Xarray and Zarr to scale geospatial workloads and other multi-dimensional array computations.
We will also dive deeper into some of the bottlenecks in the Dask ❤️ Xarray stack that our benchmarks revealed, as well as some of the recent improvements we have made in these areas. With the help of our benchmark suite, we then assess the impact of these changes.
Join us to discover how Dask helps you scale geoscience workloads from your laptop to the cloud.
This hands-on tutorial guides participants through the process of constructing the essential components of a Machine Learning Platform (MLP) from scratch. We'll focus on implementing five core elements: a feature store, model registry, orchestrator, inference engine, and basic monitoring system. The session emphasizes practical, hands-on coding using Test-Driven Development (TDD), Domain Driven Design, and hexagonal architecture principles providing attendees with a functional foundation for a robust ML infrastructure.
In partnership with the Department for Environment, Food and Rural Affairs (DEFRA), Datacove developed a bespoke Shiny dashboard designed to enhance decision-making in the areas of Health and Wellbeing, Nature, and Sustainability (HWNS). This presentation explores three key aspects: project and data management, customisation, and usability enhancements in R.
The project began with careful project and data management, ensuring alignment with government needs by connecting to APIs and consolidating HWNS data. In the second phase, the dashboard was seamlessly integrated into existing platforms using custom CSS and JavaScript to maintain visual consistency. Lastly, user-friendly design principles and interactive tools like Leaflet and Highcharts were employed to make the dashboard accessible to all users, regardless of their data literacy, promoting informed and localised decision-making. This collaboration with DEFRA sets a foundation for future policy innovations within the HWNS sector.
9 out of 10 engineers will recommend the use of evaluation tools for their LLMs, but admit they only trust eyeballing responses to decide whether it's safe to use. The 10th carefully studies the floor in silence.
This talk is for engineers, developers or applied researchers who may or may not know of evaluation tools and metrics, but either way benefit from an overview of different risks in applications using LLMs for text generation, Open Source libraries they can use to mitigate these risks, and examples of how to use them.
In this talk, we will explore Judea Pearl’s causal ladder (association, intervention, and counterfactuals) through the lens of a simple demand forecasting model. Using real-world business scenarios, I will demonstrate how to move beyond correlation-based predictions to more actionable decisions using PyMC’s causal inference tools. Attendees will learn how to make forecasts for natural business conditions, simulate the effects of strategic changes (like increased advertising spend), and evaluate the causal impact of past price promotion with retrodictive causal inference.
Target audience: Data scientists, machine learning engineers, and business analysts looking to improve their decision-making using causal inference.
Zoom Link: https://numfocus-org.zoom.us/j/88260275885?pwd=RW9FKYZs4uzjJHRgNn7CGOL1sVgAaH.1
Ever wanted to contribute to open source but weren't sure where to start?
In this event, we'll contribute to Narwhals, a lightweight compatibility layer between dataframes. You'll be mentored by the project's developers, and by the end of the session, you'll very likely have submitted your own pull request!
There will be plenty of issues to work on, for both beginner and advanced contributors.
Taking any project from zero to production is challenging. And Data Science has a particularly high failure rate, with a lot of ideas not getting beyond the prototype stage.
But there are real reasons for this: there is intrinsic and unknown complexity in data, and there are often big challenges knowing if we have actually solved the problem -- the answer is so rarely "yes" or "no".
In this talk I'll cover some key learnings from a decade working on DS problems at early- and later-stage startups, building products to improve product market fit.
Join us to celebrate the innovative minds behind NumHack 2024!
Many organizations are eager to build and deploy their own large language models (LLMs), but validating them can feel frustrating and incomplete. Fortunately, as data scientists we are experts in model diagnostics, and we can extend these same principles to LLM validation. In this talk, I will present a scientific approach to evaluating custom text generation models in Python across several dimensions such as safety, coherence, and correctness.
This talk will explain how to solve business forecasting problems using time series methods. Time series forecasting remains a specialty topic. Because of this, you really want to use a package tuned for your use case and specialized to deal with the difficulties inherent in time series forecasting. I will share a simplified problem notation that helps you select between time series packages in R and Python.
nvmath-python is a new way of delivering NVIDIA accelerated Math Libraries to Python users: researchers and practitioners, library and framework developers, and optimized GPU kernel developers. In this talk we will provide an introduction to the library's design goals and architecture, and an overview of its key features along with usage examples.
Tenova, as an innovative engineering company, collaborates closely with its client-partners to create advanced technologies and services that optimize business operations.
This talk discusses the deployment of our image recognition system to identify and mitigate potential hazards in steel plants, specifically focusing on the detection of bulky steel pieces.
The system was deployed on-premise using an edge device and an IP camera, supported by Azure IoT Edge and a Flask API for image processing and prediction.
A recent migration to a RabbitMQ-based architecture using Pika enhanced scalability and communication.
The presentation will cover technical strategies, the challenges (like offline functionality and real-time, low-latency hazard detection) and the positive impact of the system on workplace safety and operational efficiency.
The goal of this tutorial is to make Gaussian processes (GPs) useful. In most practicing data scientists' mental map of modeling and machine learning techniques, Gaussian processes are an advanced approach that sits alone on an island, perhaps with narrow use cases like Bayesian optimization. Most books and other material on GPs tend to focus on theoretical aspects, and it can be hard to close the gap between the theory and putting those ideas into practice to solve real problems in a reasonable amount of time.
This tutorial is split into two parts. The first part introduces Bayesian modeling, focusing on hierarchical modeling and the concept of partial pooling. We’ll use the classic example of estimating the batting average of a group of baseball players as motivation. Then we’ll introduce GPs as a useful generalization of hierarchical modeling for the common situation where our groups aren’t distinct categories. Instead of thinking of each baseball player as a completely distinct and exchangeable entity, we can use a GP to partially pool information locally by also considering each player's age. Finally, we’ll close the first part by connecting back to the more common introduction to GPs as infinite dimensional multivariate normals.
The second part of the tutorial will give an overview of practical tips and tricks for modeling with GPs using the open source Python package PyMC. Specifically, we’ll cover how to address the two big issues with using GPs in practice: scaling and identifiability. We’ll discuss useful approximations like the HSGP and when to apply them, advice on when to use splines, and finally when you need to step out of a PPL like PyMC or Stan to a GP specific library like GPFlow or GPyTorch. We’ll do so with a couple of motivating examples. The audience should have some familiarity with basic ML and statistics concepts, such as probability distributions, normal and multivariate normal distributions, correlation and covariance, and linear regression - but the talk will aim to be non-technical and the goal will be to introduce GPs and give people the tools they need to use them effectively in practice.
This talk will tell the tale of how we migrated a data application from Streamlit to Panel. And what it took to scale from 100 users to 2000+ users in less than 2 months. It's a story of pain, Kubernetes, resilience, and a whole lot of Python.
This talk focuses on the underrepresentation of women in AI and data science, where only 22% of AI professionals are women. We will explore how addressing the missing 78% is critical to creating inclusive, innovative solutions that benefit society as a whole. Attendees will learn about the current challenges women face, the importance of diverse perspectives in AI development, and actionable strategies for empowering women in the field through community engagement, mentorship, and data-driven policies.
Have you ever wanted to understand LLM internals such as pre-training, supervised fine-tuning, instruction-tuning, reinforcement learning with human feedback, parameter efficient fine-tuning, expanding LLM context lengths, attention mechanism variants, model deployment performance, and cost optimization, which GPUs to use when and more? This talk will take an end-to-end review of the LLM training and deployment pipeline to give you both a stronger intuition and a faster path to implementation using model training and deployment frameworks.
In this talk we present the OS library Burr -- a tool that makes it easier to build reliable, production-ready AI applications and agents. We will show how to use Burr to address a host of production concerns problems including generating test data from prior runs, interactive debugging, persisting/loading application state, and more.
Today we will learn how to build an application around sensor data, REST feeds, weather data, traffic cameras and vector data. We will write a simple Python application to collect various structured, semi-structured and unstructured data. We will process, enrich, augment and vectorize this data and insert it into a vector database to be used for semantic hybrid search and filtering. We will then build a Jupyter notebook to analyze, query and return this data.
Along the way we will learn the basics of vector databases and Milvus. While building the application we will see the practical reasons for choosing particular indexes, what to vectorize, and how to query multiple vectors even when one is an image and one is text. We will see why we do filtering. We will then use our vector database of air quality readings to feed our LLM and get proper answers to air quality questions. I will show you all the steps to build a RAG application with Milvus, LangChain, Ollama, Python and air quality reports. Finally, after the demos, I will answer questions and provide the source code and additional resources, including articles.
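The filter-then-rank idea mentioned above can be sketched without any database. The following is a minimal, pure-Python illustration and not the Milvus API: the records, the field names (`city`, `vector`), and the `filtered_search` helper are all hypothetical, and the tiny hand-written vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(records, query_vector, city, top_k=2):
    """Apply a scalar filter first, then rank the survivors by similarity."""
    candidates = [r for r in records if r["city"] == city]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical air quality readings with made-up 3-dimensional "embeddings".
records = [
    {"id": 1, "city": "Boston", "vector": [1.0, 0.0, 0.0]},
    {"id": 2, "city": "Boston", "vector": [0.0, 1.0, 0.0]},
    {"id": 3, "city": "Newark", "vector": [1.0, 0.0, 0.0]},
    {"id": 4, "city": "Boston", "vector": [0.9, 0.1, 0.0]},
]

results = filtered_search(records, query_vector=[1.0, 0.0, 0.0], city="Boston")
result_ids = [r["id"] for r in results]
print(result_ids)
```

Record 3 matches the query vector exactly but is excluded by the city filter, which is the reason for filtering before ranking: the metadata condition narrows the candidate set so that similarity is only computed over relevant rows.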
Vector databases are everywhere, powering LLMs. But indexing embeddings in bulk, especially multivector embeddings like ColPali and ColBERT, is memory intensive. Vector streaming solves this problem by parallelizing parsing, chunking, and embedding generation, and by indexing the results continuously, chunk by chunk, instead of in bulk. This not only increases speed but also makes the whole task more memory efficient.
The library supports many vector databases, such as Pinecone, Weaviate, and Elastic.
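As a rough illustration of the chunk-by-chunk idea, the sketch below overlaps chunking with embedding and indexing using a bounded queue from the Python standard library. It is not the library's implementation: `fake_embed`, the in-memory `index` list, and the chunk size are hypothetical stand-ins for a real embedding model and vector database.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def chunk_text(text, size):
    """Yield fixed-size word chunks lazily instead of building a full list."""
    words = text.split()
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])


def fake_embed(chunk):
    """Hypothetical stand-in for an embedding model: returns a 2-d vector."""
    return [float(len(chunk)), float(len(chunk.split()))]


def producer(documents, out_queue, chunk_size):
    """Parse and chunk documents, pushing chunks into a bounded queue."""
    for doc in documents:
        for chunk in chunk_text(doc, chunk_size):
            out_queue.put(chunk)  # blocks while the buffer is full
    out_queue.put(SENTINEL)


def consumer(in_queue, index):
    """Embed each chunk as it arrives and append it to the index."""
    while True:
        chunk = in_queue.get()
        if chunk is SENTINEL:
            break
        index.append((chunk, fake_embed(chunk)))


documents = ["one two three four five", "six seven eight"]
buffer = queue.Queue(maxsize=2)  # bounds how many chunks wait in memory
index = []  # hypothetical stand-in for a vector database

worker = threading.Thread(target=producer, args=(documents, buffer, 2))
worker.start()
consumer(buffer, index)
worker.join()

indexed_chunks = [chunk for chunk, _ in index]
print(indexed_chunks)
```

The bounded queue is the design point here: at most two chunks are buffered at any time, so memory use stays flat regardless of how many documents are processed. Python threads give overlap between stages rather than true CPU parallelism, so a production system would likely use native code or separate processes for the embedding step.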
CSP is a newly open-sourced library for stream processing in Python. In this talk, we discuss how CSP can be leveraged to handle all stages of an online machine learning pipeline from feature generation to live training and inference.
This talk showcases and exemplifies the rapid specification and execution of Quantile Regression workflows. Various use cases are discussed, including fitting, outlier detection, conditional CDFs, and simulations, using different types of time series data.
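For readers new to the method, quantile regression fits a model by minimizing the pinball (quantile) loss rather than squared error. The sketch below is a generic illustration and not the workflow presented in the talk: it uses made-up data and a hypothetical `best_constant` helper to show that the constant minimizing the pinball loss at level 0.5 is the median, while a higher level pulls the fit toward the upper tail.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss of predictions y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Pick the observed value that minimizes pinball loss as a constant fit."""
    return min(
        y_true,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), q),
    )


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up series with one outlier
median_fit = best_constant(data, 0.5)
high_fit = best_constant(data, 0.9)
print(median_fit, high_fit)
```

The outlier at 100 does not move the 0.5-level fit, which is one reason quantile fits are useful for outlier detection: points far outside a fitted low or high quantile curve can be flagged directly.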
DuckDB is revolutionizing data processing by enabling in-memory OLAP SQL operations with a lightweight, dependency-free architecture. This talk explores how DuckDB can be leveraged to handle large-scale, massively parallel data processing, ranging from hundreds of gigabytes to terabytes, outside traditional SQL and Spark warehouse systems. We will go over its integration with the Python ecosystem and demonstrate its scaling potential using cloud compute.
This talk will uncover the power of AI in combating Amazon deforestation through an innovative cattle detection system. We present a cutting-edge approach to monitoring illegal ranching, a primary driver of deforestation, using very high-resolution satellite imagery and deep learning. We'll dive into the unique challenges of detecting cattle from space – from congested scenes with small, clustered targets to diverse and cluttered backgrounds – and how we overcame them with a two-step neural network approach. By combining classification and density estimation techniques, our model efficiently identifies potential cattle locations and estimates herd sizes across varied landscapes. Discover how this interdisciplinary project, developed in collaboration with Brazilian prosecutors, leverages data science to drive real-world impact in environmental conservation and sustainable land management. Join us to explore the intersection of computer vision, geospatial analysis, and environmental advocacy, and learn how AI can be a powerful tool in the fight against deforestation in the Amazon and beyond.
The Birds of a Feather (BoF) format fosters open discussions around topics proposed by participants.