PyData Global 2024

The Hidden Costs of Data Quality - Tackling Common Data Challenges in ML
12-04, 12:30–13:00 (UTC), Data/ Data Science Track

Data quality significantly affects the performance of machine learning models, yet data scientists often overlook or underestimate the hidden costs of poor data quality. This talk highlights common data challenges and discusses their implications for model accuracy and reliability. Attendees will learn practical strategies to identify, assess, and improve data quality, so that their machine learning projects yield better results.


In the world of machine learning, data is the foundation upon which models are built. Unfortunately, data is often messy and unreliable, leading to significant challenges that can derail even the best-intentioned projects. This session will dive deep into the common data quality issues that practitioners encounter, such as missing data, outliers, and formatting inconsistencies.
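
As a rough illustration of the three issue types named above, the following sketch checks a small, made-up set of records for missing values, outliers, and formatting inconsistencies. The records, field names, and helper functions are all hypothetical, and the 1.5 x IQR outlier rule is an assumption chosen for the example, not necessarily the approach covered in the talk.

```python
from statistics import quantiles

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"age": 34, "country": "US"},
    {"age": None, "country": "us "},
    {"age": 29, "country": "Us"},
    {"age": 31, "country": "DE"},
    {"age": 240, "country": "de"},
    {"age": 36, "country": "DE"},
    {"age": 33, "country": "FR"},
    {"age": 30, "country": "FR"},
    {"age": 35, "country": "US"},
]


def missing_count(rows, field):
    """Count rows where the field is absent or None."""
    return sum(1 for row in rows if row.get(field) is None)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def inconsistent_spellings(rows, field):
    """Group raw spellings that collapse to the same normalized value."""
    groups = {}
    for row in rows:
        raw = row.get(field)
        if raw is None:
            continue
        # Normalization here is just strip + uppercase (an assumption).
        groups.setdefault(raw.strip().upper(), set()).add(raw)
    # Keep only values that appear under more than one raw spelling.
    return {key: seen for key, seen in groups.items() if len(seen) > 1}


ages = [r["age"] for r in records if r["age"] is not None]
n_missing = missing_count(records, "age")
outliers = iqr_outliers(ages)
variants = inconsistent_spellings(records, "country")

print("missing ages:", n_missing)
print("age outliers:", outliers)
print("inconsistent country spellings:", variants)
```

Checks like these only flag candidates for review. Whether a flagged value is an error or a legitimate extreme still depends on domain knowledge.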

We'll explore the hidden costs of these data issues, including wasted time, inaccurate models, and the potential for poor decision-making based on flawed analysis. By understanding the importance of data quality, attendees will be better equipped to tackle these challenges head-on.
Through real-world examples and case studies, participants will see firsthand how addressing data quality can lead to more reliable models and successful projects.

Talk Outline:
Introduction to Data Quality in Machine Learning (5 minutes)
Understanding the Hidden Costs (5 minutes)
Common Data Challenges and Solutions (5 minutes)
Best Practices (5 minutes)
Case Studies and Real-World Applications (5 minutes)
Q&A (5 minutes)


Prior Knowledge Expected

Previous knowledge expected

Kalyan is a Data and AI scientist and a former data science and analytics manager whose work spans both academia and industry. He is a community leader and an active contributor to the Python, data science, and scientific communities.