PyData Global 2024

Evaluating RAGs: On the correctness and coherence of Open Source eval metrics
12-05, 14:00–14:30 (UTC), General Track

Retrieval-Augmented Generation (RAG), despite being a superstar of GenAI over the last year, comes with a plethora of challenges and is prone to errors. Open Source Python libraries like RAGAS and TruLens provide frameworks for evaluating RAG systems, using various metrics that leverage LLMs to assess performance. But when using an LLM in a RAG system is itself a source of errors, it remains to be seen how reliable it is to use another LLM, albeit a more powerful one, as a judge of RAG performance. This study explores various RAG evaluation metrics, as well as the choice of evaluator LLM, to examine the reliability and consistency of LLM-based evaluations. The aim is to provide practical insights and guidance for interpreting these evaluations effectively, and to help users make informed decisions when applying them in diverse contexts.



Prior Knowledge Expected

Previous knowledge expected

Nour leads the Generative AI technical group at Modus Create. She has a PhD in Machine Learning and has worked on Machine Learning, Data Science and Data Engineering problems in various domains, both inside and outside Academia.

Having spent most of his career as an academic mathematician, Joe made
the leap to full-time software development in 2022. An open-source
hobbyist for decades, he discovered Rust before it hit 1.0 and fell in love
(or, at least, infatuation). Although he is capable of getting things done,
Joe also likes to talk and learn about math, software, and the connections
between them. If you have an hour to spare, try asking him about the math
behind soap bubble clusters.

Joe spends most of his non-working hours shuttling his daughters around in
his bakfiets. People in Texas (where he lives) find this odd, but they seem
to understand when he points out that it's basically the bicycle equivalent
of an F-150.