Nour El Mawass
Nour leads the Generative AI technical group at Modus Create. She holds a PhD in Machine Learning and has worked on Machine Learning, Data Science, and Data Engineering problems in various domains, both inside and outside academia.
Sessions
Retrieval-Augmented Generation (RAG), despite being a superstar of GenAI over the last year, comes with a plethora of challenges and is prone to errors. Open-source Python libraries like RAGAS and TruLens provide frameworks for evaluating RAG systems, using various metrics that leverage LLMs to assess performance. But when the LLM in a RAG system is itself a source of errors, it remains to be seen how reliable it is to use another LLM, albeit a more powerful one, as a judge of that system's performance. This study explores various RAG evaluation metrics, as well as the choice of evaluator LLM, to examine the reliability and consistency of LLM-based evaluations. The aim is to provide practical insights and guidance for interpreting these evaluations effectively, and to help users make informed decisions when applying them in diverse contexts.
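To make the kind of LLM-based evaluation discussed here concrete, the sketch below scores a single hypothetical RAG output with RAGAS. The sample question, answer, contexts, and ground truth are invented for illustration; the code follows the 0.1.x-style RAGAS interface (column names and metric objects may differ across versions), and by default RAGAS calls an OpenAI model to compute each metric, so OPENAI_API_KEY must be set.

    # Minimal sketch: scoring one RAG output with RAGAS (0.1.x-style API).
    # All sample data is hypothetical. Each metric is computed by an LLM
    # judge (OpenAI by default), so an API key is required.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,        # is the answer grounded in the retrieved contexts?
        answer_relevancy,    # does the answer actually address the question?
        context_precision,   # are the retrieved contexts relevant to the question?
        context_recall,      # do the contexts cover the ground-truth answer?
    )

    # One hypothetical RAG interaction: question, generated answer,
    # retrieved contexts, and a reference (ground-truth) answer.
    data = {
        "question": ["What is Retrieval-Augmented Generation?"],
        "answer": [
            "RAG augments an LLM's prompt with documents retrieved from an "
            "external knowledge base before generating an answer."
        ],
        "contexts": [[
            "Retrieval-Augmented Generation (RAG) retrieves relevant documents "
            "and adds them to the model's context to ground its responses."
        ]],
        "ground_truth": [
            "RAG combines document retrieval with LLM generation so that "
            "answers are grounded in external sources."
        ],
    }

    result = evaluate(
        Dataset.from_dict(data),
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(result)  # per-metric scores in [0, 1], averaged over the dataset

Note that because every score above is itself produced by an LLM judge, rerunning the same evaluation can yield different numbers; that variability is precisely the reliability and consistency question this session examines.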