PyData Global 2024

Trustworthy LLMs: Vibe checks are not all you need
12-05, 16:00–16:30 (UTC), LLM Track

9 out of 10 engineers will recommend the use of evaluation tools for their LLMs, but admit they only trust eyeballing responses to decide whether their application is safe to use. The 10th carefully studies the floor in silence.

This talk is for engineers, developers, and applied researchers who may or may not already know of evaluation tools and metrics. Either way, they will benefit from an overview of the different risks in applications that use LLMs for text generation, of the Open Source libraries they can use to mitigate these risks, and of examples of how to use them.


We've all heard about hallucinations, jailbreaks, and other not-so-fun ways LLMs break. The good news is that there are Open Source tools to mitigate them.

This session is for developers who are looking to take their applications to production, want to be systematic about model selection, and sleep better knowing they use Open Source tools that anyone can audit and flag issues in.

You will see an overview of risks to consider in your LLM-enabled application and recommended techniques to assess how vulnerable your model is. You will also take away practical examples based only on Open Source software.

The talk assumes some familiarity with how LLMs work and some coding background - not necessarily in Machine Learning. If you understand why evaluating generative AI models is hard, but cannot code BLEU from scratch, you will fit right in. If you can write up BLEU, but have never heard of "red-teaming", this is also for you. If you wonder why any of this matters - the first few minutes are for you.
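For orientation, here is a minimal sketch of what "coding BLEU from scratch" means: a sentence-level BLEU in plain Python, assuming a single reference and no smoothing (a zero n-gram precision yields a score of 0). It is illustrative only; in practice you would reach for an audited Open Source implementation such as sacreBLEU.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Assumes tokenized input, a single non-empty reference, and
    no smoothing -- any zero precision makes the score 0."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # counts clipped by reference
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:
        return 0.0

    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, e.g. `bleu("the cat sat on the mat".split(), "the cat sat on the mat".split())`, while fully disjoint sentences score 0.0.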


Prior Knowledge Expected: Previous knowledge expected

Irina is an ML Engineer, specialised in Computer Vision and NLP, and seasoned in different industries: from optical biopsy systems in France, to Augmented Reality apps in German startups, to leading AI Engineering teams at Siemens Mobility. She has now joined mozilla.ai on its journey to add transparency and safety to Generative AI through Open Source Software.

Even more than waking up Skynet, she's worried about Natural Intelligence and its decisions over our data.