PyData Global 2024

Putting the data science back into LLM evaluation
12-05, 17:00–17:30 (UTC), LLM Track

Many organizations are eager to build and deploy their own large language models (LLMs), but validating them can feel like a frustrating and incomplete exercise. Fortunately, as data scientists we are already experts in model diagnostics, and we can apply the same principles to LLM validation. In this talk, I will present a scientific approach to evaluating custom text generation models in Python across dimensions such as safety, coherence, and correctness.


Large language models (LLMs) are powerful human-to-machine interfaces, but they can also produce incorrect, confusing, or dangerous output. Validating these models is notoriously difficult because they behave stochastically and must be evaluated for quality as well as accuracy.

Foundation models such as Llama, GPT, or Claude are validated using standardized benchmarks, large pools of human evaluators, and even "evaluator" LLMs. However, organizations may struggle to apply these methods to custom models trained on specific tasks or domains, because validating their outputs requires domain-specific knowledge and correctness is highly contextual.

This talk introduces a science-first approach to evaluating custom LLMs. This means going back to the basics of data science: identifying tangible evaluation metrics, creating repeatable experiments, and comparing performance across model versions. I will also discuss the importance of data quality and human feedback in the context of LLM evaluation.
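
To make the idea of a repeatable experiment concrete, here is a minimal sketch in Python. It assumes a hypothetical generate function that calls your model and a toy exact_match metric; in practice you would swap in your own client and task-specific metrics for correctness, coherence, or safety.

    import statistics

    def exact_match(response: str, expected: str) -> float:
        # Toy correctness metric: 1.0 if the response matches the expected answer exactly.
        return float(response.strip().lower() == expected.strip().lower())

    def run_experiment(generate, eval_set, model_version, temperature=0.0):
        # Run a fixed evaluation set against a single model version and record per-case scores.
        results = []
        for case in eval_set:
            response = generate(case["prompt"], model=model_version, temperature=temperature)
            results.append({
                "prompt": case["prompt"],
                "response": response,
                "score": exact_match(response, case["expected"]),
            })
        return {
            "model": model_version,
            "mean_score": statistics.mean(r["score"] for r in results),
            "results": results,
        }

    # Compare two model versions on the same fixed evaluation set (hypothetical names):
    # report_v1 = run_experiment(generate, eval_set, "my-model-v1")
    # report_v2 = run_experiment(generate, eval_set, "my-model-v2")
    # print(report_v1["mean_score"], report_v2["mean_score"])

Because the prompts, decoding parameters, and scoring are held fixed, the same run can be replayed against a new model version and the mean scores compared directly.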

This talk is for data scientists and engineers who are interested in building LLM-powered applications.

Outline

  1. The perils of LLM-based applications (5 min)
    - Correctness
    - Sensitive and PII outputs
    - Hallucinations and confabulations

  2. The problems with LLM evaluation (5 min)
    - Variable/inconsistent output
    - Response quality
    - Detecting bias and confabulations
    - Cost and latency

  3. A science-first approach to evaluating LLMs (10 min)
    - What can we automate?
    - What can't we automate?
    - Creating repeatable experiments
    - Qualitative analytics
    - Assessing human feedback: like/dislike vs. preferential voting (see the sketch after this outline)

  4. Creating an open source Python tool for evaluation (10 min)
    - Prompt management
    - Automating qualitative analytics
    - Collecting human feedback
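
To make the human feedback discussion in item 3 concrete, the following is a minimal sketch contrasting a simple like/dislike ratio with a head-to-head win rate computed from preferential (pairwise) votes. The record formats and model names are hypothetical and intentionally simplified.

    from collections import Counter

    def like_ratio(votes):
        # Simple feedback: fraction of "like" votes each model received.
        counts = Counter((v["model"], v["vote"]) for v in votes)
        models = {v["model"] for v in votes}
        return {
            m: counts[(m, "like")] / (counts[(m, "like")] + counts[(m, "dislike")])
            for m in models
        }

    def win_rate(comparisons):
        # Preferential voting: fraction of head-to-head comparisons each model wins.
        wins, totals = Counter(), Counter()
        for c in comparisons:
            wins[c["winner"]] += 1
            totals[c["model_a"]] += 1
            totals[c["model_b"]] += 1
        return {m: wins[m] / totals[m] for m in totals}

    # Toy feedback records:
    votes = [
        {"model": "v1", "vote": "like"}, {"model": "v1", "vote": "dislike"},
        {"model": "v2", "vote": "like"}, {"model": "v2", "vote": "like"},
    ]
    comparisons = [
        {"model_a": "v1", "model_b": "v2", "winner": "v2"},
        {"model_a": "v1", "model_b": "v2", "winner": "v1"},
        {"model_a": "v1", "model_b": "v2", "winner": "v2"},
    ]
    print(like_ratio(votes))      # v1: 0.5, v2: 1.0
    print(win_rate(comparisons))  # v1: 1/3, v2: 2/3

Pairwise preferences avoid asking annotators for an absolute judgment, which is often why head-to-head comparisons are easier to collect consistently than like/dislike ratings.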


Prior Knowledge Expected

No previous knowledge expected

Patrick Deziel is a Python and Go developer and machine learning specialist. He has extensive experience building practical machine learning models and integrating them into existing applications. Patrick currently works at Rotational Labs where he develops custom LLMs and AI/ML-powered APIs for business use cases. In his free time, he enjoys rock climbing and contributing to open source.