12-04, 16:30–17:00 (UTC), LLM Track
"What training data do you need, don't you just train on the whole internet?"
"Doesn't data production rely heavily on outsourcing to cheap labour markets in the Global South?"
"Isn't all training data just synthetic nonsense generated by LLMs nowadays, how can you expect a model to learn anything worthwhile?"
These are all questions I regularly get when I tell people I work on building foundational LLMs. As often as we use LLMs in our daily lives, people generally know very little about the data that went into training them.
In this talk, I'll address these questions and hope to build an understanding of what it takes to build an LLM from scratch, from a data perspective.
We'll cover, among other things:
- The difference between pre-training and post-training
- How we teach a model language understanding, how to follow instructions and what a good response looks like
- How we rely heavily on humans-in-the-loop and models-in-the-loop
No previous knowledge is expected.
Marysia Winkels is a member of technical staff at Cohere, where she focusses on creating high-quality, diverse data on the post-training data team. She is also chair of PyData Amsterdam and volunteers at CorrelAid, a #data4good organisation that matches data enthusiasts with non-profits that benefit from their help.