Marysia Winkels
Marysia Winkels is a member of technical staff at Cohere, where she focusses on creating high-quality, diverse data for the post-train data team. She also chairs PyData Amsterdam and volunteers at CorrelAid, a #data4good organisation that matches data enthusiasts with non-profits that can benefit from their help.
Sessions
"What training data do you need? Don't you just train on the whole internet?"
"Doesn't data production rely heavily on outsourcing to cheap labour markets in the Global South?"
"Isn't all training data just synthetic nonsense generated by LLMs nowadays? How can you expect a model to learn anything worthwhile?"
These are all questions I regularly get when I tell people I work on building foundational LLMs. As often as we use LLMs in our daily lives, people generally know very little about the data that goes into training them.
In this talk, I'll address these questions and hope to build an understanding of what it takes to build an LLM from scratch, from a data perspective.