12-03, 16:00–16:30 (UTC), AI/ML Track
This talk will cover how to use pre-trained HuggingFace models, specifically wav2vec 2.0 and WavLM, to detect audio deepfakes. These deepfakes, made possible by advanced voice cloning tools such as ElevenLabs and Respeecher, pose risks ranging from misinformation and fraud to privacy violations. The session will introduce deepfake audio, discuss current trends in voice cloning, and provide a hands-on tutorial for using these transformer-based models to identify synthetic voices by spotting subtle anomalies. Participants will learn how to set up the models, analyze deepfake audio datasets, and assess detection performance, bridging the gap between speech generation and detection technologies.
In this talk, we will explore how to leverage off-the-shelf HuggingFace models for detecting audio deepfakes, focusing on state-of-the-art models like wav2vec 2.0 and WavLM. Audio deepfakes, generated by advanced voice cloning technologies, have become a significant concern due to their potential for misuse in areas like misinformation, fraud, and privacy breaches. Tools such as ElevenLabs and Respeecher now enable highly realistic voice replication, making detection technologies more crucial than ever. The tutorial will guide participants through setting up and using pre-trained models for identifying deepfake audio. Both models are built on transformer architectures and have shown strong performance in speech-related tasks, making them suitable candidates for detecting subtle anomalies in synthetic voices.
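As a taste of the setup step, the sketch below loads one of the pre-trained checkpoints from the HuggingFace Hub and pools its hidden states into a single embedding per clip. The checkpoint name, the dummy waveform, and the mean-pooling choice are illustrative assumptions, not the tutorial's fixed recipe; `microsoft/wavlm-base-plus` can be swapped in for the WavLM variant.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Public checkpoint used for illustration; "microsoft/wavlm-base-plus" works the same way
model_name = "facebook/wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# One second of silent 16 kHz audio stands in for a real clip
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (batch, frames, hidden_dim)
    hidden = model(**inputs).last_hidden_state

# Mean-pool over time to get one fixed-size vector per clip
embedding = hidden.mean(dim=1)
print(embedding.shape)  # (1, hidden_dim)
```

Per-clip embeddings like this are one common way to turn a self-supervised speech model into features for a downstream real-vs-fake classifier.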
We will start by introducing the fundamentals of deepfake audio and discussing current trends in voice cloning technologies, highlighting the rise of commercial tools that make it easy to generate cloned voices. The hands-on session will focus on practical implementations. By the end of the tutorial, participants will have a clear understanding of how to utilise these pre-trained models from HuggingFace’s library, extract representations from labelled deepfake datasets, and evaluate their effectiveness in detecting manipulated audio. Through this process, we aim to bridge the gap between cutting-edge speech generation technologies and the emerging need for highly accurate deepfake detection systems.
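The evaluation step can be sketched end-to-end with scikit-learn. The random arrays below are stand-ins for real pooled wav2vec 2.0 / WavLM embeddings, and the logistic-regression probe plus equal-error-rate metric are one simple, common choice rather than the talk's prescribed pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder embeddings (hidden_dim = 768); in practice these come from
# mean-pooling the model's last_hidden_state over each labelled clip.
real = rng.normal(loc=0.0, scale=1.0, size=(200, 768))
fake = rng.normal(loc=0.3, scale=1.0, size=(200, 768))
X = np.vstack([real, fake])
y = np.array([0] * 200 + [1] * 200)  # 0 = bona fide, 1 = deepfake

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A lightweight linear probe on top of frozen embeddings
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Equal error rate: where false-accept and false-reject rates cross
fpr, tpr, _ = roc_curve(y_te, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"EER: {eer:.3f}")
```

Replacing the synthetic arrays with embeddings extracted from a labelled corpus turns this into the full extract-and-evaluate loop covered in the session.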
Previous knowledge expected