VoxCeleb: A Large-Scale Speaker Identification Dataset

Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Published 2017 in Interspeech

ABSTRACT

Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large-scale, text-independent speaker identification dataset collected 'in the wild'. We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN); and confirming the identity of the speaker using CNN-based facial recognition. We use this pipeline to curate VoxCeleb, which contains hundreds of thousands of 'real world' utterances for over 1,000 celebrities. Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance. We show that a CNN-based architecture obtains the best performance for both identification and verification.
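
The filtering logic of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is an illustration only, not the authors' implementation: the `FaceTrack` record, the two score fields, and both thresholds are hypothetical stand-ins for the outputs of the synchronization CNN and the face recognition CNN, whose actual interfaces and cut-off values are not given here.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical thresholds; the abstract does not state any values.
SYNC_THRESHOLD = 0.5
FACE_THRESHOLD = 0.5


@dataclass
class FaceTrack:
    """One candidate face track cut from a downloaded video (hypothetical record)."""
    video_id: str
    claimed_identity: str
    sync_confidence: float   # stand-in for the audio-video synchronization CNN output
    face_match_score: float  # stand-in for the face recognition CNN output


def is_active_speaker(track: FaceTrack) -> bool:
    """Active speaker check: lip motion must be in sync with the audio."""
    return track.sync_confidence >= SYNC_THRESHOLD


def is_claimed_identity(track: FaceTrack) -> bool:
    """Identity check: the visible face must match the person of interest."""
    return track.face_match_score >= FACE_THRESHOLD


def curate(tracks: Iterable[FaceTrack]) -> List[FaceTrack]:
    """Return only the tracks that pass both checks, in input order."""
    return [t for t in tracks if is_active_speaker(t) and is_claimed_identity(t)]


candidates = [
    FaceTrack("vid_a", "speaker_1", sync_confidence=0.9, face_match_score=0.8),
    FaceTrack("vid_b", "speaker_1", sync_confidence=0.2, face_match_score=0.9),  # voice not from visible face
    FaceTrack("vid_c", "speaker_2", sync_confidence=0.7, face_match_score=0.1),  # wrong person on screen
]
kept = curate(candidates)
print([t.video_id for t in kept])
```

Requiring both checks is what lets the audio of a kept track be labelled with a speaker identity without hand annotation: the first check ties the voice to the visible face, and the second ties that face to a named person.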

PUBLICATION RECORD

Publication year
2017
Venue
Interspeech
Publication date
2017-06-26
Fields of study
Computer Science
Identifiers
DOI 10.21437/Interspeech.2017-950 arXiv 1706.08612
Source metadata
Semantic Scholar
