Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

Dogucan Yaman, Fevziye Irem Eyiokur, H. K. Ekenel, Alexander Waibel

Published 2025 in arXiv.org

ABSTRACT

Video editing-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
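
The three test setups named in the abstract can be illustrated with a small aggregation sketch. The abstract does not give formulas, so everything here is an assumption made for illustration: the names `SetupScores` and `summarize_leakage` are hypothetical, the per-clip lip-sync scores are assumed to come from some external lip-sync scorer, and "lip-sync discrepancy" is taken, purely as a placeholder definition, to be the difference between the mean matched and mean mismatched scores. It should not be read as the paper's actual metric definitions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class SetupScores:
    """Per-clip lip-sync scores for each test setup (hypothetical container).

    Scores are assumed to be precomputed by an external lip-sync scorer;
    higher means the generated lips agree more strongly with the audio
    they are scored against.
    """
    matched: Sequence[float]      # matched audio-video synthesis
    mismatched: Sequence[float]   # mismatched audio-video pairing
    silent: Sequence[float]       # silent-input generation


def summarize_leakage(scores: SetupScores) -> Dict[str, float]:
    """Aggregate per-setup means and an illustrative discrepancy value.

    ASSUMPTION: discrepancy = mean(matched) - mean(mismatched). This is a
    placeholder definition, not taken from the paper.
    """
    matched_mean = mean(scores.matched)
    mismatched_mean = mean(scores.mismatched)
    silent_mean = mean(scores.silent)
    return {
        "matched_sync": matched_mean,
        "mismatched_sync": mismatched_mean,
        "silent_sync": silent_mean,
        "sync_discrepancy": matched_mean - mismatched_mean,
    }


if __name__ == "__main__":
    # Made-up numbers, only to show the call pattern.
    example = SetupScores(
        matched=[8.0, 6.0],
        mismatched=[5.0, 3.0],
        silent=[1.0, 3.0],
    )
    for name, value in summarize_leakage(example).items():
        print(f"{name}: {value:.2f}")
```

Under these assumed definitions, a large gap between the matched and mismatched means, or a silent-input score well above what still lips would produce, would be a signal worth inspecting for reference-driven lip motion; how the paper actually interprets its metrics is not stated in the abstract.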

PUBLICATION RECORD

Publication year
2025
Venue
arXiv.org
Publication date
2025-11-05
Fields of study
Computer Science, Engineering
Identifiers
DOI: 10.48550/arXiv.2511.08613
arXiv: 2511.08613
Source metadata
Semantic Scholar

REFERENCES

Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation
2025, cited by this paper
KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution
2025, cited by this paper
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
2025, cited by this paper
MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling
2024, cited by this paper
EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion
2024, cited by this paper
Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
2024, cited by this paper
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
2024, cited by this paper
LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision
2024, cited by this paper
Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck
2024, cited by this paper
EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model
2024, cited by this paper
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
2023, cited by this paper
SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning
2023, cited by this paper
Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization
2023, cited by this paper
LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook
2023, cited by this paper
Audio-Driven Talking Face Generation with Stabilized Synchronization Loss
2023, influential reference
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
2023, cited by this paper
DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video
2023, cited by this paper
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
2023, influential reference
VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
2022, cited by this paper
Face-Dubbing++: LIP-Synchronous, Voice Preserving Translation Of Videos
2022, cited by this paper
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
2022, cited by this paper
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
2022, cited by this paper
Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset
2021, cited by this paper
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild
2020, influential reference
MakeItTalk: Speaker-Aware Talking Head Animation
2020, cited by this paper
Everybody’s Talkin’: Let Me Talk as You Want
2020, cited by this paper
Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss
2019, cited by this paper
Deep Audio-Visual Speech Recognition
2018, cited by this paper
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
2018, cited by this paper
Low-Latency Neural Speech Translation
2018, cited by this paper
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
2017, cited by this paper
Out of Time: Automated Lip Sync in the Wild
2016, cited by this paper
Lip Reading in the Wild
2016, cited by this paper
Hybrid offline / online speech translation system
2014, cited by this paper
Image quality assessment: from error visibility to structural similarity
2004, cited by this paper
Face translation: A multimodal translation agent
1999, cited by this paper
Lip movement synthesis from speech based on hidden Markov models
1998, cited by this paper