LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

Published 2023 in IEEE Workshop/Winter Conference on Applications of Computer Vision

ABSTRACT

Existing machine learning research has achieved promising results in monaural audio-visual separation (MAVS). However, most MAVS methods consider only what the sound source is, not where it is located. This can be a problem in VR/AR scenarios, where listeners need to distinguish between similar audio sources located in different directions. To address this limitation, we generalize MAVS to spatial audio separation and propose LAVSS, a location-guided audio-visual spatial audio separator. LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference carried by binaural audio as a spatial cue, and we use positional representations of sounding objects as additional modality guidance. We also leverage multi-level cross-modal attention to perform visual-positional collaboration with audio features. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds and boost spatial audio separation, exploiting the correlation between monaural and binaural channels. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS over existing audio-visual separation baselines. Project page: https://yyx666660.github.io/LAVSS/.
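
The abstract names the phase difference between binaural channels as the spatial cue. The sketch below illustrates that general idea only and is not the authors' implementation: it estimates the interaural phase difference at a single frequency from the left and right channels using a one-bin discrete Fourier transform. The function names, the 16 kHz sample rate, the 500 Hz tone, and the 0.25 ms delay are all assumptions chosen for illustration.

```python
import cmath
import math


def bin_coefficient(signal, sample_rate, freq_hz):
    """Return the DFT coefficient of a real signal at one frequency."""
    step = -2j * math.pi * freq_hz / sample_rate
    return sum(x * cmath.exp(step * n) for n, x in enumerate(signal))


def interaural_phase_difference(left, right, sample_rate, freq_hz):
    """Phase of the left channel minus phase of the right channel, in radians.

    The result is wrapped to [-pi, pi]; a positive value means the left
    channel leads the right channel at this frequency.
    """
    left_coeff = bin_coefficient(left, sample_rate, freq_hz)
    right_coeff = bin_coefficient(right, sample_rate, freq_hz)
    return cmath.phase(left_coeff * right_coeff.conjugate())


# Hypothetical test signal: a 500 Hz tone that reaches the right ear
# 0.25 ms after the left ear (illustrative values, not from the paper).
sample_rate = 16000
freq_hz = 500.0
delay_s = 0.00025
times = [n / sample_rate for n in range(1024)]
left = [math.sin(2 * math.pi * freq_hz * t) for t in times]
right = [math.sin(2 * math.pi * freq_hz * (t - delay_s)) for t in times]

ipd = interaural_phase_difference(left, right, sample_rate, freq_hz)
print(round(ipd, 3))  # about pi/4, i.e. 2 * pi * freq_hz * delay_s
```

A source on the listener's left reaches the left ear first, so the sign and size of this phase difference carry directional information that a single mono channel cannot provide. A practical system would typically compute such differences per time-frequency bin of a short-time Fourier transform rather than at one fixed frequency.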

PUBLICATION RECORD

Publication year: 2023
Venue: IEEE Workshop/Winter Conference on Applications of Computer Vision
Publication date: 2023-10-31
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.1109/WACV57701.2024.00542; arXiv 2310.20446


