Learning Representations from Audio-Visual Spatial Alignment

Published 2020 in Neural Information Processing Systems

ABSTRACT

We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition, and video semantic segmentation.
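
As a rough illustration of the contrastive spatial alignment idea in the abstract, the sketch below scores K viewpoints of a single 360° clip against K audio embeddings, treating the audio from the same viewing direction as the positive and the audio from the other directions as negatives. It is not the authors' implementation: the function names, the symmetric InfoNCE form of the loss, the temperature of 0.07, and the random arrays standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def spatial_alignment_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over K viewpoints of one clip (hypothetical helper).

    video_emb, audio_emb: arrays of shape (K, D). Row k of each array is
    assumed to come from the same viewing direction, so the positives sit
    on the diagonal of the K x K similarity matrix.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature  # cosine similarities scaled by temperature

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy usage: random vectors stand in for encoder outputs (an assumption).
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))                  # 4 viewpoints, 16-dim features
audio = video + 0.01 * rng.normal(size=(4, 16))   # nearly aligned audio features
loss_aligned = spatial_alignment_loss(video, audio)
loss_rotated = spatial_alignment_loss(video, np.roll(audio, 1, axis=0))
```

In this toy setup the correctly paired viewpoints give a lower loss than the pairing in which the audio rows are shifted by one viewpoint, which is the signal a spatial alignment objective of this kind relies on.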

PUBLICATION RECORD

Publication year: 2020
Venue: Neural Information Processing Systems
Publication date: 2020-11-03
Fields of study: Computer Science
Identifiers: arXiv 2011.01819
Source metadata: Semantic Scholar

REFERENCES

What You See Is What You Hear
2020cited by this paper
Contrastive Learning with Adversarial Examples
2020cited by this paper
A Simple Framework for Contrastive Learning of Visual Representations
2020cited by this paper
Audio-Visual Instance Discrimination with Cross-Modal Agreement
2020influential reference
Music Gesture for Visual Sound Separation
2020cited by this paper
The Sound of Motions
2019cited by this paper
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
2019cited by this paper
Momentum Contrast for Unsupervised Visual Representation Learning
2019cited by this paper
Co-Separating Sounds of Visual Objects
2019cited by this paper
Video Representation Learning by Dense Predictive Coding
2019cited by this paper
Contrastive Multiview Coding
2019cited by this paper
Self-Supervised Learning of Pretext-Invariant Representations
2019cited by this paper
The Sound of Pixels
2018cited by this paper
Invariant Information Distillation for Unsupervised Image Segmentation and Clustering
2018cited by this paper
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
2018cited by this paper
Deep Clustering for Unsupervised Learning of Visual Features
2018cited by this paper
Learning to Localize Sound Source in Visual Scenes
2018cited by this paper
Representation Learning with Contrastive Predictive Coding
2018cited by this paper
Invariant Information Clustering for Unsupervised Image Classification and Segmentation
2018cited by this paper
Panoptic Segmentation
2018cited by this paper
Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360° Panoramic Imagery
2018cited by this paper
2.5D Visual Sound
2018cited by this paper
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
2018cited by this paper
Self-Supervised Generation of Spatial Audio for 360 Video
2018influential reference
Saliency Detection in 360° Videos
2018cited by this paper
A Subjective Study of Viewer Navigation Behaviors When Watching 360-Degree Videos on Computers
2018cited by this paper
Self-Supervised Learning of Depth and Camera Motion from 360° Videos
2018cited by this paper
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
2018influential reference
Unsupervised Feature Learning via Non-parametric Instance Discrimination
2018cited by this paper
Learning to Separate Object Sounds by Watching Unlabeled Video
2018cited by this paper
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
2018influential reference
The Kinetics Human Action Video Dataset
2017cited by this paper
A Closer Look at Spatiotemporal Convolutions for Action Recognition
2017cited by this paper
Look, Listen and Learn
2017influential reference
A Public Database of Immersive VR Videos with Corresponding Ratings of Arousal, Valence, and Correlations between Head Movements and Self Report Measures
2017cited by this paper
Attention is All you Need
2017cited by this paper
Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos
2017cited by this paper
Objects that Sound
2017influential reference
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2016cited by this paper
SoundNet: Learning Sound Representations from Unlabeled Video
2016cited by this paper
Unsupervised Learning of Spoken Language with Visual Context
2016cited by this paper
Layer Normalization
2016cited by this paper
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction
2016cited by this paper
Feature Pyramid Networks for Object Detection
2016influential reference
Ambient Sound Provides Supervision for Visual Learning
2016cited by this paper
Pano2Vid: Automatic Cinematography for Watching 360° Videos
2016cited by this paper
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
2016cited by this paper
Out of Time: Automated Lip Sync in the Wild
2016cited by this paper
Top-down modulation in the infant brain: Learning-induced expectations rapidly affect the sensory cortex at 6 months
2015cited by this paper
Unsupervised Visual Representation Learning by Context Prediction
2015cited by this paper
Decoding Sound and Imagery Content in Early Visual Cortex
2014cited by this paper
Microsoft COCO: Common Objects in Context
2014cited by this paper
Spatial transformations for the enhancement of Ambisonic recordings
2014cited by this paper
Adam: A Method for Stochastic Optimization
2014cited by this paper
Recognizing scene viewpoint using panoramic place representation
2012cited by this paper
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
2012influential reference
HMDB: A large video database for human motion recognition
2011influential reference
Sparse Coding Of Time-Varying Natural Images
2010cited by this paper
Curriculum learning
2009cited by this paper
Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition
2007cited by this paper
Dimensionality Reduction by Learning an Invariant Mapping
2006cited by this paper
Efficient sparse coding algorithms
2006cited by this paper
Factors influencing audiovisual fission and fusion illusions.
2004cited by this paper
Listen and learn.
2002cited by this paper
Illusions: What you see is what you hear
2000cited by this paper
Emergence of simple-cell receptive field properties by learning a sparse code for natural images
1996cited by this paper
Learning Classification with Unlabeled Data
1993cited by this paper
Hearing lips and seeing voices
1976cited by this paper
Periphony: With-Height Sound Reproduction
1973cited by this paper
Supplementary Material of Self-Supervised Relative Depth Learning for Urban Scene Understanding
year unknowncited by this paper
Saliency Detection in 360° Videos
year unknowncited by this paper

CITED BY

DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos
2026cites this paper
IntraCross: Cross-modality graph matching for intravascular sequence registration.
2026cites this paper
Segmenting Collision Sound Sources in Egocentric Videos
2025cites this paper
SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation
2025cites this paper
Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
2025cites this paper
Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360°
2025cites this paper
Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration
2025cites this paper
Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
2025influential citation
ASAudio: A Survey of Advanced Spatial Audio Research
2025influential citation
Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
2025cites this paper
MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses
2025cites this paper
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
2025cites this paper
Supervising Sound Localization by In-the-wild Egomotion
2025cites this paper
Event-level multimodal feature fusion for audio-visual event localization
2025cites this paper
Transformers in speech processing: Overcoming challenges and paving the future
2025cites this paper
ViSAGe: Video-to-Spatial Audio Generation
2025cites this paper
Empowering smallholder olive growers in northwest Tunisia through an agroecological business model
2025cites this paper
Multimodal Perception for Goal-oriented Navigation: A Survey
2025cites this paper
Disentangled adaptive fusion transformer using adversarial perturbation for egocentric action anticipation
2025cites this paper
OmniAudio: Generating Spatial Audio from 360-Degree Video
2025influential citation
InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals
2025cites this paper
Audio-visual self-supervised representation learning: A survey
2025cites this paper
Learning Probabilistic Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
2025cites this paper
Universal Time-Series Representation Learning: A Survey
2024cites this paper
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
2024cites this paper
Exploring Event Misalignment Bias and Segment Focus Bias for Weakly-Supervised Audio-Visual Video Parsing
2024cites this paper
Aligning Audio-Visual Joint Representations with an Agentic Workflow
2024cites this paper
MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks
2024cites this paper
ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation
2024cites this paper
Multi-scale Multi-instance Visual Sound Localization and Segmentation
2024cites this paper
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
2024cites this paper
Enhancing Sound Source Localization via False Negative Elimination
2024cites this paper
Computer Audition: From Task-Specific Machine Learning to Foundation Models
2024cites this paper
Enhanced video clustering using multiple riemannian manifold-valued descriptors and audio-visual information
2024cites this paper
Audio-visual Generalized Zero-shot Learning the Easy Way
2024cites this paper
Semantic Grouping Network for Audio Source Separation
2024cites this paper
Deep learning to quantify care manipulation activities in neonatal intensive care units
2024cites this paper
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
2024cites this paper
Unified Video-Language Pre-training with Synchronized Audio
2024cites this paper
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
2024cites this paper
Enhancing Spatial Audio Generation with Source Separation and Channel Panning Loss
2024cites this paper
Visually Guided Audio Source Separation with Meta Consistency Learning
2024cites this paper
Text-to-Audio Generation Synchronized with Videos
2024cites this paper
Positive and Negative Sampling Strategies for Self-Supervised Learning on Audio-Video Data
2024influential citation
BAT: Learning to Reason about Spatial Sounds with Large Language Models
2024cites this paper
Audio-Infused Automatic Image Colorization by Exploiting Audio Scene Semantics
2024cites this paper
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions Through Masked Modeling
2023cites this paper
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
2023cites this paper
Novel-View Acoustic Synthesis
2023cites this paper
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation
2023cites this paper
Transformers in Speech Processing: A Survey
2023cites this paper
Audio-Visual Grouping Network for Sound Localization from Mixtures
2023cites this paper
Self-Supervised Multimodal Learning: A Survey
2023cites this paper
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer
2023cites this paper
AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation
2023cites this paper
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
2023cites this paper
Off-Screen Sound Separation Based on Audio-visual Pre-training Using Binaural Audio
2023cites this paper
A Comprehensive Survey on Segment Anything Model for Vision and Beyond
2023cites this paper
DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment
2023cites this paper
A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
2023cites this paper
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
2023cites this paper
Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
2023cites this paper
Modality Influence in Multimodal Machine Learning
2023cites this paper
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
2023influential citation
Multimodal Imbalance-Aware Gradient Modulation for Weakly-Supervised Audio-Visual Video Parsing
2023cites this paper
Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning
2023influential citation
Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions
2023cites this paper
Spherical Vision Transformer for 360-degree Video Saliency Prediction
2023cites this paper
Class-Incremental Grouping Network for Continual Audio-Visual Learning
2023cites this paper
D-SAV360: A Dataset of Gaze Scanpaths on 360° Ambisonic Videos
2023cites this paper
Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering
2023cites this paper
CAD - Contextual Multi-modal Alignment for Dynamic AVQA
2023cites this paper
SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization
2023cites this paper
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
2023cites this paper
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
2023cites this paper
DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
2023influential citation
Weakly-Supervised Audio-Visual Segmentation
2023cites this paper
A Model for the Automatic Mixing of Multiple Audio and Video Clips
2023cites this paper
Bootstrapping Autonomous Driving Radars with Self-Supervised Learning
2023cites this paper
VV360 database: Vídeos omnidirecionais para detecção e rastreamento de elementos no trânsito
2023cites this paper
Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition
2023cites this paper
Visual Acoustic Matching
2022cites this paper
Learning from the Best: Contrastive Representations Learning Across Sensor Locations for Wearable Activity Recognition
2022cites this paper
Edge-Based Cross-Modal Communications for Remote Healthcare
2022cites this paper
Camera Pose Estimation and Localization with Active Audio Sensing
2022cites this paper
Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding
2022cites this paper
Self-Supervised Learning of Audio Representations using Angular Contrastive Loss
2022cites this paper
Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs
2022cites this paper
Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning
2022cites this paper
An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction
2022cites this paper
Audio-Visual Contrastive Learning for Self-supervised Action Recognition
2022cites this paper
Self-Supervised Contrastive Learning for Audio-Visual Action Recognition
2022cites this paper
Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining
2022cites this paper
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
2022cites this paper
Audio-Visual MLP for Scoring Sport
2022cites this paper
Localizing Visual Sounds the Easy Way
2022cites this paper
Skating-Mixer: Multimodal MLP for Scoring Figure Skating
2022cites this paper
Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
2022cites this paper
Audio-visual speech separation based on joint feature representation with cross-modal attention
2022cites this paper
Audio self-supervised learning: A survey
2022cites this paper