Visual to Sound: Generating Natural Sound for Videos in the Wild
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
Published in the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
ABSTRACT
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people and animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
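The abstract describes conditioning raw-waveform generation on input video frames. The sketch below is an illustrative toy, not the paper's trained model: a tiny autoregressive loop with random (untrained) weights standing in for a learned visual encoder and recurrent sample generator, assuming per-frame feature vectors as input. All names (`generate_waveform`, `samples_per_frame`, etc.) are made up for this example; `samples_per_frame=1600` merely illustrates ~16 kHz audio against ~10 fps video.

```python
import numpy as np

def generate_waveform(frames, samples_per_frame=1600, hidden_dim=32, seed=0):
    """Autoregressively emit one audio sample at a time, conditioned on
    per-frame visual features. Weights are random projections, so the
    output is noise; only the data flow mirrors the described setup."""
    rng = np.random.default_rng(seed)
    n_frames, feat_dim = frames.shape
    # Random matrices standing in for learned parameters.
    W_feat = rng.normal(0, 0.1, (feat_dim, hidden_dim))  # visual encoder
    W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))   # recurrence
    w_x = rng.normal(0, 0.1, hidden_dim)                 # previous sample
    w_out = rng.normal(0, 0.1, hidden_dim)               # sample readout

    h = np.zeros(hidden_dim)
    x = 0.0  # previously generated audio sample
    waveform = []
    for f in range(n_frames):
        cond = frames[f] @ W_feat  # visual conditioning for this frame
        for _ in range(samples_per_frame):
            # Recurrent update mixing visual context and the last sample.
            h = np.tanh(h @ W_h + x * w_x + cond)
            x = float(np.tanh(h @ w_out))  # next sample, in [-1, 1]
            waveform.append(x)
    return np.array(waveform)
```

One design point the toy preserves: the visual features change once per frame while many audio samples are generated per frame, which is what lets the output stay temporally aligned with the video.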
PUBLICATION RECORD
- Publication year: 2017
- Publication date: 2017-12-04
- Venue: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar
REFERENCES
- This paper lists 32 references.