Sound-to-Imagination: An Exploratory Study on Cross-Modal Translation Using Diverse Audiovisual Data

Published 2023 in Applied Sciences

ABSTRACT

The motivation of our research is to explore the possibilities of automatic sound-to-image (S2I) translation for enabling a human receiver to visually infer occurrences of sound-related events. We expect the computer to ‘imagine’ scenes from captured sounds, generating original images that depict the sound-emitting sources. Previous studies on similar topics opted for simplified approaches using data with low content diversity and/or supervision/self-supervision for training. In contrast, our approach involves performing S2I translation using thousands of distinct and unknown scenes, using sound class annotations solely for data preparation, just enough to ensure aural–visual semantic coherence. To model the translator, we employ an audio encoder and a conditional generative adversarial network (GAN) with a deep densely connected generator. Furthermore, we present a solution using informativity classifiers for quantitatively evaluating the generated images. This allows us to analyze the influence of network-bottleneck variation on the translation process, highlighting a potential trade-off between informativity and pixel space convergence. Despite the complexity of the specified S2I translation task, we were able to generalize the model enough to obtain more than 14%, on average, of interpretable and semantically coherent images translated from unknown sounds.
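
As a rough illustration of the kind of figure quoted at the end of the abstract, the sketch below computes, for each sound class, the share of generated images that are judged both interpretable and consistent with the class of the source sound, and then averages those shares across classes. The record layout, the function name `coherent_rate`, the toy class names, and the choice of a per-class (macro) average are all assumptions made for illustration; the abstract does not say how the paper's informativity classifiers are applied or how its average is taken.

```python
from collections import defaultdict


def coherent_rate(records):
    """Return (per-class rates, macro-average rate).

    Each record is a hypothetical tuple:
        (sound_class, interpretable, predicted_class)
    where `interpretable` is a bool and `predicted_class` is the label a
    classifier assigns to the generated image (None if not interpretable).
    """
    totals = defaultdict(int)  # images generated per sound class
    hits = defaultdict(int)    # interpretable AND class-consistent images

    for sound_class, interpretable, predicted_class in records:
        totals[sound_class] += 1
        if interpretable and predicted_class == sound_class:
            hits[sound_class] += 1

    per_class = {c: hits[c] / totals[c] for c in totals}
    # Macro average: every class counts equally, regardless of its size.
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, macro


# Toy data, invented for this example only.
records = [
    ("dog", True, "dog"),
    ("dog", False, None),
    ("dog", True, "drum"),
    ("dog", True, "dog"),
    ("drum", True, "drum"),
    ("drum", False, None),
]

per_class, macro = coherent_rate(records)
print(per_class, macro)
```

A macro average is used here so that a class with many test sounds does not dominate the result; a micro average (pooling all images) would be an equally plausible reading of "on average".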

PUBLICATION RECORD

Publication year: 2023
Venue: Applied Sciences
Publication date: 2023-09-29
Fields of study: Not labeled
Identifiers: DOI 10.3390/app131910833
Source metadata: Semantic Scholar

REFERENCES

An Attention Enhanced Cross-Modal Image-Sound Mutual Generation Model for Birds (2021)
The Cambridge Handbook of the Imagination (2020)
Deep Audio-visual Learning: A Survey (2020)
Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII (2020)
Improving Unsupervised Domain Adaptation with Variational Information Bottleneck (2019)
Computational Pathology and Ophthalmic Medical Image Analysis (2018)
Imagination Machines: A New Challenge for Artificial Intelligence (2018)
CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation (2017)
Deep Cross-Modal Audio-Visual Generation (2017)
Bag-of-Features Methods for Acoustic Event Detection and Classification (2017)
Deconvolution and Checkerboard Artifacts (2016)
Do Semantic Parts Emerge in Convolutional Neural Networks? (2016)
Label-Free Supervision of Neural Networks with Physics and Domain Knowledge (2016)
Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications (2016)
Deep Learning: Methods and Applications (2014)
The Oxford Handbook of the Development of Imagination (2013)
Twelve Conceptions of Imagination (2003)
Matching Words and Pictures (2003)
Interpreting the Language of Environmental Sounds (1987)

CITED BY

Speech-to-Image Generation Using Audio Embeddings and Latent Diffusion Models (2025)