SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Junbo Wang, Haofeng Tan, Bowen Liao, Albert Q. Jiang, Teng Fei, Qixing Huang, Zhengzhong Tu, Shan Ye, Yuhao Kang

Published 2025 on arXiv

ABSTRACT

Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically informed geo-contextual evaluation framework that measures consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while the PSS effectively captures multi-level generation consistency across the element, scene, and human-perception levels. Project page: https://gisense.github.io/SounDiT-Page/
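The abstract describes SounDiT only at a high level, and the exact conditioning mechanism is not given on this page. Below is a minimal, self-contained sketch of one common way soundscape and geo-context signals might be injected into a DiT block: cross-attention from latent image tokens to projected condition tokens. The block layout, dimensions, and placeholder embeddings are assumptions for illustration, not the authors' published architecture.

    # Hypothetical sketch (not the paper's actual architecture) of conditioning
    # a DiT block on soundscape + geo-context embeddings via cross-attention.
    import torch
    import torch.nn as nn

    class ConditionedDiTBlock(nn.Module):
        """One transformer block: self-attention over latent image tokens,
        cross-attention to condition tokens (audio + geo-context), then an MLP."""
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

        def forward(self, x, cond):
            # x:    (B, N, dim) noisy latent image tokens
            # cond: (B, M, dim) condition tokens (soundscape + geo-context)
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x)
            x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
            x = x + self.mlp(self.norm3(x))
            return x

    # Hypothetical condition construction: project a soundscape embedding
    # (e.g., from a pretrained audio encoder) and a geo-contextual scene
    # embedding into the token dimension, then concatenate as a sequence.
    B, dim = 2, 512
    audio_emb = torch.randn(B, 1, 768)   # placeholder soundscape embedding
    geo_emb = torch.randn(B, 1, 256)     # placeholder geo-context embedding
    to_tok_audio = nn.Linear(768, dim)
    to_tok_geo = nn.Linear(256, dim)
    cond = torch.cat([to_tok_audio(audio_emb), to_tok_geo(geo_emb)], dim=1)

    latents = torch.randn(B, 64, dim)    # e.g., 8x8 grid of latent patch tokens
    block = ConditionedDiTBlock(dim)
    out = block(latents, cond)           # (B, 64, dim)

In a full diffusion model, a stack of such blocks would denoise the latent tokens over many timesteps, with the same condition tokens attended to at every block.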
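Likewise, the PSS formula is not stated on this page. The toy sketch below shows one plausible reading of a multi-level consistency metric: a weighted average of audio-image cosine similarities computed at element, scene, and human-perception levels. The level names, uniform weighting, and placeholder embeddings are hypothetical; real use would require pretrained encoders mapping both modalities into a shared space at each level.

    # Toy sketch (an assumption, not the paper's published formula) of a
    # multi-level place-similarity metric between soundscape and image embeddings.
    import torch
    import torch.nn.functional as F

    def place_similarity_score(audio_embs, image_embs, weights=None):
        """audio_embs / image_embs: dicts mapping a level name
        ('element', 'scene', 'perception') to a (B, D) embedding tensor
        assumed to come from a shared audio-image space for that level."""
        levels = list(audio_embs)
        if weights is None:
            weights = {k: 1.0 / len(levels) for k in levels}
        score = 0.0
        for lvl in levels:
            # Per-pair cosine similarity at this level, averaged over the batch.
            sim = F.cosine_similarity(audio_embs[lvl], image_embs[lvl], dim=-1)
            score = score + weights[lvl] * sim.mean()
        return score

    # Usage with random placeholder embeddings standing in for real encoders:
    B, D = 4, 256
    audio = {k: torch.randn(B, D) for k in ('element', 'scene', 'perception')}
    image = {k: torch.randn(B, D) for k in ('element', 'scene', 'perception')}
    print(place_similarity_score(audio, image))  # scalar in [-1, 1]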
