Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

Published 2026 in Unknown venue

ABSTRACT

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, which limits scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves the noise robustness of LALMs. Specifically, FTL first separates the input waveform into speech and non-speech components; a modality router then predicts the target audio modality (e.g., speech) from the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.
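
The three stages named in the abstract (separate, route, fuse) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the keyword-based `route_modality`, the fixed-gain `fuse`, the `residual_gain` parameter, and the caller-supplied `separator` are hypothetical stand-ins for FTL's separator, modality router, and modality-aware fusion block, whose actual designs the abstract does not specify.

```python
from typing import Callable, List, Tuple

Waveform = List[float]
# A separator maps a mixture to (speech_stem, non_speech_stem). Hypothetical interface.
Separator = Callable[[Waveform], Tuple[Waveform, Waveform]]

# Hypothetical keyword list; a real router would be a learned or prompted model.
SPEECH_KEYWORDS = ("say", "said", "speaker", "transcribe", "spoken", "speech", "word")


def route_modality(instruction: str) -> str:
    """Toy router: pick 'speech' or 'non_speech' from the user's instruction."""
    text = instruction.lower()
    return "speech" if any(k in text for k in SPEECH_KEYWORDS) else "non_speech"


def fuse(speech: Waveform, non_speech: Waveform, target: str,
         residual_gain: float = 0.2) -> Waveform:
    """Toy fusion: keep the target stem at full gain, attenuate the other stem."""
    if target not in ("speech", "non_speech"):
        raise ValueError(f"unknown target modality: {target!r}")
    if len(speech) != len(non_speech):
        raise ValueError("stems must have the same length")
    if target == "speech":
        w_speech, w_other = 1.0, residual_gain
    else:
        w_speech, w_other = residual_gain, 1.0
    return [w_speech * s + w_other * n for s, n in zip(speech, non_speech)]


def focus_then_listen(waveform: Waveform, instruction: str, separator: Separator,
                      residual_gain: float = 0.2) -> Waveform:
    """Separate, route on the instruction, then fuse into one enhanced signal."""
    speech, non_speech = separator(waveform)
    target = route_modality(instruction)
    return fuse(speech, non_speech, target, residual_gain)


if __name__ == "__main__":
    # Oracle separator for demonstration only: it returns stems known in advance.
    speech = [0.5, -0.5, 0.5, -0.5]
    noise = [0.1, 0.1, -0.1, -0.1]
    mixture = [s + n for s, n in zip(speech, noise)]
    enhanced = focus_then_listen(
        mixture, "Transcribe what the speaker says.", lambda _: (speech, noise)
    )
    print(enhanced)
```

The enhanced waveform would then be passed to an unmodified LALM in place of the raw mixture, which is what makes the approach plug-and-play. Keeping a small residual of the non-target stem, rather than discarding it, is one possible design choice for limiting the impact of separation errors; whether FTL does this is not stated in the abstract.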

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-03-05
Fields of study
Computer Science, Engineering
Identifiers
arXiv 2603.04862
Source metadata
Semantic Scholar

REFERENCES

SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models (2026)
SAM Audio: Segment Anything in Audio (2025)
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models (2025)
Qwen3-Omni Technical Report (2025)
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models (2025)
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence (2025)
Fun-Audio-Chat Technical Report (2025)
Qwen3 Technical Report (2025)
Kimi-Audio Technical Report (2025)
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models (2025)
Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs (2025)
Sub-Band and Full-Band Interactive U-Net with DPRNN for Demixing Cross-Talk Stereo Music (2024)
An Investigation of Incorporating Mamba For Speech Enhancement (2024)
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition (2024)
AudioBench: A Universal Benchmark for Audio Large Language Models (2024)
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities (2024)
Connecting Speech Encoder and Large Language Model for ASR (2023)
Pengi: An Audio Language Model for Audio Tasks (2023)
Separate Anything You Describe (2023)
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning (2023)
Investigating the Catastrophic Forgetting in Multimodal Large Language Models (2023)
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (2023)
CochlScene: Acquisition of acoustic scene data using crowdsourcing (2022)
A voice controlled smart home automation system using artificial intelligent and internet of things (2022)
VocalSound: A Dataset for Improving Human Vocal Sounds Recognition (2022)
How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR (2022)
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021)
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio (2021)
VGGSound: A Large-Scale Audio-Visual Dataset (2020)
FSD50K: An Open Dataset of Human-Labeled Sound Events (2020)
Exploration and assessment of proactive use cases for an in-car voice assistant (2019)
Generative Adversarial Nets (2018)
Audio Set: An ontology and human-labeled dataset for audio events (2017)
An investigation of the use of robots in public spaces (2015)
Noisy training for deep neural networks in speech recognition (2015)
Librispeech: An ASR corpus based on public domain audio books (2015)
A Dataset and Taxonomy for Urban Sound Research (2014)
An end-to-end integration of speech separation and recognition with self-supervised learning representation (year unknown)

CITED BY

No citing papers are available for this paper.