Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio Captioning
Seyun Ahn, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang
Published 2025 in Interspeech
ABSTRACT
Deep learning-based automated audio captioning (AAC) systems describe audio well, yet they often overfit to reference styles. To address this, reinforcement learning (RL) techniques have been adopted to directly optimize evaluation metrics, but these methods often suffer from word repetition and contextual distortion. Embedding-based rewards, such as those derived from contrastive language-audio pretraining (CLAP), may bias the model toward specific words or phrases that human evaluators find unnatural. In this paper, we propose a novel reward system that combines a CLAP-based reward with a repetition penalty (CRRP) and a large language model (LLM) evaluator. CRRP computes rewards using CLAP similarity, applies a repetition penalty and reward clipping to stabilize training, and uses LLM feedback to enhance naturalness. Our method shows outstanding performance in semantic evaluations and both human and AI-based assessments, with results available at https://yunniya097.github.io/CRRP/.
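The abstract describes a reward built from CLAP similarity, a repetition penalty, and reward clipping. A minimal sketch of that combination is below; it is not the authors' implementation. The `clap_sim` score is assumed to come from a pretrained CLAP model (not shown here), the n-gram repetition measure and the `lam` / `clip_range` parameters are illustrative assumptions, and the LLM-feedback term is omitted.

```python
def repetition_penalty(caption: str, n: int = 2) -> float:
    """Fraction of repeated n-grams in the caption (0.0 = no repetition).

    An assumed, simple proxy for the paper's repetition penalty.
    """
    tokens = caption.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def crrp_reward(clap_sim: float, caption: str,
                lam: float = 1.0,
                clip_range: tuple = (-1.0, 1.0)) -> float:
    """CLAP similarity minus a weighted repetition penalty, clipped for stability.

    clap_sim would come from a pretrained CLAP model scoring the
    (audio, caption) pair; lam and clip_range are hypothetical defaults.
    """
    raw = clap_sim - lam * repetition_penalty(caption)
    low, high = clip_range
    return max(low, min(high, raw))
```

Under this sketch, a caption like "a dog a dog a dog" repeats its bigrams and is penalized relative to a non-repetitive caption with the same CLAP score, and clipping keeps the RL reward signal bounded.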
PUBLICATION RECORD
- Publication year: 2025
- Venue: Interspeech
- Publication date: 2025-08-17
- Fields of study: Computer Science
- Source metadata: Semantic Scholar