Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio Captioning
Seyun Ahn, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang
Published 2025 in Interspeech
ABSTRACT
Deep learning-based automated audio captioning (AAC) systems describe audio well, yet they often overfit to reference styles. To address this, reinforcement learning (RL) techniques have been adopted to directly optimize evaluation metrics, but these methods often suffer from word repetition and contextual distortion. Embedding-based rewards, such as those derived from contrastive language-audio pretraining (CLAP), may bias the model toward specific words or phrases that human evaluators find unnatural. In this paper, we propose a novel reward system that combines a CLAP-based reward with a repetition penalty (CRRP) and a large language model (LLM) evaluator. CRRP computes rewards using CLAP similarity, applies a repetition penalty and reward clipping to stabilize training, and uses LLM feedback to enhance naturalness. Our method shows outstanding performance in semantic evaluations and both human and AI-based assessments, with results available at https://yunniya097.github.io/CRRP/.
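The abstract describes a reward built from CLAP similarity, a repetition penalty, and reward clipping. A minimal sketch of that combination is below; it is not the authors' implementation. The `clap_sim` score is assumed to come from a pretrained CLAP model (not shown here), the n-gram repetition measure and the `lam` / `clip_range` parameters are illustrative assumptions, and the LLM-feedback term is omitted.

```python
def repetition_penalty(caption: str, n: int = 2) -> float:
    """Fraction of repeated n-grams in the caption (0.0 = no repetition).

    An assumed, simple proxy for the paper's repetition penalty.
    """
    tokens = caption.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def crrp_reward(clap_sim: float, caption: str,
                lam: float = 1.0,
                clip_range: tuple = (-1.0, 1.0)) -> float:
    """CLAP similarity minus a weighted repetition penalty, clipped for stability.

    clap_sim would come from a pretrained CLAP model scoring the
    (audio, caption) pair; lam and clip_range are hypothetical defaults.
    """
    raw = clap_sim - lam * repetition_penalty(caption)
    low, high = clip_range
    return max(low, min(high, raw))
```

Under this sketch, a caption like "a dog a dog a dog" repeats its bigrams and is penalized relative to a non-repetitive caption with the same CLAP score, and clipping keeps the RL reward signal bounded.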
PUBLICATION RECORD
- Publication year: 2025
- Venue: Interspeech
- Publication date: 2025-08-17
- Fields of study: Computer Science
- Source metadata: Semantic Scholar