Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang

Published 2023 in arXiv.org

ABSTRACT

Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased by the limited availability and quality of text-audio data. This paper proposes the SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpora to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, a text-to-audio generation model, AudioLDM, is used to generate synthetic audio signals from the captions of an image captioning dataset. SynthAC thus extends well-annotated captions from the text-vision domain to audio captioning, and enhances text-audio representation by learning the relations within the synthetic text-audio pairs. Experiments demonstrate that the SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpora from the text-vision domain, offering a promising solution to the challenge of data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
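
The pairing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SyntheticPair`, `build_synthetic_pairs`, and `placeholder_text_to_audio` are hypothetical, the text-to-audio model is passed in as a plain callable rather than a real AudioLDM interface, and the skipping of empty or duplicate captions is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# Hypothetical interface: any text-to-audio model wrapped as
# "caption in, waveform samples out". A real system would plug in
# a generator such as AudioLDM here; this sketch does not call one.
TextToAudio = Callable[[str], Sequence[float]]


@dataclass
class SyntheticPair:
    """One synthetic training example: a caption and the audio generated from it."""
    caption: str
    waveform: Sequence[float]


def build_synthetic_pairs(
    captions: Iterable[str], text_to_audio: TextToAudio
) -> List[SyntheticPair]:
    """Turn captions (e.g. from an image captioning dataset) into text-audio pairs."""
    pairs: List[SyntheticPair] = []
    seen = set()
    for caption in captions:
        text = caption.strip()
        # Assumption for this sketch: skip empty and repeated captions.
        if not text or text in seen:
            continue
        seen.add(text)
        pairs.append(SyntheticPair(caption=text, waveform=text_to_audio(text)))
    return pairs


def placeholder_text_to_audio(caption: str) -> List[float]:
    """Stand-in generator: returns silence whose length depends on the caption."""
    return [0.0] * (len(caption) * 10)


if __name__ == "__main__":
    demo = build_synthetic_pairs(
        ["a dog barks", "rain falls on a roof"], placeholder_text_to_audio
    )
    for pair in demo:
        print(pair.caption, len(pair.waveform))
```

Keeping the generator behind a callable means the same pairing loop works with any text-to-audio model, and the resulting pairs can be mixed with real text-audio data when training a captioning model.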

PUBLICATION RECORD

Publication year
2023
Venue
arXiv.org
Publication date
2023-09-18
Fields of study
Computer Science, Engineering
Identifiers
DOI: 10.48550/arXiv.2309.09705; arXiv: 2309.09705
Source metadata
Semantic Scholar

REFERENCES

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
2023, cited by this paper
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
2023, cited by this paper
Graph Attention for Automated Audio Captioning
2023, influential reference
Local Information Assisted Attention-Free Decoder for Audio Captioning
2022, cited by this paper
Leveraging Pre-trained BERT for Audio Captioning
2022, cited by this paper
The DCASE 2021 Challenge Task 6 System: Automated Audio Captioning with Weakly Supervised Pre-Training and Word Selection Methods
2021, cited by this paper
An Encoder-Decoder Based Audio Captioning System with Transfer and Reinforcement Learning
2021, influential reference
Can Audio Captions Be Evaluated With Image Caption Metrics?
2021, cited by this paper
CL4AC: A Contrastive Loss for Audio Captioning
2021, cited by this paper
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
2020, cited by this paper
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
2019, cited by this paper
AudioCaps: Generating Captions for Audios in The Wild
2019, influential reference
Clotho: an Audio Captioning Dataset
2019, cited by this paper
Fixing Weight Decay Regularization in Adam
2017, influential reference
Audio Set: An ontology and human-labeled dataset for audio events
2017, cited by this paper
Multimodal Machine Learning: A Survey and Taxonomy
2017, cited by this paper
Automated audio captioning with recurrent neural networks
2017, cited by this paper
Improved Image Captioning via Policy Gradient optimization of SPIDEr
2016, cited by this paper
SPICE: Semantic Propositional Image Caption Evaluation
2016, cited by this paper
CIDEr: Consensus-based image description evaluation
2014, cited by this paper
Microsoft COCO: Common Objects in Context
2014, cited by this paper
Efficient Estimation of Word Representations in Vector Space
2013, cited by this paper

CITED BY

Common Canvas: Open Diffusion Models Trained on Creative-Commons Images
2024, cites this paper
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
2023, cites this paper