Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang
Published 2023 in arXiv.org
ABSTRACT
Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased by the limited availability and quality of text-audio data. This paper proposes SynthAC, a framework that leverages recent advances in audio generative models and commonly available text corpora to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, a text-to-audio generation model, AudioLDM, is used to generate synthetic audio signals from the captions of an image captioning dataset. SynthAC thus extends well-annotated captions from the text-vision domain to audio captioning, enhancing text-audio representation by learning relations within the synthetic text-audio pairs. Experiments demonstrate that SynthAC can benefit audio captioning models by incorporating a well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge of data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
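As a rough illustration of the data flow the abstract describes (captions from an image captioning dataset are passed through a text-to-audio model to yield synthetic text-audio pairs), a minimal Python sketch might look like the following. It is not the authors' implementation: `placeholder_tta`, `TextAudioPair`, `build_synthetic_pairs`, the sample rate, and the example captions are all hypothetical, and the placeholder generator simply emits a sine tone so the sketch runs without AudioLDM or any model weights.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate for the placeholder audio


@dataclass
class TextAudioPair:
    """One synthetic training example: a caption and the audio generated from it."""
    caption: str
    audio: List[float]  # mono waveform samples in [-1, 1]


def placeholder_tta(caption: str, num_samples: int = 160) -> List[float]:
    """Hypothetical stand-in for a text-to-audio model such as AudioLDM.

    Returns a sine tone whose pitch depends on the caption length,
    purely so that the pipeline can be exercised without a real model.
    """
    freq = 220.0 + 10.0 * (len(caption) % 50)
    return [
        math.sin(2.0 * math.pi * freq * t / SAMPLE_RATE)
        for t in range(num_samples)
    ]


def build_synthetic_pairs(
    captions: Sequence[str],
    generate: Callable[[str], List[float]] = placeholder_tta,
) -> List[TextAudioPair]:
    """Pair each non-empty caption with audio generated from that caption."""
    pairs = []
    for caption in captions:
        text = caption.strip()
        if not text:
            continue  # skip blank captions
        pairs.append(TextAudioPair(caption=text, audio=generate(text)))
    return pairs


# Made-up captions standing in for an image captioning dataset.
image_captions = [
    "A dog barks at a passing car.",
    "   ",
    "Rain falls on a tin roof.",
]
synthetic_pairs = build_synthetic_pairs(image_captions)
```

In a real setup, the `generate` argument would wrap an actual text-to-audio model, and the resulting pairs would be added to the training data of an audio captioning model. How the synthetic pairs are filtered or weighted against real recordings is not specified in the abstract and is left out of the sketch.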
PUBLICATION RECORD
- Publication year: 2023
- Venue: arXiv.org
- Publication date: 2023-09-18
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar