CL4AC: A Contrastive Loss for Audio Captioning

Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H. L. Tang, Mark D. Plumbley, Wenwu Wang

Published 2021 in Workshop on Detection and Classification of Acoustic Scenes and Events

ABSTRACT

Automated audio captioning (AAC) is a cross-modal translation task that aims to describe the content of an audio clip in natural language. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder is then used to generate the caption. However, training an AAC system often runs into the problem of data scarcity, which may lead to inaccurate representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text even when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of the proposed approach.
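
The abstract does not spell out how samples are contrasted, so the following is only a minimal sketch of one common way to derive a self-supervision signal from paired data: each audio clip is kept with its own caption (label 1) and also paired with a caption borrowed from a different clip (label 0), and a binary cross-entropy term on a predicted match probability is added to the captioning loss. The function names, the toy clip identifiers and captions, and the weighting parameter are all hypothetical and are not taken from the paper.

```python
import math
import random


def make_contrastive_pairs(audio_ids, captions, rng):
    """Return (audio_id, caption, label) triples.

    Label 1 marks the original pairing; label 0 marks a caption borrowed
    from a different clip (mismatched by index). Hypothetical helper.
    """
    if len(audio_ids) != len(captions) or len(audio_ids) < 2:
        raise ValueError("need at least two aligned audio-caption pairs")
    n = len(audio_ids)
    triples = []
    for i in range(n):
        triples.append((audio_ids[i], captions[i], 1))
        # Pick any index other than i so the negative comes from another clip.
        j = rng.choice([k for k in range(n) if k != i])
        triples.append((audio_ids[i], captions[j], 0))
    return triples


def binary_cross_entropy(match_probs, labels, eps=1e-12):
    """Mean binary cross-entropy between match probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(match_probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def combined_loss(caption_loss, contrastive_loss, weight=1.0):
    """Captioning loss plus a weighted contrastive term (weight is an assumption)."""
    return caption_loss + weight * contrastive_loss


# Toy data, invented for illustration only.
rng = random.Random(0)
audio_ids = ["clip_a", "clip_b", "clip_c"]
captions = ["rain falls on a roof", "a dog barks twice", "a train passes by"]
pairs = make_contrastive_pairs(audio_ids, captions, rng)
labels = [label for _, _, label in pairs]
```

In a real system the match probabilities would come from a model that scores each audio-text pair; here they are left as an input so the sketch stays independent of any particular encoder or decoder.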

PUBLICATION RECORD

Publication year
2021
Venue
Workshop on Detection and Classification of Acoustic Scenes and Events
Publication date
2021-07-21
Fields of study
Computer Science, Engineering
Identifiers
arXiv 2107.09990
Source metadata
Semantic Scholar
