Automatic Audio Captioning using Attention weighted Event based Embeddings

Swapnil Bhosale, Rupayan Chakraborty, S. Kopparapu

Published 2022 in arXiv.org

ABSTRACT

Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of those events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing approaches in the literature that use computationally intensive architectures. Further, we provide evidence that the non-uniform attention-weighted encoding generated as part of our model helps the decoder glance over specific sections of the audio while generating each token.
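
The attention-weighted encoding mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, so dot-product scoring is an assumption here, and the names `temporal_attention`, `frames`, and `query` are hypothetical. The sketch only shows how non-uniform weights over time frames yield a context vector for one decoding step.

```python
import math

def temporal_attention(frames, query):
    """Return (weights, context) for dot-product attention over time frames.

    frames: list of T embedding vectors (each a list of D floats), standing in
            for per-frame outputs of a pre-trained AED model (hypothetical input).
    query:  list of D floats standing in for the decoder state at one step.
    """
    # Score each frame by its dot product with the query (assumed scoring function).
    scores = [sum(f_d * q_d for f_d, q_d in zip(frame, query)) for frame in frames]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the frame embeddings.
    dim = len(frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return weights, context

# Toy example: three frames with two-dimensional embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = temporal_attention(frames, query=[2.0, 0.0])
```

In this toy example the first frame aligns best with the query, so it receives the largest weight. In a captioning model, a weighting of this kind would be recomputed for every generated token.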

PUBLICATION RECORD

Publication year: 2022
Venue: arXiv.org
Publication date: 2022-01-28
Fields of study: Computer Science, Engineering
Identifiers: arXiv 2201.12352
Source metadata: Semantic Scholar


