Automated audio captioning with recurrent neural networks

K. Drossos, Sharath Adavanne, Tuomas Virtanen

Published 2017 in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

ABSTRACT

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
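
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes an additive alignment model in the style of "Neural Machine Translation by Jointly Learning to Align and Translate" (listed in the references); the abstract does not confirm that the paper uses exactly this form. The function names, dimensions, and random weights below are hypothetical, and the GRU encoder and decoder are replaced by random vectors. The point shown is that the same fully connected weights are reused at every encoder timestep.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def align(encoder_outputs, decoder_state, W_enc, W_dec, v):
    """Additive alignment (assumed form): one weight per encoder timestep.

    W_enc, W_dec and v are shared across all timesteps, mirroring the
    "shared weights between timesteps" wording in the abstract.
    """
    dec_proj = matvec(decoder_state, W_dec)
    scores = []
    for h in encoder_outputs:
        enc_proj = matvec(h, W_enc)
        hidden = [math.tanh(a + b) for a, b in zip(enc_proj, dec_proj)]
        scores.append(sum(x * y for x, y in zip(hidden, v)))
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context vector: weighted sum of the encoder outputs.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context


# Hypothetical sizes: 50 encoder timesteps, small feature dimensions.
random.seed(0)
T, D_ENC, D_DEC, D_ATT = 50, 8, 6, 4


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


encoder_outputs = rand_matrix(T, D_ENC)   # stand-in for bi-directional GRU outputs
decoder_state = rand_matrix(1, D_DEC)[0]  # stand-in for the previous decoder state
W_enc = rand_matrix(D_ENC, D_ATT)
W_dec = rand_matrix(D_DEC, D_ATT)
v = rand_matrix(1, D_ATT)[0]

weights, context = align(encoder_outputs, decoder_state, W_enc, W_dec, v)
```

In a full captioning model the context vector would be fed to the decoder at each output step, and the weights would be learned jointly with the encoder and decoder rather than drawn at random.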

PUBLICATION RECORD

Publication year
2017
Venue
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Publication date
2017-06-30
Fields of study
Computer Science
Identifiers
DOI 10.1109/WASPAA.2017.8170058 arXiv 1706.10006
Source metadata
Semantic Scholar

REFERENCES

Sound event detection using spatial features and convolutional recurrent neural network
2017, cited by this paper
Theano: A Python framework for fast computation of mathematical expressions
2016, cited by this paper
Image Captioning with Semantic Attention
2016, cited by this paper
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
2016, influential reference
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
2016, influential reference
Beyond caption to narrative: Video captioning with multiple sentences
2016, cited by this paper
Microsoft COCO Captions: Data Collection and Evaluation Server
2015, influential reference
CIDEr: Consensus-based image description evaluation
2014, influential reference
Show and tell: A neural image caption generator
2014, influential reference
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014, influential reference
Adam: A Method for Stochastic Optimization
2014, cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014, influential reference
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
2014, cited by this paper
Maxout Networks
2013, cited by this paper
Automatic audio tagging using covariate shift adaptation
2010, cited by this paper
Understanding the difficulty of training deep feedforward neural networks
2010, cited by this paper
METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments
2007, influential reference
ROUGE: A Package for Automatic Evaluation of Summaries
2004, influential reference
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, influential reference

CITED BY

The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels
2026, cites this paper
Decoding content taxonomy in video for industry-specific contextual advertising
2026, cites this paper
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
2025, cites this paper
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
2025, cites this paper
OnomaCap: Making Non-speech Sound Captions Accessible and Enjoyable through Onomatopoeic Sound Representation
2025, cites this paper
Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection
2025, cites this paper
JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection
2025, cites this paper
Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining
2025, cites this paper
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
2025, cites this paper
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
2025, cites this paper
Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks
2025, cites this paper
Transfer Learning for Audio Captioning
2025, cites this paper
Evaluating Transformer-Based Architectures for Simultaneous Audio Speech Transcription and Background Audio Captioning
2025, cites this paper
MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
2025, cites this paper
DistillCaps: Enhancing Audio-Language Alignment in Captioning via Retrieval-Augmented Knowledge Distillation
2025, cites this paper
Auditory Intelligence: Understanding the World Through Sound
2025, cites this paper
MiDashengLM: Efficient Audio Understanding with General Audio Captions
2025, cites this paper
Multiple Choice Learning of Low Rank Adapters for Language Modeling
2025, cites this paper
A Detailed Audio-Text Data Simulation Pipeline Using Single-Event Sounds
2024, cites this paper
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
2024, cites this paper
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
2024, cites this paper
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
2024, cites this paper
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
2024, cites this paper
Diffusion-based diverse audio captioning with retrieval-guided Langevin dynamics
2024, cites this paper
Audio-Language Datasets of Scenes and Events: A Survey
2024, cites this paper
Analysis and interpretation of joint source separation and sound event detection in domestic environments
2024, cites this paper
Zero-Shot Audio Captioning Using Soft and Hard Prompts
2024, cites this paper
AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline With Large Audio and Language Models
2024, cites this paper
Construction and Analysis of Impression Caption Dataset for Environmental Sounds
2024, cites this paper
A decade of DCASE: Achievements, practices, evaluations and future challenges
2024, cites this paper
Soundscape Captioning using Sound Affective Quality Network and Large Language Model
2024, cites this paper
Generating Accurate and Diverse Audio Captions Through Variational Autoencoder Framework
2024, cites this paper
MMAD:Multi-modal Movie Audio Description
2024, cites this paper
Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube
2024, cites this paper
ACTUAL: Audio Captioning With Caption Feature Space Regularization
2023, cites this paper
A review of deep learning techniques in audio event recognition (AER) applications
2023, cites this paper
Improving Audio Caption Fluency with Automatic Error Correction
2023, cites this paper
A Novel Metric For Evaluating Audio Caption Similarity
2023, cites this paper
Semi-supervsied Learning-based Sound Event Detection using Freuqency Dynamic Convolution with Large Kernel Attention for DCASE Challenge 2023 Task 4
2023, cites this paper
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
2023, cites this paper
Spice+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models
2023, cites this paper
Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning
2023, cites this paper
Automated Audio Captioning With Topic Modeling
2023, influential citation
HEAR4Health: a blueprint for making computer audition a staple of modern healthcare
2023, cites this paper
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
2023, cites this paper
Prefix Tuning for Automated Audio Captioning
2023, cites this paper
Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator
2023, cites this paper
Graph Attention for Automated Audio Captioning
2023, cites this paper
Machine Generation of Audio Description for Blind and Visually Impaired People
2023, cites this paper
Multitask Learning in Audio Captioning: A Sentence Embedding Regression Loss Acts as a Regularizer
2023, cites this paper
Audio Captioning Using Semantic Alignment Enhancer
2023, cites this paper
Evaluating the Potential and Realized Impact of Data Augmentations
2023, cites this paper
Zero-shot audio captioning with audio-language model guidance and audio context keywords
2023, cites this paper
Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
2023, cites this paper
Improving Audio Captioning Models with Fine-Grained Audio Features, Text Embedding Supervision, and LLM Mix-Up Augmentation
2023, cites this paper
Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions
2023, cites this paper
Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
2023, cites this paper
Audio Difference Learning for Audio Captioning
2023, cites this paper
CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding
2023, cites this paper
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
2023, cites this paper
Separate Anything You Describe
2023, cites this paper
Using various pre-trained models for audio feature extraction in automated audio captioning
2023, cites this paper
Retrieve, Reason, and Refine: Generating Accurate and Faithful Patient Instructions
2022, cites this paper
Local Information Assisted Attention-Free Decoder for Audio Captioning
2022, cites this paper
Automatic Audio Captioning using Attention weighted Event based Embeddings
2022, cites this paper
Joint Speech Recognition and Audio Captioning
2022, cites this paper
Leveraging Pre-trained BERT for Audio Captioning
2022, cites this paper
Separate What You Describe: Language-Queried Audio Source Separation
2022, influential citation
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
2022, cites this paper
Caption Feature Space Regularization for Audio Captioning
2022, cites this paper
Automated Audio Captioning using Audio Event Clues
2022, influential citation
Augmented/Mixed Reality Audio for Hearables: Sensing, control, and rendering
2022, cites this paper
Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning
2022, cites this paper
Automated audio captioning: an overview of recent progress and new challenges
2022, influential citation
Automated Audio Captioning and Language-Based Audio Retrieval
2022, influential citation
An investigation on selecting audio pre-trained models for audio captioning
2022, cites this paper
Many but not all deep neural network audio models capture brain responses and exhibit hierarchical region correspondence
2022, cites this paper
iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning
2022, cites this paper
Automated Audio Captioning via Fusion of Low- and High- Dimensional Features
2022, cites this paper
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
2022, cites this paper
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
2022, cites this paper
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
2022, cites this paper
Towards Generating Diverse Audio Captions via Adversarial Training
2022, influential citation
Web Framework for Enhancing Automated Audio Captioning Performance for Domestic Environment
2022, cites this paper
Can Audio Captions Be Evaluated With Image Caption Metrics?
2021, cites this paper
Transfer Learning followed by Transformer for Automated Audio Captioning
2021, cites this paper
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAING AND WORD SELECTION METHODS
2021, cites this paper
Automated Audio Captioning with Weakly Supervised Pre-Training and Word Selection Methods
2021, cites this paper
Leveraging State-of-the-art ASR Techniques to Audio Captioning
2021, cites this paper
Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags
2021, cites this paper
Audio Retrieval With Natural Language Queries: A Benchmark Study
2021, cites this paper
New Avenues in Audio Intelligence: Towards Holistic Real-life Audio Understanding
2021, cites this paper
MARVEL: Multimodal Extreme Scale Data Analytics for Smart Cities Environments
2021, cites this paper
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
2021, cites this paper
Diverse Audio Captioning Via Adversarial Training
2021, cites this paper
Unsupervised Audio-Caption Aligning Learns Correspondences Between Individual Sound Events and Textual Phrases
2021, cites this paper
Audio Captioning Using Sound Event Detection
2021, cites this paper
Query-graph with Cross-gating Attention Model for Text-to-Audio Grounding
2021, cites this paper
AUDIO CAPTION GENERATION FROM IMAGES USING DEEP LEARNING
2021, cites this paper
A MULTIMODAL WAVETRANSFORMER ARCHITECTURE CONDITIONED ON OPENL3 EMBEDDINGS FOR AUDIO-VISUAL SCENE CLASSIFICATION Technical Report
2021, cites this paper