Attention-Based Models for Speech Recognition

J. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio

Published 2015 in Neural Information Processing Systems

ABSTRACT

Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1,2] and image caption generation [3]. We extend the attention mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in [2] reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER on single utterances and 20% on 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to 17.6%.
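The PER figures quoted in the abstract are error rates over phoneme sequences. As a rough illustration only, the sketch below computes PER in the usual way: total Levenshtein (edit) distance between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The function names and the example phone labels are hypothetical, and this is not the paper's scoring pipeline, which is not described on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists of phone labels)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis costs i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """PER = total edit errors / total reference phonemes, over paired utterances."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference transcripts contain no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Hypothetical example: one substitution (ax -> ah) and one deletion (t).
reference = [["sil", "dh", "ax", "k", "ae", "t", "sil"]]
hypothesis = [["sil", "dh", "ah", "k", "ae", "sil"]]
per = phoneme_error_rate(reference, hypothesis)
print(f"PER = {per:.1%}")
```

With two errors against seven reference phonemes, the example gives a PER of 2/7. Pooling errors over all utterances before dividing, rather than averaging per-utterance rates, keeps long utterances from being under-weighted.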

PUBLICATION RECORD

Publication year: 2015
Venue: Neural Information Processing Systems
Publication date: 2015-06-24
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1506.07503
Source metadata: Semantic Scholar

REFERENCES

Connectionist Temporal Classification: Labelling Unsegmented Sequences with Recurrent Neural Networks
2016 (cited by this paper)
On Using Monolingual Corpora in Neural Machine Translation
2015 (cited by this paper)
Blocks and Fuel: Frameworks for deep learning
2015 (cited by this paper)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015 (cited by this paper)
End-To-End Memory Networks
2015 (cited by this paper)
Weakly Supervised Memory Networks
2015 (cited by this paper)
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
2014 (cited by this paper)
Towards End-To-End Speech Recognition with Recurrent Neural Networks
2014 (cited by this paper)
Memory Networks
2014 (cited by this paper)
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014 (cited by this paper)
Sequence to Sequence Learning with Neural Networks
2014 (cited by this paper)
Neural Machine Translation by Jointly Learning to Align and Translate
2014 (influential reference)
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
2014 (cited by this paper)
Neural Turing Machines
2014 (influential reference)
Deep Speech: Scaling up end-to-end speech recognition
2014 (cited by this paper)
Recurrent Models of Visual Attention
2014 (cited by this paper)
Generating Sequences With Recurrent Neural Networks
2013 (cited by this paper)
Pylearn2: a machine learning research library
2013 (cited by this paper)
Speech recognition with deep recurrent neural networks
2013 (cited by this paper)
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
2012 (cited by this paper)
Improving neural networks by preventing co-adaptation of feature detectors
2012 (cited by this paper)
ADADELTA: An Adaptive Learning Rate Method
2012 (cited by this paper)
Sequence Transduction with Recurrent Neural Networks
2012 (influential reference)
Theano: new features and speed improvements
2012 (cited by this paper)
The Kaldi Speech Recognition Toolkit
2011 (cited by this paper)
Practical Variational Inference for Neural Networks
2011 (cited by this paper)
Theano: A CPU and GPU Math Compiler in Python
2010 (cited by this paper)
Et al
2008 (cited by this paper)
The Application of Hidden Markov Models in Speech Recognition
2007 (cited by this paper)
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
2006 (cited by this paper)
Gradient-based learning applied to document recognition
1998 (cited by this paper)
Long Short-Term Memory
1997 (cited by this paper)
DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM (TIMIT), NIST
1993 (cited by this paper)
DARPA TIMIT: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1
1993 (cited by this paper)

CITED BY

A novel dendritic neuron model enhanced by the synaptic-attention mechanism and fusion-dendritic layer
2026 (cites this paper)
Explainable attention based few shot LSTM for intrusion detection in imbalanced cyber physical system networks
2026 (cites this paper)
Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances
2026 (cites this paper)
823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio
2026 (cites this paper)
FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition
2026 (cites this paper)
Machine learning-based automatic detection and prediction of cracks and corrosion using spatiotemporal measurements from distributed fiber optic sensors
2026 (cites this paper)
Align-Consistency: Improving Non-autoregressive and Semi-supervised ASR with Consistency Regularization
2026 (cites this paper)
An Autonomous Driving Road Surface Target Detection Method Based on Adaptive Fusion Features
2026 (cites this paper)
Bridging Languages in Healthcare: A Comprehensive Review of Multilingual and Code-Switched Chatbot Interactions
2026 (cites this paper)
Incremental Expansion Analysis and State-of-Health Estimation for Lithium-Ion Batteries
2026 (cites this paper)
Arabic speech command recognition using an enhanced CNN-LSTM model with attention and data augmentation
2026 (cites this paper)
Noise-augmented transformer-based automatic speech recognizer using a novel noise distillation system
2026 (cites this paper)
MDM-ASR: Bridging Accuracy and Efficiency in ASR with Diffusion-Based Non-Autoregressive Decoding
2026 (cites this paper)
A Dual-Branch Deep Learning Approach for Passive Sonar Underwater Target Classification
2026 (cites this paper)
Sublinear Time Quantum Algorithm for Attention Approximation
2026 (cites this paper)
EyeLayer: Integrating Human Attention Patterns into LLM-Based Code Summarization
2026 (cites this paper)
Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
2026 (cites this paper)
A multimodal runoff forecasting model based on clustering enhancement and attention mechanism
2025 (cites this paper)
Multimodal real-time English translation based on the transformer architecture
2025 (cites this paper)
Human-centric urban thermal comfort prediction using a BiLSTM-GRU-attention hybrid deep learning framework
2025 (cites this paper)
Detection method for improving shape perception of small object defects on metal surfaces
2025 (cites this paper)
Traffic and Weather Data Fusion for Traffic Prediction in Sustainable Cities
2025 (cites this paper)
Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling
2025 (cites this paper)
Text-Enhanced Audio Encoder for Large Language Model based Speech Recognition via Cross-Modality Pre-training with Unpaired Audio-Text Data
2025 (cites this paper)
Towards Efficiently Whisper Fine-tuning with Monotonic Alignments
2025 (cites this paper)
RheOFormer: A generative transformer model for simulation of complex fluids and flows
2025 (cites this paper)
DRLCCI: A hybrid fusion network leveraging disentangled representation learning and cross-modal collaborative interaction for multi-modal sentiment analysis
2025 (cites this paper)
NLP Based Automated Language Translation Leveraging
2025 (cites this paper)
Bridging semantics across modalities: Decoupled representation learning for audio-visual speech recognition
2025 (cites this paper)
A Hybrid Deep Learning Approach for Stock Market Prediction: Integrating EEMD, CNN-LSTM, and Attention Mechanism
2025 (cites this paper)
Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition
2025 (cites this paper)
Enhancing Speech Recognition Performance via Robust Feature Extraction
2025 (cites this paper)
Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches
2025 (cites this paper)
Micromobility Flow Prediction: A Bike Sharing Station-level Study via Multi-level Spatial-Temporal Attention Neural Network
2025 (cites this paper)
Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions
2025 (cites this paper)
Toward Enhancing Cross-Lingual Domain Knowledge Sharing and Transferring for Multilingual Domain Adaptation in NMT
2025 (cites this paper)
Stochastic characteristics of vehicle-bridge vibration under earthquakes with parameter uncertainty: A deep learning-based model
2025 (cites this paper)
Enhancing short-term traffic prediction by integrating trends and fluctuations with attention mechanism
2025 (cites this paper)
MyanSpeech: Joint CTC-Attention and RNN Language Model for End-To-End Read Speech Recognition
2025 (cites this paper)
Automated mitosis detection in stained histopathological images using Faster R-CNN and stain techniques
2025 (cites this paper)
End-to-End Tibetan Lhasa Dialect Speech Recognition Research Based on Conformer
2025 (cites this paper)
Exploring Contextual Knowledge-Enhanced Speech Recognition in Air Traffic Control Communication: A Comparative Study
2025 (cites this paper)
Graph Convolutional Neural Network Algorithms for Bearing Remaining Useful Life Prediction: A Review
2025 (cites this paper)
RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson's Disease Classification
2025 (cites this paper)
Pattern matters: A deep learning approach with attention mechanism for text abstraction in low-ranked languages
2025 (cites this paper)
Inferring Speaking Styles for Conversational Speech Synthesis by Learning Contextual Dependencies
2025 (cites this paper)
$A^{2}H^{2}$ for multimodal emotional data analysis
2025 (cites this paper)
Mitigating Data Imbalance in Automated Speaking Assessment
2025 (cites this paper)
End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model
2025 (cites this paper)
Spot and Merge: A Hybrid Context Biasing Approach for Rare Word and Out of Vocabulary Recognition
2025 (cites this paper)
Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition
2025 (cites this paper)
Integrating Attention-Enhanced LSTM and Particle Swarm Optimization for Dynamic Pricing and Replenishment Strategies in Fresh Food Supermarkets
2025 (cites this paper)
Psychological Mechanism of Automatic Speech Recognition with Deep Multipath Convolutional Neural Networks
2025 (cites this paper)
The NTNU System at the S&I Challenge 2025 ASR Open Track
2025 (cites this paper)
Rethinking Nonlinearity: Trainable Gaussian Mixture Modules for Modern Neural Architectures
2025 (cites this paper)
How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu
2025 (cites this paper)
Protecting the Mixed-Signal Domain: Secure ADCs for Internet of Things Devices
2025 (cites this paper)
StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
2025 (cites this paper)
A Neural Model for Contextual Biasing Score Learning and Filtering
2025 (cites this paper)
A Data-Driven Multi-Granularity Attention Framework for Sentiment Recognition in News and User Reviews
2025 (cites this paper)
Drilling Conditions Classification Model Based on an Expert-Knowledge-Fused Convolutional Neural Network with a Multihead Attention Mechanism
2025 (cites this paper)
Epidemiologically-Inspired Dynamic Attention: Leveraging SIR Compartmental Models for Enhanced Neural Network Focus Mechanisms
2025 (cites this paper)
A Hybrid Architecture Combining CNN, LSTM, and Attention Mechanisms for Automatic Speech Recognition
2025 (influential citation)
Multimodal GRU with directed pairwise cross-modal attention for sentiment analysis
2025 (cites this paper)
Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces
2025 (cites this paper)
Attention-driven multi-model architecture for unbalanced network traffic intrusion detection via extreme gradient boosting
2025 (cites this paper)
Cutting‐Edge Deep Learning Methods for Image‐Based Object Detection in Autonomous Driving: In‐Depth Survey
2025 (cites this paper)
Combining multilingual resources to enhance end-to-end speech recognition systems for Scandinavian languages
2025 (cites this paper)
Learning Conjecturing from Scratch
2025 (cites this paper)
SignEdgeLVM transformer model for enhanced sign language translation on edge devices
2025 (cites this paper)
Token-Level Contextual Network with Ladder-Shaped Attention for End-to-End ASR
2025 (cites this paper)
Real-Time Language Translation Using IoT Technology
2025 (cites this paper)
iDANSE: Iterative Data-driven Nonlinear State Estimation of Model-free Hidden Sequences
2025 (cites this paper)
Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss
2025 (cites this paper)
Multi-modal Streaming ASR in Cross-talk Scenario for Smart Glasses
2025 (cites this paper)
Enhancing patient-independent detection of freezing of gait in Parkinson’s disease with deep adversarial network
2025 (cites this paper)
Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition
2025 (cites this paper)
Automatic Speech Recognition: Comparisons Between Convolutional Neural Networks, Hidden Markov Model and Hybrid Architecture
2025 (cites this paper)
GADS: A Super Lightweight Model for Head Pose Estimation
2025 (cites this paper)
A Multi Label Glaucoma Classification Using a Lightweight Attention Mechanisms
2025 (cites this paper)
Wenzhou Dialect Speech to Mandarin Text Conversion
2025 (cites this paper)
Empowering Dysarthric Communication: Hybrid Transformer-CTC-Based Speech Recognition System
2025 (cites this paper)
Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR
2025 (cites this paper)
Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation
2025 (cites this paper)
WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing
2025 (cites this paper)
DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition
2025 (cites this paper)
Slide-EC: A Slide-Based Error Correction Framework for Speech Recognition
2025 (cites this paper)
Generative artificial intelligence-based modified abstractive cross attention enabled sequence to sequence model for abstractive Hindi text summarization
2025 (cites this paper)
Integrating Ordinary Differential Equations With Sparse Attention for Power Load Forecasting
2025 (cites this paper)
A survey of semantic extraction for speech semantic communications: Metrics, approaches, perspectives and challenges
2025 (cites this paper)
Multi-scale Feature Refinement via Perspective Scaling and Adaptive Regularization for text-based person search
2025 (cites this paper)
A Cross-Modal Generation Algorithm for Temporal Force Tactile Data for Multidimensional Haptic Rendering
2025 (cites this paper)
A multi-head adaptive actor-critic algorithm for solving vehicle routing problems
2025 (cites this paper)
LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness
2025 (cites this paper)
A novel deep learning model combining 3DCNN-CapsNet and hierarchical attention mechanism for EEG emotion recognition
2025 (cites this paper)
CS2former: Multimodal feature fusion transformer with dual channel-spatial feature extraction module for bipolar disorder diagnosis
2025 (cites this paper)
ML-PINN: A memory-efficient physics-informed Mamba-LSTM network for fast and accurate PDE solving
2025 (cites this paper)
Noisy Disentanglement with Tri-stage Training for Noise-Robust Speech Recognition
2025 (cites this paper)
Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses Confusion
2025 (cites this paper)
Positional Encoding in Transformer-Based Time Series Models: A Survey
2025 (cites this paper)