Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems

Mingyu Cui, Mengzhe Geng, Jiajun Deng, Chengxi Deng, Jiawen Kang, Shujie Hu, Guinan Li, Tianzi Wang, Zhaoqing Li, Xie Chen, Xunying Liu

Published 2025 in IEEE Transactions on Audio, Speech, and Language Processing

ABSTRACT

This paper investigates four approaches to modeling cross-utterance speech contexts in streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; and iv) a novel chunk-based approach applied to C-T models for the first time. An efficient batch-training scheme is proposed for contextual C-Ts that uses spliced speech utterances within each minibatch to minimize the synchronization overhead while preserving the sequential order of cross-utterance speech contexts. Experiments are conducted on four benchmark speech datasets across three languages: the English GigaSpeech and Mandarin WenetSpeech corpora, used to pre-train the contextual C-T models; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets, used for domain fine-tuning. The best-performing contextual C-T systems consistently outperform their respective baselines, which use no cross-utterance speech contexts, in both the pre-training and fine-tuning stages, with statistically significant average word error rate (WER) or character error rate (CER) reductions of up to 0.9%, 1.1%, 0.51%, and 0.98% absolute (6.0%, 5.4%, 2.0%, and 3.4% relative) on the four tasks, respectively. Their competitive performance against Wav2vec2.0-Conformer, XLSR-128, and Whisper models highlights the potential benefit of incorporating cross-utterance speech contexts into current speech foundation models.
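
To make the difference between the concatenation and pooling-projection ideas concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, the use of mean pooling, and the identity projection weights are all hypothetical, and the chunk-based approach and the batch-training scheme are not shown.

```python
from typing import List

Frame = List[float]    # one feature or embedding vector
Frames = List[Frame]   # time-ordered frames of one utterance


def concat_along_time(context: Frames, current: Frames) -> Frames:
    """Approaches i) and ii): prepend context frames to the current utterance.

    Applied to raw audio features this mimics input concatenation; applied to
    encoder outputs it mimics embedding concatenation.
    """
    return context + current


def mean_pool(context: Frames) -> Frame:
    """Collapse a variable-length context into one fixed-size vector."""
    dim = len(context[0])
    return [sum(frame[d] for frame in context) / len(context) for d in range(dim)]


def project(vector: Frame, weight: List[List[float]]) -> Frame:
    """Linear projection: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def pool_and_project(context: Frames, weight: List[List[float]]) -> Frame:
    """Approach iii): compact context vector (mean pooling is an assumption)."""
    return project(mean_pool(context), weight)


# Toy data: a 3-frame preceding utterance and a 2-frame current one, dimension 2.
prev_emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cur_emb = [[7.0, 8.0], [9.0, 10.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights

longer = concat_along_time(prev_emb, cur_emb)   # sequence grows with the context
compact = pool_and_project(prev_emb, identity)  # one vector, whatever the context length
print(len(longer), compact)  # prints: 5 [3.0, 4.0]
```

The trade-off the sketch exposes is that concatenation keeps every context frame but lengthens the sequence the model must process, whereas pooling followed by projection yields a fixed-size summary at the cost of discarding temporal detail.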

PUBLICATION RECORD

Publication year
2025
Venue
IEEE Transactions on Audio, Speech, and Language Processing
Publication date
2025-08-14
Fields of study
Computer Science, Engineering
Identifiers
DOI: 10.1109/TASLPRO.2025.3606235; arXiv: 2508.10456
Source metadata
Semantic Scholar
