Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor

Published 2024 in Interspeech

ABSTRACT

In recent years, advances in automatic speech recognition (ASR) have led to its widespread use in applications such as call center bots and virtual assistants. However, these systems still struggle with adverse speech conditions, missing contextual information, and rare words. In this paper, we propose a novel architecture that tackles these limitations by integrating Large Language Models (LLMs) and prompt mechanisms to enhance ASR accuracy. The method combines a pre-trained text encoder, a text adapter for task-specific adaptation, and an efficient LLM-based re-prediction mechanism, and it performs well in various real-world scenarios. Compared to a baseline ASR system on multiple public datasets, the proposed system achieves an average relative word error rate improvement of 27% for conventional tasks, 30% for utterance-level contextual tasks, and 33% for word-level biasing tasks.
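
The headline numbers in the abstract are relative word error rate (WER) improvements. As a minimal sketch of how such a figure is typically computed, the Python below implements the standard word-level edit-distance WER and the relative improvement between a baseline and a new system. The function names and the example WER values (10.0 and 7.3) are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Fraction by which the system lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical values, not results from the paper.
    improvement = relative_wer_improvement(baseline_wer=10.0, system_wer=7.3)
    print(f"relative WER improvement: {improvement:.0%}")
```

Under these assumed values, a drop from 10.0% to 7.3% WER is an absolute improvement of 2.7 points but a relative improvement of 27%, which is the kind of figure the abstract reports.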

PUBLICATION RECORD

Publication year: 2024
Venue: Interspeech
Publication date: 2024-09-01
Fields of study: Computer Science
Identifiers: DOI 10.21437/interspeech.2024-1762
Source metadata: Semantic Scholar
