Multi-Modal Retrieval For Large Language Model Based Speech Recognition

J. Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, A. Rastrow, Grant P. Strimel, I. Bulyko

Published 2024 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend purely text-based methods so that retrieval also incorporates other modalities, supporting applications across a wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. In this setting, we show that speech-based multi-modal retrieval outperforms text-based retrieval and yields up to 50% improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-SQuAD question answering dataset.
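The kNN-LM approach named in the abstract can be illustrated with a minimal sketch of the interpolation step from the original kNN-LM formulation (Generalization through Memorization: Nearest Neighbor Language Models). The abstract does not specify how keys are built from speech, which distance is used, or the hyperparameter values, so the function name `knn_lm_probs`, the interpolation weight `lam`, and the `temperature` parameter below are illustrative assumptions rather than the authors' implementation.

```python
import math


def knn_lm_probs(lm_probs, neighbor_dists, neighbor_tokens, lam=0.25, temperature=1.0):
    """Interpolate a base LM distribution with a kNN distribution.

    lm_probs:        base LM next-token probabilities, indexed by token id.
    neighbor_dists:  distances from the query to each retrieved datastore key.
    neighbor_tokens: token id stored alongside each retrieved key.
    lam, temperature: hypothetical hyperparameters (not taken from the paper).
    """
    # With no retrieved neighbors, fall back to the base LM distribution.
    if not neighbor_dists:
        return list(lm_probs)

    # Softmax over negative distances; subtract the max score for stability.
    scores = [-d / temperature for d in neighbor_dists]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)

    # Aggregate neighbor weights onto the tokens they point at.
    knn_probs = [0.0] * len(lm_probs)
    for weight, token in zip(weights, neighbor_tokens):
        knn_probs[token] += weight / total

    # Linear interpolation: lam * p_kNN + (1 - lam) * p_LM.
    return [lam * pk + (1.0 - lam) * pl for pk, pl in zip(knn_probs, lm_probs)]


# Toy usage: two equally close neighbors both vote for token 1.
mixed = knn_lm_probs([0.7, 0.2, 0.1], [0.0, 0.0], [1, 1], lam=0.5)
```

In a multi-modal setting the query and datastore keys would presumably be derived from model states conditioned on speech, but that detail is an assumption here; the sketch only shows how retrieved neighbors reshape the next-token distribution.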

PUBLICATION RECORD

Publication year
2024
Venue
Annual Meeting of the Association for Computational Linguistics
Publication date
2024-06-13
Fields of study
Computer Science, Engineering
Identifiers
DOI: 10.48550/arXiv.2406.09618; arXiv: 2406.09618
Source metadata
Semantic Scholar

REFERENCES

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
2023, cited by this paper
AudioPaLM: A Large Language Model That Can Speak and Listen
2023, cited by this paper
Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding
2023, cited by this paper
FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model
2023, cited by this paper
LLaMA: Open and Efficient Foundation Language Models
2023, influential reference
Augmented Language Models: a Survey
2023, cited by this paper
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks
2023, cited by this paper
Domain Adaptation with External Off-Policy Acoustic Catalogs for Scalable Contextual End-to-End Automated Speech Recognition
2023, cited by this paper
Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
2023, cited by this paper
On-the-Fly Text Retrieval for end-to-end ASR Adaptation
2023, cited by this paper
Training Language Models with Memory Augmentation
2022, cited by this paper
Robust Speech Recognition via Large-Scale Weak Supervision
2022, cited by this paper
OPT: Open Pre-trained Transformer Language Models
2022, influential reference
Contextual Adapters for Personalized Speech Recognition in Neural Transducers
2022, cited by this paper
Context-Aware Transformer Transducer for Speech Recognition
2021, cited by this paper
Domain-Aware Neural Language Models for Speech Recognition
2021, cited by this paper
Personalization Strategies for End-to-End Speech Recognition Systems
2021, cited by this paper
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
2021, cited by this paper
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
2021, cited by this paper
SLUE: New Benchmark Tasks For Spoken Language Understanding Evaluation on Natural Speech
2021, cited by this paper
Memformer: A Memory-Augmented Transformer for Sequence Modeling
2020, cited by this paper
Deep Shallow Fusion for RNN-T Personalization
2020, cited by this paper
MLS: A Large-Scale Multilingual Dataset for Speech Research
2020, influential reference
CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus
2020, cited by this paper
Dense Passage Retrieval for Open-Domain Question Answering
2020, cited by this paper
REALM: Retrieval-Augmented Language Model Pre-Training
2020, cited by this paper
On Layer Normalization in the Transformer Architecture
2020, cited by this paper
Generalization through Memorization: Nearest Neighbor Language Models
2019, influential reference
Libri-Light: A Benchmark for ASR with Limited or No Supervision
2019, cited by this paper
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
2019, cited by this paper
Shallow-Fusion End-to-End Contextual Biasing
2019, cited by this paper
TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation
2018, cited by this paper
Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension
2018, cited by this paper
Billion-Scale Similarity Search with GPUs
2017, cited by this paper

CITED BY

Contextual ASR with Retrieval Augmented Large Language Model
2025, cites this paper
NoLoCo: No-all-reduce Low Communication Training Method for Large Models
2025, cites this paper
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
2024, cites this paper
Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
2024, cites this paper
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
2024, cites this paper