Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT

Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu, Christian Fuegen

Published 2025 in arXiv.org

ABSTRACT

This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can transcribe in real time, streaming translation in real time remains a significant challenge. To address this, we propose a simultaneous translation approach that balances translation quality against latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and using beam-search pruning techniques such as time-out and forced finalization to maintain the system's real-time factor. We apply our approach to on-device bilingual conversational speech translation and demonstrate that our techniques outperform baselines in both latency and quality. Notably, our technique narrows the quality gap with non-streaming translation systems, paving the way for more accurate and efficient real-time speech translation.
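The abstract names time-out and forced finalization as beam-search pruning techniques but does not describe how they work. The following is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that "time-out" means a wall-clock budget per decoding call and that "forced finalization" means committing the best partial hypothesis once that budget is exhausted. The names `beam_search_with_timeout`, `toy_step`, `budget_s`, and `real_time_factor` are hypothetical. The real-time factor is computed in the usual way, as processing time divided by audio duration.

```python
import time
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log-probability)


def beam_search_with_timeout(
    step_fn: Callable[[List[str]], Dict[str, float]],
    beam_size: int = 4,
    max_len: int = 20,
    budget_s: float = 0.05,
    eos: str = "</s>",
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Beam search that stops when a time budget runs out.

    Assumption: on time-out, the best hypothesis found so far is
    committed as-is ("forced finalization"). Returns (tokens, timed_out).
    """
    start = clock()
    beam: List[Hypothesis] = [([], 0.0)]
    timed_out = False
    for _ in range(max_len):
        # Time-out check before each expansion step.
        if clock() - start > budget_s:
            timed_out = True
            break
        candidates: List[Hypothesis] = []
        for tokens, score in beam:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # keep finished hypotheses
                continue
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Prune to the top `beam_size` hypotheses by score.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = candidates[:beam_size]
        if all(t and t[-1] == eos for t, _ in beam):
            break
    # Forced finalization: emit the best hypothesis, finished or not.
    best_tokens, _ = max(beam, key=lambda h: h[1])
    return [t for t in best_tokens if t != eos], timed_out


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for an MT decoder step: next-token log-probabilities."""
    target = ["hola", "mundo", "</s>"]
    nxt = target[len(prefix)] if len(prefix) < len(target) else "</s>"
    return {nxt: -0.1, "<unk>": -3.0}
```

Calling `beam_search_with_timeout(toy_step, beam_size=2, budget_s=1.0)` runs the toy decoder to completion. Passing a much smaller `budget_s`, or a custom `clock`, triggers the time-out path and returns a partial translation with `timed_out` set to `True`. The trade-off in this sketch is that a tighter budget keeps the real-time factor bounded but may cut a hypothesis short.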

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-08-18
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2508.13358; arXiv 2508.13358


