Speech to Speech Translation with Translatotron: A State of the Art Review

J. R. Kala, E. Adetiba, Abdultaofeek Abayomi, O. Dare, A. Ifijeh

Published 2025 in Results in Engineering

ABSTRACT

Cascade-based speech-to-speech translation has long been considered the benchmark, but it is plagued by issues such as compounding errors and the time taken to translate speech from one language to another. These issues arise because a cascade-based method chains several components: speech recognition, text-to-text translation, and finally text-to-speech synthesis. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compounding errors associated with the cascade model. Today there are three versions of the Translatotron model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results. Translatotron 2 was an improved version of Translatotron 1, with results similar to those of the cascade model. Translatotron 3, the latest version of the model, outperforms the cascade model in some respects. In this paper, a complete review of speech-to-speech translation is presented, with a particular focus on all versions of the Translatotron model. We also show that Translatotron is the best model to bridge the language gap between African languages and other well-formalized languages.

PUBLICATION RECORD

Publication year
2025
Venue
Results in Engineering
Publication date
2025-02-09
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.48550/arXiv.2502.05980 arXiv 2502.05980
Source metadata
Semantic Scholar
