Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

Yuhao Zhang, Xiangnan Ma, Kaiqi Kou, Peizhuo Liu, Weiqiao Shan, Benyou Wang, Tong Xiao, Yuxin Huang, Zheng Yu, Jingbo Zhu

Published 2025 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features from varied speech signals, called cross-modal (CM), and 2) learning the alignment between different languages over long sequences, called cross-lingual (CL). We propose the unit language to overcome these two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the VoxPopuli dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.

PUBLICATION RECORD

Publication date
2025-05-21
Fields of study
Linguistics, Computer Science, Engineering
Identifiers
DOI: 10.48550/arXiv.2505.15333; arXiv: 2505.15333

REFERENCES

Acoustic BPE for Speech Generation with Discrete Tokens
2023, cited by this paper
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
2023, cited by this paper
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
2023, influential reference
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
2023, cited by this paper
Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
2022, cited by this paper
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
2022, cited by this paper
Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
2022, cited by this paper
Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
2021, cited by this paper
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
2021, cited by this paper
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
2021, cited by this paper
Direct Speech-to-Speech Translation With Discrete Units
2021, cited by this paper
Textless Speech-to-Speech Translation on Real Data
2021, influential reference
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
2021, cited by this paper
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
2021, cited by this paper
UWSpeech: Speech to Speech Translation for Unwritten Languages
2020, cited by this paper
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
2020, cited by this paper
Speech-to-Speech Translation Between Untranscribed Unknown Languages
2019, cited by this paper
Direct speech-to-speech translation with a sequence-to-sequence model
2019, cited by this paper
A Call for Clarity in Reporting BLEU Scores
2018, cited by this paper
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
2006, cited by this paper
Prosody Generation for Speech-to-Speech Translation
2006, cited by this paper
Some approaches to statistical and finite-state speech-to-speech translation
2004, cited by this paper
Finite-state speech-to-speech translation
1997, cited by this paper

CITED BY

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
2025, cites this paper