Video-guided Machine Translation (VMT) aims to improve translation quality by integrating contextual information from paired short video clips. Mainstream VMT approaches typically incorporate multimodal information by uniformly sampling frames from the input videos. However, this paradigm frequently incurs significant computational overhead and introduces redundant multimodal content, which degrades both efficiency and translation quality. To tackle these challenges, we propose SHIFT (Selected Helpful Informative Frame for Translation). It is a lightweight, plug-and-play framework designed for VMT with Multimodal Large Language Models (MLLMs). SHIFT adaptively selects a single informative key frame when visual context is necessary; otherwise, it relies solely on textual input. This process is guided by a dedicated clustering module and a selector module. Experimental results demonstrate that SHIFT enhances the performance of MLLMs on the VMT task while simultaneously reducing computational cost, without sacrificing generalization ability.
SHIFT: Selected Helpful Informative Frame for Video-guided Machine Translation
Boyu Guan,Chuang Han,Yining Zhang,Yupu Liang,Zhiyang Zhang,Yang Zhao,Chengqing Zong
Published 2025 in Conference on Empirical Methods in Natural Language Processing
PUBLICATION RECORD
- Fields of study
Computer Science
- Source metadata
Semantic Scholar