SHIFT: Selected Helpful Informative Frame for Video-guided Machine Translation

Boyu Guan, Chuang Han, Yining Zhang, Yupu Liang, Zhiyang Zhang, Yang Zhao, Chengqing Zong

Published 2025 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Video-guided Machine Translation (VMT) aims to improve translation quality by integrating contextual information from paired short video clips. Mainstream VMT approaches typically incorporate multimodal information by uniformly sampling frames from the input videos. However, this paradigm frequently incurs significant computational overhead and introduces redundant multimodal content, which degrades both efficiency and translation quality. To tackle these challenges, we propose SHIFT (Selected Helpful Informative Frame for Translation), a lightweight, plug-and-play framework designed for VMT with Multimodal Large Language Models (MLLMs). SHIFT adaptively selects a single informative key frame when visual context is necessary; otherwise, it relies solely on textual input. This process is guided by a dedicated clustering module and a selector module. Experimental results demonstrate that SHIFT enhances the performance of MLLMs on the VMT task while simultaneously reducing computational cost, without sacrificing generalization ability.
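
The abstract names a clustering module and a selector module but does not say how either is implemented. The following is only an illustrative sketch of that two-stage idea, not the authors' method. It assumes frame and text features are plain vectors, uses k-means for the clustering step, and uses a cosine-similarity threshold as a stand-in selector. The function names (`kmeans`, `candidate_frames`, `select_frame`) and the parameters `k` and `threshold` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd-style k-means with deterministic, evenly spaced initialization."""
    n = len(points)
    if n == 0:
        return [], []
    k = max(1, min(k, n))
    centroids = [list(points[(i * n) // k]) for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: attach each frame to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def candidate_frames(frame_feats: List[Vector], k: int = 3) -> List[int]:
    """Assumed clustering stage: one representative frame index per non-empty cluster."""
    centroids, assign = kmeans(frame_feats, k)
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(frame_feats[i], centroid)))
    return sorted(reps)


def select_frame(frame_feats: List[Vector], text_feat: Vector,
                 k: int = 3, threshold: float = 0.5) -> Optional[int]:
    """Assumed selector stage: return the best candidate's index, or None for text-only."""
    reps = candidate_frames(frame_feats, k)
    if not reps:
        return None
    best = max(reps, key=lambda i: cosine(frame_feats[i], text_feat))
    # Below the (hypothetical) threshold, fall back to translating from text alone.
    return best if cosine(frame_feats[best], text_feat) >= threshold else None
```

Under these assumptions, `select_frame(frames, text_feat)` returns either the index of a single frame to pass to the MLLM or `None`, which mirrors the abstract's behavior of using one key frame when visual context helps and text alone otherwise. A learned selector would replace the fixed threshold in practice.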

PUBLICATION RECORD

Publication year
2025
Venue
Conference on Empirical Methods in Natural Language Processing
Fields of study
Computer Science
Identifiers
DOI 10.18653/v1/2025.emnlp-main.161
Source metadata
Semantic Scholar
