A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
Kyle Buettner, Jacob Emmerson, Adriana Kovashka
Published 2025 in IJCNLP-AACL
ABSTRACT
When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially distinctive across languages and cultures. Modern vision-language models (VLMs) often gain understanding of images paired with text in different languages through training on machine translations of English captions. However, this process relies on input content written from the perspective of English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. We specifically use a small amount of native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to augment captions so they better reflect descriptions in a target language. When adding the resulting rewrites to multilingual CLIP finetuning, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.
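The two core steps named in the abstract, nearest-neighbor example guidance and LLM-based caption rewriting, can be illustrated with a minimal sketch. The snippet below assumes caption embeddings have already been computed with a multilingual text encoder; the helper names (`nearest_native_examples`, `build_rewrite_prompt`), the placeholder data, and the prompt wording are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of nearest-neighbor example guidance for recaptioning.
# Assumption: caption embeddings are precomputed with a multilingual text encoder.
import numpy as np

def nearest_native_examples(query_emb, native_embs, native_captions, k=3):
    """Return the k native-speaker captions closest to the query caption."""
    # Cosine similarity between the query embedding and every native caption embedding.
    q = query_emb / np.linalg.norm(query_emb)
    n = native_embs / np.linalg.norm(native_embs, axis=1, keepdims=True)
    sims = n @ q
    top = np.argsort(-sims)[:k]
    return [native_captions[i] for i in top]

def build_rewrite_prompt(translated_caption, examples):
    """Assemble a prompt asking a multimodal LLM to rewrite a machine-translated
    caption so its object descriptions reflect native-speaker usage."""
    example_block = "\n".join(f"- {e}" for e in examples)
    return (
        "Here are captions written by native speakers for similar images:\n"
        f"{example_block}\n\n"
        "Rewrite the following machine-translated caption so that its object "
        "descriptions match the style of the examples, keeping the image content:\n"
        f"{translated_caption}"
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    native_captions = ["Beispielbeschreibung A", "Beispielbeschreibung B", "Beispielbeschreibung C"]
    native_embs = rng.normal(size=(3, 512))   # placeholder native-caption embeddings
    query_emb = rng.normal(size=512)          # placeholder embedding of the translated caption
    examples = nearest_native_examples(query_emb, native_embs, native_captions, k=2)
    print(build_rewrite_prompt("Ein maschinell übersetzter Bildtext.", examples))
```

The resulting rewrites would then be added to the multilingual CLIP finetuning data, which is where the reported retrieval gains are measured.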
PUBLICATION RECORD
- Publication year: 2025
- Venue: IJCNLP-AACL
- Publication date: 2025-04-19
- Fields of study: Linguistics, Computer Science
- Source metadata: Semantic Scholar