Reasoning-Augmented Representations for Multimodal Retrieval

Jianrui Zhang, A. Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee

Published 2026 in Unknown venue

ABSTRACT

Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
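
The three enhancement steps named in the abstract (dense captioning of corpus images, query rewriting, and applying the same enhancement to training data) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Entry` type, the function names, the prompt strings, the `[caption]` marker, and the `stub_vlm` stand-in are all hypothetical, chosen only to show how reasoning might be externalized before any embedding is computed.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: any callable mapping (prompt, optional image ref) to text.
# A real system would call a Vision-Language Model here.
VLM = Callable[[str, Optional[str]], str]


@dataclass(frozen=True)
class Entry:
    text: str
    image: Optional[str] = None  # path or ID of an attached image, if any


def enhance_corpus_entry(entry: Entry, vlm: VLM) -> Entry:
    """Make 'silent' visual evidence explicit by appending a dense caption."""
    if entry.image is None:
        return entry  # text-only entries are left unchanged
    caption = vlm("Describe all visible evidence in detail.", entry.image)
    # The "[caption]" marker is an assumed convention, not taken from the paper.
    return replace(entry, text=f"{entry.text} [caption] {caption}".strip())


def enhance_query(query: Entry, vlm: VLM) -> Entry:
    """Resolve ambiguous references and compress a verbose instruction."""
    prompt = "Rewrite as a concise, self-contained retrieval constraint: " + query.text
    return replace(query, text=vlm(prompt, query.image))


def build_training_pairs(
    pairs: List[Tuple[Entry, Entry]], vlm: VLM
) -> List[Tuple[Entry, Entry]]:
    """Enhance (query, document) pairs so training matches inference-time inputs."""
    return [(enhance_query(q, vlm), enhance_corpus_entry(d, vlm)) for q, d in pairs]


def stub_vlm(prompt: str, image: Optional[str]) -> str:
    """Stand-in for a real VLM so the sketch runs without any model."""
    return f"<vlm:{image or 'no-image'}>"
```

Running `build_training_pairs` over the training set, rather than enhancing only at inference time, mirrors the abstract's point that the retriever must be trained on the enhanced representations to avoid distribution shift.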

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-02-06
Fields of study: Computer Science
Identifiers: arXiv 2602.07125
Source metadata: Semantic Scholar

REFERENCES

Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval (2025; cited by this paper)
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (2025; cited by this paper)
Unifying Multimodal Retrieval via Document Screenshot Embedding (2024; cited by this paper)
E5-V: Universal Embeddings with Multimodal Large Language Models (2024; influential reference)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks (2024; cited by this paper)
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant (2024; influential reference)
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024; influential reference)
Sigmoid Loss for Language Image Pre-Training (2023; influential reference)
Visual Instruction Tuning (2023; cited by this paper)
ImageBind: One Embedding Space To Bind Them All (2023; cited by this paper)
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers (2023; influential reference)
Learning Transferable Visual Models From Natural Language Supervision (2021; influential reference)
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020; cited by this paper)
