LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval

Jian Zhang, Junyi Guo, Junyi Yuan, Huanda Lu, Yanlin Zhou, Fangyu Wu, Qiufeng Wang, Dongming Lu

Published 2025 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.

PUBLICATION RECORD

Publication year
2025
Venue
Conference on Empirical Methods in Natural Language Processing
Publication date
2025-11-09
Fields of study
Computer Science
Identifiers
DOI 10.18653/v1/2025.emnlp-main.980 arXiv 2511.06268
Source metadata
Semantic Scholar

REFERENCES

GOAL: Global-local Object Alignment Learning
2025, cited by this paper
Multi-Modal Reference Learning for Fine-Grained Text-to-Image Retrieval
2025, cited by this paper
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
2025, influential reference
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
2025, cited by this paper
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution
2025, influential reference
Continual learning for cross-modal image-text retrieval based on domain-selective attention
2024, cited by this paper
PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study
2024, cited by this paper
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
2024, cited by this paper
Unified Lexical Representation for Interpretable Visual-Language Alignment
2024, cited by this paper
Discriminative Feature Enhancement Network for few-shot classification and beyond
2024, cited by this paper
CoT-based Data Augmentation Strategy for Persuasion Techniques Detection
2024, cited by this paper
LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model
2024, cited by this paper
Long-CLIP: Unlocking the Long-Text Capability of CLIP
2024, cited by this paper
Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval
2024, cited by this paper
Improving Audio Captioning Models with Fine-Grained Audio Features, Text Embedding Supervision, and LLM Mix-Up Augmentation
2023, cited by this paper
Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data
2023, cited by this paper
Improving CLIP Training with Language Rewrites
2023, cited by this paper
Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
2023, cited by this paper
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval
2023, cited by this paper
Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective
2023, cited by this paper
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
2022, influential reference
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
2022, cited by this paper
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
2021, influential reference
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
2021, cited by this paper
Learning Transferable Visual Models From Natural Language Supervision
2021, influential reference
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, cited by this paper
RoBERTa: A Robustly Optimized BERT Pretraining Approach
2019, cited by this paper
Stacked Cross Attention for Image-Text Matching
2018, influential reference
Cultural Heritage Preservation: The Past, the Present and the Future
2018, cited by this paper
Get To The Point: Summarization with Pointer-Generator Networks
2017, cited by this paper
Modeling Coverage for Neural Machine Translation
2016, cited by this paper
Microsoft COCO: Common Objects in Context
2014, cited by this paper
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
2014, cited by this paper
