CLIP4IDC: CLIP for Image Difference Captioning

Published 2022 in AACL

ABSTRACT

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the pre-training datasets used to train such a visual encoder and the data of the downstream IDC task, and (2) the visual feature extractor, when encoding the two images separately, often does not effectively encode the visual changes between them. Motivated by the excellent zero-shot performance of the recently proposed CLIP, we propose CLIP4IDC, which transfers a CLIP model to the IDC task to address those issues. Rather than directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process to adapt CLIP’s visual encoder to capture and align differences in image pairs based on the textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.
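
The abstract does not spell out the objective used in the adaptation stage. As a rough illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss that pulls the embedding of an image pair toward the embedding of its difference caption and pushes it away from the other captions in the batch. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(pair_embs, text_embs, temperature=0.07):
    """Assumed objective: pair_embs[i] and text_embs[i] describe the same change."""
    pairs = [l2_normalize(p) for p in pair_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(pairs)
    # Cosine-similarity logits: rows index image pairs, columns index captions.
    logits = [
        [sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
        for p in pairs
    ]
    # Pair-to-text direction: the matching caption sits on the diagonal.
    pair_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-pair direction: the same computation on the transposed matrix.
    columns = [list(c) for c in zip(*logits)]
    text_to_pair = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (pair_to_text + text_to_pair)

# Toy embeddings (hypothetical): two image pairs and their two captions.
pair_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(pair_embs, pair_embs)
shuffled_loss = symmetric_contrastive_loss(pair_embs, pair_embs[::-1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each image-pair embedding matches its own caption embedding than when the captions are swapped, which is the behaviour an alignment objective of this kind is meant to reward.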

PUBLICATION RECORD

Publication year
2022
Venue
AACL
Publication date
2022-06-01
Fields of study
Computer Science
Identifiers
DOI: 10.48550/arXiv.2206.00629; arXiv: 2206.00629
Source metadata
Semantic Scholar

REFERENCES

Learning by Imagination: A Joint Framework for Text-Based Image Manipulation and Change Captioning
2023, influential reference
Bidirectional difference locating and semantic consistency reasoning for change captioning
2022, influential reference
Image Difference Captioning With Instance-Level Fine-Grained Feature Representation
2022, influential reference
Image Difference Captioning with Pre-training and Contrastive Learning
2022, influential reference
Learning Transferable Visual Models From Natural Language Supervision
2021, cited by this paper
Medical Image Captioning Model to Convey More Details: Methodological Comparison of Feature Difference Generation
2021, cited by this paper
CLIP4Caption: CLIP for Video Caption
2021, cited by this paper
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
2021, cited by this paper
Viewpoint-Agnostic Change Captioning with Cycle Consistency
2021, influential reference
Image Change Captioning by Learning from an Auxiliary Task
2021, influential reference
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
2021, cited by this paper
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
2021, cited by this paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, influential reference
Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning
2020, cited by this paper
Neural Naturalist: Generating Fine-Grained Image Comparisons
2019, cited by this paper
Robust Change Captioning
2019, influential reference
Expressing Visual Relationships via Language
2019, cited by this paper
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
2019, cited by this paper
VisualBERT: A Simple and Performant Baseline for Vision and Language
2019, cited by this paper
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
2019, cited by this paper
Unified Vision-Language Pre-Training for Image Captioning and VQA
2019, cited by this paper
Learning to Describe Differences Between Pairs of Similar Images
2018, cited by this paper
Attention is All you Need
2017, influential reference
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015, cited by this paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015, cited by this paper
Deep Residual Learning for Image Recognition
2015, cited by this paper
Deep Domain Confusion: Maximizing for Domain Invariance
2014, cited by this paper
Show and tell: A neural image caption generator
2014, cited by this paper
CIDEr: Consensus-based image description evaluation
2014, cited by this paper
A Review and Comparison of Measures for Automatic Video Surveillance Systems
2008, cited by this paper
METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments
2007, cited by this paper
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
2005, cited by this paper
ROUGE: A Package for Automatic Evaluation of Summaries
2004, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, cited by this paper

CITED BY

Vision-Language Agents for Interactive Forest Change Analysis
2026, cites this paper
Dynamic Channel Decoupling and Cross-Modal Contrastive Learning for Change Captioning
2025, cites this paper
Find and Perceive: Tell Visual Change with Fine-Grained Comparison
2025, cites this paper
OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
2025, cites this paper
Scalable Remote Sensing Image Change Captioning using In-Context Learning
2025, cites this paper
Image Difference Grounding with Natural Language
2025, cites this paper
PanoSCU: A Simulation-Based Dataset for Panoramic Indoor Scene Understanding
2025, cites this paper
Change Entity-guided Heterogeneous Representation Disentangling for Change Captioning
2025, influential citation
Cross-Temporal Remote Sensing Image Change Captioning: A Manifold Mapping and Bayesian Diffusion Approach for Land Use Monitoring
2025, cites this paper
A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning
2025, cites this paper
Hybrid Visual Adapter and Drop-View Training for Change Captioning
2025, cites this paper
Prompt-based Weakly-supervised Vision-language Pre-training
2025, cites this paper
Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning
2024, influential citation
GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers
2024, influential citation
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
2024, cites this paper
Indoor Scene Change Understanding (SCU): Segment, Describe, and Revert Any Change
2024, cites this paper
DTC: Difference-aware Transformer with CLIP Adaptation for Change Captioning
2024, influential citation
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
2024, influential citation
Diffusion-Based Multimodal Video Captioning
2024, cites this paper
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends, and Metrics Analysis
2024, cites this paper
Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations
2024, cites this paper
VIXEN: Visual Text Comparison Network for Image Difference Captioning
2024, influential citation
EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning
2024, cites this paper
Context-aware Difference Distilling for Multi-change Captioning
2024, cites this paper
Towards a multimodal framework for remote sensing image change retrieval and captioning
2024, cites this paper
OneDiff: A Generalist Model for Image Difference Captioning
2024, cites this paper
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
2024, cites this paper
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
2024, cites this paper
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method
2024, cites this paper
TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing
2024, cites this paper
Finetuning CLIP to Reason about Pairwise Differences
2024, cites this paper
The STVchrono Dataset: Towards Continuous Change Recognition in Time
2024, cites this paper
Changes to Captions: An Attentive Network for Remote Sensing Change Captioning
2023, cites this paper
Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
2023, cites this paper
PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
2023, cites this paper
Communication breakdown: On the low mutual intelligibility between human and neural captioning
2022, cites this paper
ImProvShow: Multimodal Fusion for Image Provenance Summarization
year unknown, cites this paper
DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes
year unknown, cites this paper
CLIP to Reason About Pairwise
year unknown, cites this paper
VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos
year unknown, cites this paper
: Improving Zero-Shot Object-Level Change Detection
year unknown, cites this paper