Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Yuhao Sun, C. Cai, Jiacheng Zhang, Zesheng Ye, Xin Yuan, Feng Liu

Published 2026 in arXiv.org

ABSTRACT

Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives, view refinement and description refinement, and term the resulting method Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, supporting the need to remove redundant information in visual-text alignment.
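The two refinement steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state how redundant items are selected or which thresholds are used, so the greedy keep-first rule, the function names (`greedy_filter`, `refine_views`, `refine_descriptions`), the box format `(x1, y1, x2, y2)`, and the threshold values below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def greedy_filter(items, similarity, threshold):
    """Return indices of items kept by a greedy pass (assumed rule).

    An item is kept only if its similarity to every previously kept
    item is at most `threshold`; otherwise it is treated as redundant.
    """
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def refine_views(boxes, iou_threshold=0.5):
    """View refinement sketch: drop crops that overlap a kept crop too much.

    The 0.5 default is an illustrative assumption.
    """
    return greedy_filter(boxes, iou, iou_threshold)


def refine_descriptions(embeddings, cos_threshold=0.9):
    """Description refinement sketch: drop near-duplicate text embeddings.

    The 0.9 default is an illustrative assumption.
    """
    return greedy_filter(embeddings, cosine, cos_threshold)


# Hypothetical inputs: three image crops and three text embeddings.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
kept_views = refine_views(boxes)
kept_descriptions = refine_descriptions(embeddings)
```

In this sketch the second crop and the second embedding are near-duplicates of the first, so a greedy pass discards them and keeps the remaining, more distinct items. In practice the embeddings would come from the text encoder of the vision-language model, which is outside the scope of this example.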

PUBLICATION RECORD

Publication year
2026
Venue
arXiv.org
Publication date
2026-01-28
Fields of study
Computer Science
Identifiers
DOI: 10.48550/arXiv.2601.20419; arXiv: 2601.20419
Source metadata
Semantic Scholar
