M4-BLIP: Advancing Multi-Modal Media Manipulation Detection through Face-Enhanced Local Analysis
Han Wu, Ke Sun, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Published 2025 in arXiv.org
ABSTRACT
In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, undermining the reliability and integrity of information dissemination. Existing detection methods in this domain often overlook localized information, even though manipulations frequently occur in specific areas, particularly facial regions. Motivated by this observation, we propose the M4-BLIP framework, which uses the BLIP-2 model, noted for its ability to extract local features, as the cornerstone of feature extraction, and incorporates local facial information as prior knowledge. A dedicated alignment and fusion module within M4-BLIP integrates these local and global features to enhance detection accuracy. Furthermore, our approach integrates seamlessly with Large Language Models (LLMs), significantly improving the interpretability of detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against state-of-the-art competitors.
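The alignment-and-fusion step described in the abstract can be pictured with a minimal sketch. The snippet below is a hypothetical illustration, not the paper's released code: it assumes a cross-attention design in which global BLIP-2-style tokens attend to facial-region tokens projected into a shared embedding space. All class names, dimensions, and the residual connection are our assumptions, since the abstract does not specify the module's internals.

import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Hypothetical alignment-and-fusion module (illustrative only):
    aligns local facial features to the global feature space, then
    fuses them into the global tokens via cross-attention."""

    def __init__(self, global_dim=768, local_dim=512, num_heads=8):
        super().__init__()
        # Alignment: project local facial features to the global dimension.
        self.align = nn.Linear(local_dim, global_dim)
        # Fusion: global tokens (queries) attend to aligned local tokens.
        self.cross_attn = nn.MultiheadAttention(
            global_dim, num_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(global_dim)

    def forward(self, global_feats, local_feats):
        # global_feats: (B, N_g, global_dim), e.g. BLIP-2-style image/text tokens
        # local_feats:  (B, N_l, local_dim), e.g. features of facial crops
        local_aligned = self.align(local_feats)
        fused, _ = self.cross_attn(global_feats, local_aligned, local_aligned)
        # Residual fusion keeps the global representation intact.
        return self.norm(global_feats + fused)

# Usage sketch: fuse 4 facial-region tokens into 32 global tokens.
module = LocalGlobalFusion()
g = torch.randn(2, 32, 768)
l = torch.randn(2, 4, 512)
print(module(g, l).shape)  # torch.Size([2, 32, 768])

Cross-attention is only one plausible realization of such a module; the abstract leaves the exact design open.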
PUBLICATION RECORD
- Publication year: 2025
- Venue: arXiv.org
- Publication date: 2025-12-01
- Fields of study: Computer Science
- Source metadata: Semantic Scholar