Cross-Modality Distillation for Multi-Modal Tracking

Tianlu Zhang, Qiang Zhang, K. Debattista, Jungong Han

Published 2025 in IEEE Transactions on Pattern Analysis and Machine Intelligence

ABSTRACT

Contemporary multi-modal trackers achieve strong performance by leveraging complex backbones and fusion strategies, but this comes at the cost of computational efficiency, limiting their deployment in resource-constrained settings. Compact multi-modal trackers, on the other hand, are more efficient but often suffer from reduced performance due to limited feature representation. To mitigate the performance gap between compact and more complex trackers, we introduce a cross-modality distillation framework. This framework includes a complementarity-aware mask autoencoder designed to enhance cross-modal interactions by selectively masking patches within a modality, thereby forcing the model to learn more robust multi-modal representations. Additionally, we present a specific-common feature distillation module that transfers both modality-specific and shared information from a more powerful model's backbone to the compact model. Moreover, we develop a multi-path selection distillation module to guide a simple fusion module in learning more accurate multi-modal information from a sophisticated fusion mechanism via multiple paths. Extensive experiments on six multi-modal tracking benchmarks demonstrate that the proposed tracker, despite being lightweight, outperforms most state-of-the-art methods, highlighting its effectiveness. Notably, our tiny variant achieves PR scores of 67.5% on LasHeR, 58.5% on DepthTrack, and 73.1% on VisEvent with only 6.5 M parameters, while operating at 126 FPS on an NVIDIA 2080Ti GPU.
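The two core ideas in the abstract, masking patches within a modality and distilling a teacher's features into a compact student, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names are hypothetical, the masking here is random (whereas the paper's complementarity-aware autoencoder selects patches using cross-modal cues), and the distillation is reduced to a plain MSE feature-matching loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.5, rng=rng):
    """Zero out a fraction of patch tokens in one modality.
    Random selection is a stand-in for the paper's complementarity-aware
    masking, which is guided by the other modality."""
    n = patches.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = patches.copy()
    masked[idx] = 0.0
    return masked, idx

def distill_loss(student_feat, teacher_feat):
    """MSE feature distillation: pull the compact student's features
    toward the frozen teacher's features."""
    return float(np.mean((student_feat - teacher_feat) ** 2))

# Toy example: 16 patch tokens of dimension 8 for an RGB modality.
rgb_teacher = rng.standard_normal((16, 8))
rgb_masked, masked_idx = mask_patches(rgb_teacher, mask_ratio=0.5)

# A student reconstructing from masked input incurs a nonzero loss;
# perfectly matching the teacher drives the loss to zero.
loss_masked = distill_loss(rgb_masked, rgb_teacher)
loss_perfect = distill_loss(rgb_teacher, rgb_teacher)
```

In the actual framework, the distillation target combines modality-specific and shared (common) components of the teacher's backbone features, and the fusion-level transfer additionally selects among multiple paths of the teacher's fusion mechanism.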
