DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation

Bowen Yin, Jiao-Long Cao, Ming-Ming Cheng, Qibin Hou

Published 2025 in Computer Vision and Pattern Recognition

ABSTRACT

Recent advances in scene understanding benefit substantially from depth maps, which supply 3D geometry information, especially under challenging conditions (e.g., low light and overexposure). Existing approaches encode depth maps alongside RGB images and fuse the two feature streams to make predictions more robust. Since depth can be regarded as a geometric supplement to RGB images, a straightforward question arises: do we really need to explicitly encode depth information with neural networks, as is done for RGB images? Motivated by this question, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract geometry cues from the depth and spatial distances among all image patch tokens, and to use them as geometry priors that allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance on various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP-RGBD/DFormer.
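
The idea of letting depth and spatial distances steer attention weights can be illustrated with a small sketch. The abstract does not specify the formulation, so everything below is an assumption made for illustration: the pairwise prior is a simple blend of depth gap and Manhattan grid distance, the prior is applied as a decay factor beta ** distance on top of ordinary softmax attention, and the weights are renormalised afterwards. The names geometry_prior, geometry_attention, mix, and beta are hypothetical and are not taken from the DFormerv2 code base.

```python
import math

def geometry_prior(depth, coords, mix=0.5):
    """Pairwise distance between tokens: a blend of depth gap and
    Manhattan grid distance (the blend weight `mix` is an assumption)."""
    n = len(depth)
    prior = []
    for i in range(n):
        row = []
        for j in range(n):
            depth_gap = abs(depth[i] - depth[j])
            grid_gap = (abs(coords[i][0] - coords[j][0])
                        + abs(coords[i][1] - coords[j][1]))
            row.append(mix * depth_gap + (1.0 - mix) * grid_gap)
        prior.append(row)
    return prior

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def geometry_attention(q, k, v, prior, beta=0.9):
    """Scaled dot-product attention whose weights are damped by
    beta ** prior and then renormalised (a sketch, not the paper's code)."""
    dim = len(q[0])
    outputs, all_weights = [], []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        attn = softmax(scores)
        # Tokens that are far away in depth or on the grid get smaller weights.
        decayed = [w * beta ** prior[i][j] for j, w in enumerate(attn)]
        total = sum(decayed)
        weights = [w / total for w in decayed]
        all_weights.append(weights)
        outputs.append([sum(w * vj[c] for w, vj in zip(weights, v))
                        for c in range(len(v[0]))])
    return outputs, all_weights

# Toy 2x2 patch grid: the top row is near the camera, the bottom row is far.
depth = [1.0, 1.0, 5.0, 5.0]
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Identical queries and keys make plain softmax uniform, so any difference
# in the final weights comes from the geometry prior alone.
q = k = [[0.0, 0.0] for _ in range(4)]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

prior = geometry_prior(depth, coords)
outputs, weights = geometry_attention(q, k, v, prior)
```

In this toy setup the first token attends most strongly to itself, then to its same-depth neighbour, and least to the tokens at a different depth, which is the qualitative behaviour a geometry prior is meant to produce. No separate depth encoder appears anywhere in the sketch; depth enters only through the prior.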

PUBLICATION RECORD

Publication year
2025
Venue
Computer Vision and Pattern Recognition
Publication date
2025-04-07
Fields of study
Computer Science
Identifiers
DOI 10.1109/CVPR52734.2025.01802
arXiv 2504.04701
Source metadata
Semantic Scholar

CITED BY

Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks
2026, cites this paper
Hierarchical knowledge transfer-based model for missing structural response reconstruction of offshore jacket platform
2026, cites this paper
Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors
2026, cites this paper
DGA-Net: Enhancing SAM with Depth Prompting and Graph-Anchor Guidance for Camouflaged Object Detection
2026, cites this paper
Relational Structure-Aware Mamba Network for Semantic Segmentation of Remote Sensing Images
2026, cites this paper
KeyGeoFusion: A multi-modal keypoint and geometry-aware framework for small and distant 3D object detection in sparse point clouds
2026, cites this paper
Contrast-Driven Multi-Modal Fusion for Autonomous Lunar Rover Perception: Efficient Obstacle Segmentation
2026, cites this paper
A Multi-Modal Image Fusion Network with Learnable Color Restoration and Semantic Guidance: Towards Real-Time Robot Perception and Scene Parsing
2026, cites this paper
GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning
2026, cites this paper
Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
2026, cites this paper
VSFusion: Video-Sensor Multimodal Fusion with Gated Geometry Network for Freezing of Gait Detection in Parkinson's Disease
2025, cites this paper
Multimodal learning on RGB-D image for precise litchi phenotyping and weight estimation
2025, cites this paper
Depth Anything at Any Condition
2025, cites this paper
Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection
2025, cites this paper
Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
2025, cites this paper
Unleashing Semantic and Geometric Priors for 3D Scene Completion
2025, cites this paper
TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
2025, cites this paper
Adaptive sparse contrastive learning for unsupervised object re-identification
2025, cites this paper
AdaptRGB-t: Adaptive RGB-t semantic segmentation via efficient parameter-tuning with textual guidance
2025, cites this paper
DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation
2025, influential citation
GeoUDA: Geometry-Aware Unsupervised Domain Adaptation for Urban-Building Segmentation Under Adverse Nighttime Conditions
2025, cites this paper
VoxDepth: Rectification of Depth Images on Edge Devices
2024, cites this paper
IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
2024, cites this paper