Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
Bac Nguyen, Yuhta Takida, N. Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji
Published 2026 in arXiv.org
ABSTRACT
Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but it suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world (VOC, COCO) datasets, CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results position CODA as an effective framework for robust OCL in complex, real-world scenes. Code and pretrained models are available at https://github.com/sony/coda.
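To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch, assuming the standard Slot Attention formulation. The class and function names, dimensions, pooling choice, and temperature are illustrative assumptions, not the authors' implementation. Register slots join the softmax competition over input features but are discarded before decoding, and the contrastive term is a symmetric InfoNCE between pooled slots and image-level features, a common tractable lower-bound surrogate for slot-image MI.

    # Minimal sketch of the two CODA components described in the abstract.
    # Names, dimensions, and pooling/projection choices are illustrative
    # assumptions, not the authors' exact implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SlotAttentionWithRegisters(nn.Module):
        """Slot Attention where `num_registers` extra slots join the softmax
        competition over input features but are dropped from the output, so
        they can absorb residual attention instead of entangling object slots."""

        def __init__(self, num_slots=7, num_registers=2, dim=64, iters=3, eps=1e-8):
            super().__init__()
            self.num_slots, self.num_registers = num_slots, num_registers
            self.iters, self.eps, self.scale = iters, eps, dim ** -0.5
            total = num_slots + num_registers
            self.slots_mu = nn.Parameter(torch.randn(1, total, dim))
            self.slots_log_sigma = nn.Parameter(torch.zeros(1, total, dim))
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k = nn.Linear(dim, dim, bias=False)
            self.to_v = nn.Linear(dim, dim, bias=False)
            self.gru = nn.GRUCell(dim, dim)
            self.norm_inputs = nn.LayerNorm(dim)
            self.norm_slots = nn.LayerNorm(dim)

        def forward(self, inputs):                      # inputs: (B, N, dim)
            B, dim = inputs.shape[0], inputs.shape[-1]
            inputs = self.norm_inputs(inputs)
            k, v = self.to_k(inputs), self.to_v(inputs)
            slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(
                B, self.num_slots + self.num_registers, dim, device=inputs.device)
            for _ in range(self.iters):
                q = self.to_q(self.norm_slots(slots))
                # Slots (including registers) compete over input locations.
                attn = torch.softmax(
                    torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
                attn = attn + self.eps
                attn = attn / attn.sum(dim=-1, keepdim=True)  # per-slot weights
                updates = torch.einsum('bkn,bnd->bkd', attn, v)
                slots = self.gru(updates.reshape(-1, dim),
                                 slots.reshape(-1, dim)).reshape(B, -1, dim)
            return slots[:, :self.num_slots]            # registers are discarded

    def contrastive_alignment_loss(slots, image_feats, temperature=0.07):
        """Symmetric InfoNCE between pooled slots and image features across
        the batch: a tractable surrogate for slot-image mutual information."""
        z_s = F.normalize(slots.mean(dim=1), dim=-1)    # (B, dim) pooled slots
        z_i = F.normalize(image_feats, dim=-1)          # (B, dim) image features
        logits = z_s @ z_i.t() / temperature            # (B, B) similarities
        targets = torch.arange(z_s.shape[0], device=z_s.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Because the registers compete in the same softmax, residual or background attention mass has somewhere to go other than the object slots; only the object slots are kept for decoding, so the extra cost is a few additional rows in the attention matrix, consistent with the abstract's claim of negligible overhead.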
PUBLICATION RECORD
- Publication year: 2026
- Venue: arXiv.org
- Publication date: 2026-01-03
- Fields of study: Computer Science
- Source metadata: Semantic Scholar