Laminating Representation Autoencoders for Efficient Diffusion

Ramón Calvo-González, François Fleuret

Published 2026 in Unknown venue

ABSTRACT

Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These results are preliminary, and the work is in progress.
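
The compression figures in the abstract can be checked with a little arithmetic. The sketch below takes the three numbers the abstract states (32 latent tokens, 8x sequence reduction, 48x total compression) and derives the shapes they imply. The 768-dimensional patch feature width is an assumption (it matches a ViT-B-sized DINOv2 encoder) and is not given in the abstract, so the resulting per-token latent width is an inference rather than a reported value. The function name implied_shapes is hypothetical.

```python
from fractions import Fraction

# Figures stated in the abstract.
LATENT_TOKENS = 32        # length of the 1D latent sequence
SEQ_REDUCTION = 8         # "8x reduction in sequence length"
TOTAL_COMPRESSION = 48    # "48x compression in total dimensionality"

# ASSUMPTION (not in the abstract): per-patch feature width of the encoder.
# 768 corresponds to a ViT-B-sized DINOv2 model.
ASSUMED_PATCH_DIM = 768


def implied_shapes(latent_tokens, seq_reduction, total_compression, patch_dim):
    """Derive the source and latent sizes implied by the stated ratios."""
    patch_tokens = latent_tokens * seq_reduction          # tokens before compression
    source_dims = patch_tokens * patch_dim                # total source dimensionality
    latent_dims = Fraction(source_dims, total_compression)  # total latent dimensionality
    per_token_dim = latent_dims / latent_tokens           # width of each latent token
    return {
        "patch_tokens": patch_tokens,
        "source_dims": source_dims,
        "latent_dims": latent_dims,
        "per_token_dim": per_token_dim,
    }


shapes = implied_shapes(
    LATENT_TOKENS, SEQ_REDUCTION, TOTAL_COMPRESSION, ASSUMED_PATCH_DIM
)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Under the assumed feature width, the stated ratios imply 256 source patch tokens (consistent with a 16x16 grid) and 32 latent tokens of width 128. With a different encoder width, the per-token latent width would scale proportionally.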

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-02-04
Fields of study: Computer Science
Identifiers: arXiv 2602.04873

