Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models
Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Chen Zhu, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Published 2025 in Unknown venue
ABSTRACT
Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis of Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that applies stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
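The abstract's core mechanism, sampling a stochastic sub-network by randomly dropping blocks and steering the full model's prediction away from the sub-network's (assumed lower-quality) prediction, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the block functions, `drop_prob`, `s2_scale`, and the exact guidance combination rule are all assumptions.

```python
import random

def make_blocks():
    """Toy stand-in for a diffusion backbone: a list of residual
    blocks, each adding a fixed increment to the hidden state."""
    return [lambda h, k=k: h + 0.1 * (k + 1) for k in range(6)]

def forward(blocks, x, drop_prob=0.0, rng=None):
    """Run the blocks sequentially. With drop_prob > 0, each block is
    independently skipped (identity), yielding a stochastic sub-network."""
    rng = rng or random
    h = x
    for blk in blocks:
        if drop_prob > 0 and rng.random() < drop_prob:
            continue  # dropped block: pass the hidden state through
        h = blk(h)
    return h

def s2_guidance_step(blocks, x, s2_scale=1.0, drop_prob=0.5, rng=None):
    """One hypothetical guidance step: extrapolate the full model's
    prediction away from a stochastic sub-network's prediction."""
    eps_full = forward(blocks, x)
    eps_sub = forward(blocks, x, drop_prob=drop_prob, rng=rng)
    return eps_full + s2_scale * (eps_full - eps_sub)

blocks = make_blocks()
eps = s2_guidance_step(blocks, x=1.0, rng=random.Random(0))
print(round(eps, 6))  # → 4.4 with this seed and these toy blocks
```

In a real sampler this combination would typically be layered on top of the usual CFG update of conditional and unconditional noise predictions; here a single deterministic "full" pass keeps the sketch self-contained.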
PUBLICATION RECORD
- Publication year: 2025
- Publication date: 2025-08-18
- Venue: Unknown venue
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
REFERENCES
84 references (list not included in this record)
CITED BY
19 citing papers (list not included in this record)