RoDiF: Robust Direct Fine-Tuning of Diffusion Policies with Corrupted Human Feedback

Published 2026 in Unknown venue

ABSTRACT

Diffusion policies are a powerful paradigm for robotic control, but fine-tuning them with human preferences is fundamentally challenged by the multi-step structure of the denoising process. To overcome this, we introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with environmental dynamics, enabling reward-free Direct Preference Optimization (DPO) for diffusion policies. Building on this formulation, we propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly addresses corrupted human preferences. RoDiF reinterprets the DPO objective through a geometric hypothesis-cutting perspective and employs a conservative cutting strategy to achieve robustness without assuming any specific noise distribution. Extensive experiments on long-horizon manipulation tasks show that RoDiF consistently outperforms state-of-the-art baselines, effectively steering pretrained diffusion policies of diverse architectures to human-preferred modes, while maintaining strong performance even under 30% corrupted preference labels.
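For orientation, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not RoDiF's robust objective, which this record does not spell out. It assumes that, under a formulation that chains denoising steps with environment steps, a trajectory's log-probability is the sum of its per-step log-probabilities. The function name `dpo_loss`, its arguments, and the default `beta` are hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is a list of per-step log-probabilities (assumed here to
    cover every denoising step of every action in the trajectory):
      logp_w / logp_l         -- fine-tuned policy, preferred / rejected
      ref_logp_w / ref_logp_l -- frozen reference policy, same trajectories
    """
    # Implicit reward margin: log-ratio of the preferred trajectory minus
    # the log-ratio of the rejected trajectory.
    margin = (sum(logp_w) - sum(ref_logp_w)) - (sum(logp_l) - sum(ref_logp_l))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the fine-tuned policy equals the reference, the margin is zero and the loss is log 2. Swapping the preferred and rejected arguments, as a flipped label would, negates the margin, so the same pair then rewards moving toward the trajectory the human actually rejected. That is the failure mode a noise-robust variant has to guard against.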

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-01-31
Fields of study: Computer Science, Engineering
Identifiers: arXiv 2602.00886
External record: Semantic Scholar
Source metadata: Semantic Scholar
