AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations
Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li, Erdem Biyik
Published 2025 in arXiv.org
ABSTRACT
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
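The abstract describes using VLM-generated temporal saliency maps to regularize a behavior cloning policy so that its visual attention aligns with task-relevant objects. A minimal sketch of what such a saliency-regularization objective could look like is given below; the gradient-based attention computation, the mean-squared alignment term, and the weighting coefficient `lambda_sal` are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of saliency-regularized behavior cloning.
# Assumes: a policy network mapping images to actions, per-frame VLM saliency
# maps precomputed offline, and input-gradient magnitude as the policy's own
# saliency signal. None of these choices are taken from the paper's code.
import torch
import torch.nn.functional as F

def saliency_regularized_bc_loss(policy, images, expert_actions, vlm_saliency,
                                 lambda_sal=0.1):
    """images: (B, C, H, W); expert_actions: (B, A); vlm_saliency: (B, H, W) in [0, 1]."""
    images = images.clone().requires_grad_(True)

    # Standard behavior cloning term: match the demonstrated actions.
    pred_actions = policy(images)
    bc_loss = F.mse_loss(pred_actions, expert_actions)

    # Policy saliency: magnitude of the action output's gradient w.r.t. input pixels.
    grads = torch.autograd.grad(pred_actions.abs().sum(), images, create_graph=True)[0]
    policy_saliency = grads.abs().sum(dim=1)  # (B, H, W)
    policy_saliency = policy_saliency / (
        policy_saliency.amax(dim=(1, 2), keepdim=True) + 1e-8
    )

    # Regularizer: pull the policy's attention toward the VLM-identified key objects.
    sal_loss = F.mse_loss(policy_saliency, vlm_saliency)

    return bc_loss + lambda_sal * sal_loss
```

In this sketch the regularizer is differentiable through the policy because the input gradient is computed with `create_graph=True`, so minimizing the combined loss both imitates the expert and shifts attention toward the VLM-highlighted regions.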
PUBLICATION RECORD
- Publication year: 2025
- Venue: arXiv.org
- Publication date: 2025-11-23
- Fields of study: Computer Science
- Source metadata: Semantic Scholar