Data Augmentation for Instruction Following Policies via Trajectory Segmentation

Published 2025 in AAAI Conference on Artificial Intelligence

ABSTRACT

The scalability of instructable agents in robotics or gaming is often hindered by limited data that pairs instructions with agent trajectories. However, large datasets of unannotated trajectories containing sequences of varied agent behaviour (play trajectories) are often available. In a semi-supervised setup, we explore methods to extract labelled segments from play trajectories. The goal is to augment a small annotated dataset of instruction-trajectory pairs to improve the performance of an instruction-following policy trained downstream via imitation learning. Assuming little variation in segment length, recent video segmentation methods can effectively extract labelled segments. To remove this constraint on segment length, we propose Play Segmentation (PS), a probabilistic model that finds maximum-likelihood segmentations of extended subsegments while being trained only on individual instruction segments. Our results in a game environment and a simulated robotic gripper setting underscore the importance of segmentation: randomly sampled segments diminish performance, while incorporating labelled segments from PS improves policy performance to the level of a policy trained on twice the amount of labelled data.
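
The idea of finding a maximum-likelihood segmentation of a long trajectory from a model that only scores individual segments can be illustrated with a generic dynamic program. The sketch below is not the paper's Play Segmentation model. It assumes a hypothetical `segment_score(start, end)` function standing in for a learned per-segment log-likelihood, and hypothetical `min_len` and `max_len` limits on segment length; the toy scorer and the example trajectory are made up purely for illustration.

```python
from typing import Callable, List, Tuple


def segment_max_likelihood(
    n_steps: int,
    segment_score: Callable[[int, int], float],
    min_len: int = 1,
    max_len: int = 10,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split range(n_steps) into contiguous segments [start, end)
    that maximise the sum of per-segment scores (generic DP sketch)."""
    neg_inf = float("-inf")
    best = [neg_inf] * (n_steps + 1)  # best[t]: best total score for steps [0, t)
    back = [-1] * (n_steps + 1)       # back[t]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n_steps + 1):
        for length in range(min_len, min(max_len, end) + 1):
            start = end - length
            if best[start] == neg_inf:
                continue  # prefix [0, start) cannot be segmented
            candidate = best[start] + segment_score(start, end)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    if best[n_steps] == neg_inf:
        raise ValueError("no valid segmentation under the length limits")
    # Follow the back-pointers from the end to recover the segments.
    segments: List[Tuple[int, int]] = []
    end = n_steps
    while end > 0:
        start = back[end]
        segments.append((start, end))
        end = start
    segments.reverse()
    return best[n_steps], segments


# Toy stand-in for a learned segment model (an assumption, not from the paper):
# a segment scores well only if every step in it shows the same behaviour.
play = list("aaabbbbcc")


def toy_score(start: int, end: int) -> float:
    chunk = play[start:end]
    return float(len(chunk) - 1) if len(set(chunk)) == 1 else -100.0


total, segments = segment_max_likelihood(len(play), toy_score)
print(segments)  # [(0, 3), (3, 7), (7, 9)]
```

With a learned scorer in place of `toy_score`, each recovered segment could be paired with its most likely instruction and added to the annotated dataset, which is the augmentation setting the abstract describes.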

PUBLICATION RECORD

Publication year: 2025
Venue: AAAI Conference on Artificial Intelligence
Publication date: 2025-02-25
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2503.01871; arXiv 2503.01871
