VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models

Manav Kulshrestha, S. T. Bukhari, Damon Conover, Aniket Bera

Published 2025 in arXiv.org

ABSTRACT

Robotic grasping is a fundamental capability for autonomous manipulation; however, most existing methods rely on large-scale expert annotations and necessitate retraining to handle new objects. We present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting grasps. From a single RGB-D image, our method (1) prompts a large vision-language model to generate a goal image where a straight rod "impales" the object, representing an antipodal grasp, (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal component analysis and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not rely on curated grasp datasets. Despite this, VLAD-Grasp achieves performance that is competitive with or superior to that of state-of-the-art supervised models on the Cornell and Jacquard datasets. We further demonstrate zero-shot generalization to novel real-world objects on a Franka Research 3 robot, highlighting vision-language foundation models as powerful priors for robotic manipulation.
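
Steps (2) and (3) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pinhole intrinsics (fx, fy, cx, cy), and the use of a symmetric Chamfer distance as the correspondence-free cost are all assumptions made for illustration. The sketch lifts a masked depth image into a point cloud, then performs a coarse PCA alignment between two clouds, using the Chamfer cost to resolve the sign ambiguity of the principal axes.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to 3D camera-frame points (pinhole model).
    The intrinsics fx, fy, cx, cy are hypothetical inputs."""
    v, u = np.nonzero(mask)            # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def pca_frame(points):
    """Return the centroid and a right-handed set of principal axes
    (columns ordered by descending variance)."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back ascending
    axes = eigvecs[:, ::-1].copy()
    if np.linalg.det(axes) < 0:        # enforce a proper rotation
        axes[:, 2] *= -1
    return centroid, axes

def chamfer(a, b):
    """Symmetric Chamfer distance: needs no point correspondences.
    Brute force O(N*M); assumed here as the alignment cost."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def pca_align(source, target):
    """Coarse rigid alignment of source onto target.
    PCA axes are defined only up to sign, so try the four sign
    combinations that keep det(R) = +1 and keep the lowest cost."""
    cs, axes_s = pca_frame(source)
    ct, axes_t = pca_frame(target)
    best = None
    for flips in [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]:
        R = axes_t @ np.diag(flips) @ axes_s.T
        t = ct - R @ cs
        cost = chamfer(source @ R.T + t, target)
        if best is None or cost < best[0]:
            best = (cost, R, t)
    return best                        # (cost, R, t)
```

In a full pipeline, a coarse result like this would typically seed a finer optimization of the same cost; for large clouds a nearest-neighbor structure such as a k-d tree would replace the brute-force distance matrix.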

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-11-08
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.48550/arXiv.2511.05791; arXiv 2511.05791
Source metadata: Semantic Scholar


REFERENCES

A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges
2025, cited by this paper
Variational Shape Inference for Grasp Diffusion on $\mathrm{SE(3)}$
2025, cited by this paper
Multimodal Human-Intent Modeling for Contextual Robot-to-Human Handovers of Arbitrary Objects
2025, cited by this paper
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
2025, cited by this paper
Language-driven Grasp Detection
2024, influential reference
Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts
2024, cited by this paper
Chameleon: Mixed-Modal Early-Fusion Foundation Models
2024, cited by this paper
A residual reinforcement learning method for robotic assembly using visual and force information
2024, cited by this paper
Jacquard V2: Refining Datasets using the Human In the Loop Data Correction Method
2024, cited by this paper
GPT-4o System Card
2024, cited by this paper
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
2024, cited by this paper
GraspSAM: When Segment Anything Model Meets Grasp Detection
2024, cited by this paper
Structural Concept Learning via Graph Attention for Multi-Level Rearrangement Planning
2023, cited by this paper
A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter
2023, cited by this paper
Language-Driven Representation Learning for Robotics
2023, cited by this paper
LLaMA: Open and Efficient Foundation Language Models
2023, cited by this paper
GPT-4 Technical Report
2023, cited by this paper
Segment Anything
2023, cited by this paper
Grasp-Anything: Large-scale Grasp Dataset from Foundation Models
2023, cited by this paper
SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs
2023, cited by this paper
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
2023, cited by this paper
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
2023, cited by this paper
Foundation models in robotics: Applications, challenges, and the future
2023, cited by this paper
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
2022, cited by this paper
CoGrasp: 6-DoF Grasp Generation for Human-Robot Collaboration
2022, cited by this paper
DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
2022, cited by this paper
Deep Learning Approaches to Grasp Synthesis: A Review
2022, cited by this paper
Flamingo: a Visual Language Model for Few-Shot Learning
2022, cited by this paper
Robotic Waste Sorting Technology: Toward a Vision-Based Categorization System for the Industrial Robotic Separation of Recyclable Waste
2021, cited by this paper
High-Resolution Image Synthesis with Latent Diffusion Models
2021, cited by this paper
CLIPort: What and Where Pathways for Robotic Manipulation
2021, cited by this paper
End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB
2021, cited by this paper
Learning Transferable Visual Models From Natural Language Supervision
2021, cited by this paper
TEASER: Fast and Certifiable Point Cloud Registration
2020, cited by this paper
ACRONYM: A Large-Scale Grasp Dataset Based on Simulation
2020, cited by this paper
6-DOF GraspNet: Variational Grasp Generation for Object Manipulation
2019, cited by this paper
Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network
2019, cited by this paper
Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach
2018, cited by this paper
Jacquard: A Large Scale Dataset for Robotic Grasp Detection
2018, influential reference
Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics
2017, cited by this paper
Efficient grasping from RGBD images: Learning using a new rectangle representation
2011, influential reference
Efficient variants of the ICP algorithm
2001, cited by this paper
Robotic grasping and contact: a review
2000, cited by this paper
A Mathematical Introduction to Robotic Manipulation
1994, cited by this paper
Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching
1977, cited by this paper
