MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation
Haoming Wang, Qiyao Xue, Weichen Liu, Wei Gao
Published 2026 in Unknown venue
ABSTRACT
As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning over video inputs is necessary to perceive the spatial relationships among objects and guide device actions. However, existing vision-language models (VLMs) have weak spatial reasoning capabilities due to their lack of knowledge about 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances an on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation, a global semantic map, and to further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique greatly enhances the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, across reasoning tasks of diverse types and complexities.
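
The abstract describes a pipeline of fusing per-frame spatial observations into one global semantic map and then prompting the VLM with that map. The following is a minimal, hypothetical sketch of that fusion step, not the paper's actual method or API: all names here (Detection, CameraPose, build_semantic_map, render_map_prompt) are illustrative assumptions, detections with camera poses are presumed given, and a textual top-down listing stands in for the paper's visual prompt.

# Hypothetical sketch of cross-frame spatial aggregation into a global
# semantic map, in the spirit of the pipeline the abstract describes.
# All identifiers are illustrative assumptions, not the paper's API.

from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    label: str                 # object class, e.g. "mug"
    position_cam: np.ndarray   # 3D position in the camera frame (x, y, z)

@dataclass
class CameraPose:
    rotation: np.ndarray       # 3x3 world-from-camera rotation
    translation: np.ndarray    # camera origin in world coordinates

def build_semantic_map(frames: list[tuple[CameraPose, list[Detection]]]) -> dict[str, np.ndarray]:
    """Fuse per-frame detections into one world-frame semantic map.

    Each detection is lifted into the shared world frame via its camera
    pose; repeated sightings of the same label are averaged, which is one
    simple way to merge fragmented cross-frame evidence.
    """
    sums: dict[str, np.ndarray] = {}
    counts: dict[str, int] = {}
    for pose, detections in frames:
        for det in detections:
            world = pose.rotation @ det.position_cam + pose.translation
            sums[det.label] = sums.get(det.label, np.zeros(3)) + world
            counts[det.label] = counts.get(det.label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def render_map_prompt(semantic_map: dict[str, np.ndarray]) -> str:
    """Serialize the map into a prompt a VLM can condition on.

    The paper uses a visual prompt over the semantic map; this textual
    top-down listing is a deliberately simplified stand-in.
    """
    lines = ["Top-down map of observed objects (world x, y in meters):"]
    for label, pos in sorted(semantic_map.items()):
        lines.append(f"- {label}: ({pos[0]:.2f}, {pos[1]:.2f})")
    return "\n".join(lines)

if __name__ == "__main__":
    # Two frames taken from different camera positions; each sees one object.
    identity = np.eye(3)
    frames = [
        (CameraPose(identity, np.array([0.0, 0.0, 0.0])),
         [Detection("mug", np.array([0.5, 0.2, 1.0]))]),
        (CameraPose(identity, np.array([2.0, 0.0, 0.0])),
         [Detection("laptop", np.array([-0.3, 0.1, 0.8]))]),
    ]
    print(render_map_prompt(build_semantic_map(frames)))

Averaging repeated sightings is only the simplest fusion choice; the paper's iterative construction presumably refines the map as frames arrive, but the key point the sketch illustrates is that all objects end up in one shared world frame before the VLM reasons over them.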
PUBLICATION RECORD
- Publication year: 2026
- Publication date: 2026-02-06
- Venue: Unknown venue
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar