3D HAMSTER: Bridging Robot VLA Planning and Control with 3D Trajectory Guidance

A new paper posted on arXiv introduces 3D HAMSTER, a framework designed to bridge the gap between high-level planning and low-level control in hierarchical Vision-Language-Action (VLA) models for robot manipulation.

Current state-of-the-art approaches in this paradigm use a Vision-Language Model (VLM) to predict 2D end-effector trajectories as explicit guidance for a downstream policy. However, low-level control policies typically operate in 3D metric space on point clouds, and feeding them 2D guidance that lacks depth creates a dimensionality mismatch that degrades performance.
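To make the mismatch concrete, here is a minimal sketch of standard pinhole back-projection. It is not the paper's method. The camera intrinsics, the pixel waypoint, and the function name `backproject` are all hypothetical values chosen for illustration. It shows that one 2D waypoint is consistent with many 3D positions until a depth is supplied.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole back-projection: pixel (u, v) at a given depth -> camera-frame point in metres."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 camera (assumed, not from the paper).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A single 2D waypoint, as a 2D trajectory planner might emit it.
pixel = (380.0, 300.0)

# The same pixel at two candidate depths gives two different 3D targets.
near = backproject(*pixel, 0.5, FX, FY, CX, CY)
far = backproject(*pixel, 1.0, FX, FY, CX, CY)

# Distance between the two candidates: the ambiguity a 2D-only waypoint leaves open.
gap_m = math.dist(near, far)
print(near, far, round(gap_m, 3))
```

With these assumed numbers the two candidate targets lie roughly half a metre apart, an ambiguity that a policy operating on point clouds would have to resolve on its own if the guidance carried no depth.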


3D HAMSTER's core innovation is lifting trajectory guidance from 2D to 3D space. By providing depth-aware trajectories, the framework narrows the gap between high-level visual planning and low-level physical control, which the authors say enables stronger generalization to objects and environments not seen during training.

The research sits within the broader hierarchical VLA paradigm, which decouples high-level task planning from low-level motor control to improve generalization in robot manipulation. 3D HAMSTER's contribution to this decoupled architecture is to align the dimensionality of the planner's output with the 3D space in which the controller operates.

This work has practical significance for real-world robot deployment, where robots must handle objects and scenarios never encountered during training. Moving from 2D to 3D guidance could reduce failures caused by missing depth information in the planning-to-control pipeline.

What to watch next: whether the framework scales to more complex multi-step manipulation tasks and whether it gets adopted into mainstream robot learning platforms or open-source frameworks.