Large-scale real-world robot data collection is a prerequisite for bringing robots into everyday deployment. However, existing pipelines often rely on specialized handheld devices, which increase operator burden, limit scalability, and fail to capture the naturally coordinated perception-manipulation behavior of daily human interaction.
ActiveGlasses introduces a head-mounted system for learning robot manipulation from ego-centric human demonstrations with active vision. A stereo camera mounted on smart glasses is used as the only perception device during both collection and inference: the human wears it during bare-hand demonstrations, and the same setup is mounted on a 6-DoF perception arm at deployment time.
To bridge the embodiment gap, the method extracts object trajectories and trains an object-centric point-cloud policy that jointly predicts manipulation and viewpoint motion. Across tasks involving occlusion and high-precision interaction, the system outperforms strong baselines under the same hardware setup and transfers across two robot platforms.
To capture how humans naturally act, robot data collection must align with human behavior along two fundamental dimensions: manipulation and perception. In ActiveGlasses, we propose:
- Existing systems often require teleoperation or handheld devices. ActiveGlasses instead keeps data collection close to natural human behavior.
- Human head motion is treated as a useful perceptual signal rather than as nuisance noise, enabling viewpoint adjustment during deployment.
- The policy predicts object trajectories in task space, which enables zero-shot deployment across different robot platforms.
The system is organized as a complete perception-manipulation pipeline. During data collection, a human operator performs tasks with bare hands while wearing smart glasses equipped with stereo vision and 6-DoF head tracking.
During post-processing, the system estimates depth, reconstructs point clouds, segments hands and manipulated objects, and extracts object trajectories in a unified world frame. The learned 3D policy then predicts both manipulation actions and head movement so that the robot can reproduce active perception at test time.
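The core of this post-processing step is expressing every stereo reconstruction in one shared world frame using the tracked head pose. A minimal sketch of that transform, assuming homogeneous 4x4 poses (function and variable names are illustrative, not the paper's actual code):

```python
import numpy as np

def camera_to_world(points_cam, head_pose_world):
    """Express camera-frame points in the shared world frame.

    points_cam:       (N, 3) points reconstructed from the stereo depth map,
                      in the glasses' camera frame.
    head_pose_world:  (4, 4) homogeneous pose of the camera in the world
                      frame, taken from the glasses' 6-DoF head tracking.
    """
    n = points_cam.shape[0]
    homo = np.hstack([points_cam, np.ones((n, 1))])   # (N, 4) homogeneous
    return (head_pose_world @ homo.T).T[:, :3]

# Toy check: a point 1 m in front of a camera that sits 0.5 m along world x
# lands at (0.5, 0, 1) in the world frame.
pose = np.eye(4)
pose[0, 3] = 0.5
pts = np.array([[0.0, 0.0, 1.0]])
world_pts = camera_to_world(pts, pose)
```

Because every frame is mapped through its own head pose, head motion during the demonstration does not corrupt the extracted object trajectories.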
ActiveGlasses combines XREAL Air 2 Ultra for head tracking with a ZED Mini stereo camera for perception. A lightweight on-device interface supports episode control through gesture and audio feedback.
We use the XREAL device specifically for stable 6-DoF trajectory tracking, and use that head motion signal as supervision for active-viewpoint behavior.
The policy uses point clouds in the world frame as input and predicts absolute object trajectories together with relative head motion. This representation stabilizes active-view observations and improves cross-embodiment transfer.
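Since head motion is predicted relatively, deployment has to chain the predicted deltas onto the current head pose to obtain absolute viewpoint targets. A minimal sketch under that assumption (names are illustrative, not the paper's code):

```python
import numpy as np

def integrate_head_deltas(T_head_world, deltas):
    """Chain relative head-motion predictions into absolute world poses.

    T_head_world: (4, 4) current head/camera pose in the world frame.
    deltas:       sequence of (4, 4) relative transforms predicted by the
                  policy, each expressed in the previous head frame.
    """
    poses = []
    T = T_head_world.copy()
    for dT in deltas:
        T = T @ dT          # right-multiply: each delta lives in the head frame
        poses.append(T.copy())
    return poses

# Two identical 10 cm forward steps accumulate to 20 cm along the head z-axis.
step = np.eye(4)
step[2, 3] = 0.1
out = integrate_head_deltas(np.eye(4), [step, step])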
The dashed box and the absolute/relative labels mark the ablation space studied in the paper: we explicitly compare conditioning on the current object pose and absolute-versus-relative trajectory representations.
The evaluation focuses on three real-world tasks that require purposeful camera movement. In all cases, active viewpoint adjustment is necessary before the manipulation arm can reliably complete the task.
Third-person views of book placement rollouts on the two robot platforms used in the cross-embodiment evaluation.
The paper compares ActiveGlasses against a variant without active vision and a pi0.5 baseline. Under the same single active-camera setting, ActiveGlasses consistently achieves higher final-stage success rates on all three tasks.
Comparison under the same single active-camera setup (successes out of 20 trials).
| Method | Book Placement | Bread Insertion | Occluded Pouring |
|---|---|---|---|
| ActiveGlasses | 14/20 | 11/20 | 10/20 |
| w/o active vision | 7/20 | 0/20 | 4/20 |
| pi0.5 baseline | 7/20 | 6/20 | 4/20 |
On book placement, the paper reports that absolute object trajectory prediction without current-object-pose conditioning works best.
The paper studies both representation choices and deployment transfer, showing that the policy is not only effective but also portable across platforms.
The paper finds that predicting absolute object trajectories while avoiding additional conditioning on the current object pose works better than relative alternatives. This reduces overfitting to dominant motion patterns and encourages the model to rely on scene observation.
Because the policy predicts object trajectories rather than robot-specific actions, it can realize zero-shot deployment on different robot platforms, including Flexiv Rizon 4 and UR5. On book placement, the first two stages remain fully successful on both platforms.
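One common way to realize this kind of object-centric transfer is to hold the grasp transform fixed and convert predicted object poses into end-effector targets, which each robot then reaches with its own inverse kinematics. A minimal sketch of that idea, with illustrative names (the paper's actual deployment code may differ):

```python
import numpy as np

def ee_targets_from_object_traj(obj_poses_world, T_obj_to_ee):
    """Turn a predicted object-pose trajectory into end-effector targets.

    obj_poses_world: list of (4, 4) absolute object poses in the world frame.
    T_obj_to_ee:     (4, 4) grasp transform from the object frame to the
                     gripper frame, fixed once the object is grasped.
    """
    return [T_obj @ T_obj_to_ee for T_obj in obj_poses_world]

# With an identity grasp transform, the gripper simply tracks the object.
traj = [np.eye(4)]
targets = ee_targets_from_object_traj(traj, np.eye(4))
```

Because the targets live in the world frame rather than any robot's joint space, the same predicted trajectory can be executed on platforms as different as the Flexiv Rizon 4 and the UR5.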
```bibtex
@article{zou2026activeglasses,
  title={ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration},
  author={Zou, Yanwen and Shi, Chenyang and Yu, Wenye and Xue, Han and Lv, Jun and Pan, Ye and Wen, Chuan and Lu, Cewu},
  journal={Manuscript},
  year={2026}
}
```