
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

1Shanghai Jiao Tong University, 2Shanghai Innovation Institute, 3Noematrix Ltd.
*Equal contribution. Corresponding authors.
Active Vision · Robot Manipulation · Imitation Learning · Object-centric Policy · Zero-shot Transfer
A head-mounted stereo camera is used during both human demonstration and robot deployment, while an object-centric 3D policy jointly predicts manipulation and viewpoint motion.

Overview video of the full ActiveGlasses pipeline, from egocentric human demonstration to robot deployment.

Abstract

Large-scale real-world robot data collection is a prerequisite for bringing robots into everyday deployment. However, existing pipelines often rely on specialized handheld devices, which increase operator burden, limit scalability, and fail to capture the naturally coordinated perception-manipulation behavior of daily human interaction.

ActiveGlasses introduces a head-mounted system for learning robot manipulation from ego-centric human demonstrations with active vision. A stereo camera mounted on smart glasses is used as the only perception device during both collection and inference: the human wears it during bare-hand demonstrations, and the same setup is mounted on a 6-DoF perception arm at deployment time.

To bridge the embodiment gap, the method extracts object trajectories and trains an object-centric point-cloud policy that jointly predicts manipulation and viewpoint motion. Across tasks involving occlusion and high-precision interaction, the system outperforms strong baselines under the same hardware setup and transfers across two robot platforms.

Motivation and Contributions

To truly capture human intelligence, the robot data collection process must align with natural human behavior along two fundamental dimensions: manipulation and perception. In ActiveGlasses, we propose:

Bare-hand Collection

Existing systems often require teleoperation or handheld devices. ActiveGlasses instead keeps data collection close to natural human behavior.

Active Vision

Human head motion is treated as a useful perceptual signal rather than as nuisance noise, enabling viewpoint adjustment during deployment.

Cross-embodiment Policy

The policy predicts object trajectories in task space, which enables zero-shot deployment across different robot platforms.

Method Overview

ActiveGlasses pipeline overview

The system is organized as a complete perception-manipulation pipeline. During data collection, a human operator performs tasks with bare hands while wearing smart glasses equipped with stereo vision and 6-DoF head tracking.

During post-processing, the system estimates depth, reconstructs point clouds, segments hands and manipulated objects, and extracts object trajectories in a unified world frame. The learned 3D policy then predicts both manipulation actions and head movement so that the robot can reproduce active perception at test time.
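A minimal sketch of the world-frame reconstruction step described above, under our own assumptions: depth maps, segmentation masks, camera intrinsics, and 6-DoF head poses are taken as given per frame, and the object is summarized by its point-cloud centroid (the paper's actual trajectory extraction may differ).

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points using intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def to_world(points_cam, T_world_cam):
    """Transform camera-frame points (N, 3) into the world frame with a 4x4 pose."""
    homo = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=1)
    return (homo @ T_world_cam.T)[:, :3]

def extract_object_trajectory(frames):
    """Keep only points on the manipulated object in each frame and summarize
    them as an object position in the shared world frame."""
    trajectory = []
    for frame in frames:
        depth = frame["depth"]            # from stereo depth estimation (placeholder)
        mask = frame["object_mask"]       # from hand/object segmentation (placeholder)
        points = backproject(depth, frame["K"])
        points = points[mask.reshape(-1)]
        points_world = to_world(points, frame["T_world_cam"])  # 6-DoF head pose
        trajectory.append(points_world.mean(axis=0))           # crude object center
    return np.asarray(trajectory)
```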

Hardware and interface design

Hardware and Interface

ActiveGlasses combines XREAL Air 2 Ultra for head tracking with a ZED Mini stereo camera for perception. A lightweight on-device interface supports episode control through gesture and audio feedback.

We use the XREAL device specifically for stable 6-DoF head tracking and treat the resulting head-motion signal as supervision for active-viewpoint behavior.
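As a hedged illustration of how the tracked head poses could supervise active-viewpoint behavior, the sketch below converts absolute 6-DoF poses into consecutive relative motions; the pose format (4x4 world-frame transforms sampled at the policy rate) is our assumption, not a detail stated on this page.

```python
import numpy as np

def relative_head_motion(T_world_head_t, T_world_head_t1):
    """Express the head pose at t+1 in the head frame at t, so the supervision
    target is a relative 6-DoF motion rather than an absolute world pose."""
    return np.linalg.inv(T_world_head_t) @ T_world_head_t1

def head_motion_targets(head_poses):
    """Convert a sequence of absolute head poses into consecutive relative motions."""
    return [relative_head_motion(a, b) for a, b in zip(head_poses[:-1], head_poses[1:])]
```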

Policy design and action representation

Policy Design

The policy uses point clouds in the world frame as input and predicts absolute object trajectories together with relative head motion. This representation stabilizes active-view observations and improves cross-embodiment transfer.

The dashed box and the abs/rel labels indicate the ablation space studied in the paper: we explicitly compare current-object-pose conditioning and absolute-versus-relative trajectory representations.
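To make that ablation space concrete, here is a small sketch (not the authors' code) of how absolute versus relative object-trajectory targets differ and where optional current-object-pose conditioning would enter the policy input.

```python
import numpy as np

def make_targets(object_poses, representation="absolute"):
    """Build supervision for the object-trajectory head.

    object_poses: (T, 4, 4) object poses in the world frame over one episode.
    'absolute'  -> future poses expressed directly in the world frame.
    'relative'  -> each future pose expressed relative to the current object pose.
    """
    current = object_poses[0]
    future = object_poses[1:]
    if representation == "absolute":
        return future
    return np.stack([np.linalg.inv(current) @ pose for pose in future])

def build_policy_input(point_cloud_world, current_object_pose, condition_on_pose=False):
    """The world-frame point cloud is always provided; the current object pose is
    appended only in the 'w/ current pose' ablation variant."""
    if condition_on_pose:
        return {"points": point_cloud_world, "object_pose": current_object_pose}
    return {"points": point_cloud_world}
```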

Real-world Task Illustrations

Task setup for book placement, bread insertion, and occluded water pouring

The evaluation focuses on three real-world tasks that require purposeful camera movement. In all cases, active viewpoint adjustment is necessary before the manipulation arm can reliably complete the task.

Animated result for the book placement task.

Book Placement

Animated result for the bread insertion task.

Bread Insertion

Animated result for the occluded distant water pouring task.

Occluded Distant Water Pouring

Cross-platform Book Placement

Third-person views of book placement rollouts on the two robot platforms used in the cross-embodiment evaluation.

Third-person book placement rollout on Flexiv Rizon 4.

Book Placement on Flexiv Rizon 4

Third-person book placement rollout on UR5.

Book Placement on UR5

Quantitative Results

The paper compares ActiveGlasses against a variant without active vision and a pi0.5 baseline. Under the same single active-camera setting, ActiveGlasses consistently achieves higher final-stage success rates on all three tasks.

Final-stage Success Rates

Comparison under the same single active-camera setup.

| Method | Book Placement | Bread Insertion | Occluded Pouring |
| --- | --- | --- | --- |
| ActiveGlasses | 14/20 | 11/20 | 10/20 |
| w/o active vision | 7/20 | 0/20 | 4/20 |
| pi0.5 baseline | 7/20 | 6/20 | 4/20 |

Representation Ablation

On book placement, the paper reports that absolute object trajectory prediction without current-object-pose conditioning works best.

| Representation | Book Placement |
| --- | --- |
| absolute, w/o current pose | 14/20 |
| absolute, w/ current pose | 3/20 |
| relative, w/ current pose | 10/20 |
| relative, w/o current pose | |

Ablation and Cross-embodiment Takeaways

The paper studies both representation choices and deployment transfer, showing that the policy is not only effective but also portable across platforms.

Policy Ablation

The paper finds that predicting absolute object trajectories while avoiding additional conditioning on the current object pose works better than relative alternatives. This reduces overfitting to dominant motion patterns and encourages the model to rely on scene observation.

Cross-robot Transfer

Because the policy predicts object trajectories rather than robot-specific actions, it achieves zero-shot deployment on different robot platforms, including Flexiv Rizon 4 and UR5. On book placement, the first two stages remain fully successful on both platforms.
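One way to read this transfer mechanism: a predicted world-frame object trajectory only needs a per-robot grasp transform and inverse kinematics to execute, so the robot-specific parts stay isolated. The sketch below illustrates that idea under our own assumptions; the `ik_solver` and `send_joint_command` interfaces are hypothetical placeholders rather than the actual deployment stack.

```python
import numpy as np

def object_to_ee_targets(object_traj_world, T_object_gripper):
    """Map predicted world-frame object poses (T, 4, 4) to end-effector targets
    using a grasp transform measured once at grasp time."""
    return [T_obj @ T_object_gripper for T_obj in object_traj_world]

def execute_on_robot(object_traj_world, T_object_gripper, ik_solver, send_joint_command):
    """Only IK and the joint-command interface change per platform; the predicted
    object trajectory itself is reused unchanged (hypothetical interfaces)."""
    for T_ee in object_to_ee_targets(object_traj_world, T_object_gripper):
        joints = ik_solver(T_ee)          # e.g. Flexiv Rizon 4 or UR5 IK (placeholder)
        send_joint_command(joints)
```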

BibTeX

@article{zou2026activeglasses,
  title={ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration},
  author={Zou, Yanwen and Shi, Chenyang and Yu, Wenye and Xue, Han and Lv, Jun and Pan, Ye and Wen, Chuan and Lu, Cewu},
  journal={Manuscript},
  year={2026}
}