WholeBodyVLA: Towards Unified Latent VLA for Whole-body Loco-manipulation Control

Haoran Jiang1,2,4*, Jin Chen1,2,4*, Qingwen Bu2, Li Chen2, Modi Shi4,2, Yanjie Zhang3, Delong Li3, Chuanzhe Suo3, Chuang Wang3, Zhihui Peng3†, Hongyang Li2†
1Fudan University, 2OpenDriveLab & MMLab at The University of Hong Kong, 3AgiBot, 4SII
*Equal contribution     †Project co-lead
WholeBodyVLA Overview

Overview of WholeBodyVLA. We introduce WholeBodyVLA, a humanoid system that operates on the AgiBot X2 robot and performs end-to-end humanoid loco-manipulation in large spaces for the first time. The proposed system completes consecutive tasks autonomously, including (a-c) basic bimanual grasping, side-stepping toward the box, and squatting to place it; (d-e) squatting to grasp and lift the box, then turning to place it onto the cart; and (f-h) grasping the cart handle, pushing the cart forward, and pushing a load of more than 50 kg.

Method Overview

WholeBodyVLA Method

Pipeline of WholeBodyVLA. LAM is pretrained on manipulation and manipulation-aware locomotion videos, yielding unified latent supervision for the VLM. Meanwhile, the LMO RL policy is trained for precise and stable locomotion under disturbances. At runtime, egocentric images and language instructions are encoded by the VLM into latent action tokens, which are decoded (~10 Hz) into (i) dual-arm joint actions and (ii) locomotion commands executed by LMO at 50 Hz, enabling robust whole-body loco-manipulation.
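To make the two-rate design concrete, below is a minimal Python sketch of the runtime loop under stated assumptions: the robot, vlm, decoder, and lmo objects and all their methods are hypothetical stand-ins, not the actual WholeBodyVLA API. The VLM produces latent action tokens at ~10 Hz; each decoded chunk yields dual-arm joint actions plus a locomotion command that LMO tracks at 50 Hz, i.e., five low-level steps per chunk.

    # Minimal sketch of the two-rate runtime loop described above. All
    # interfaces (robot, vlm, decoder, lmo) are hypothetical placeholders,
    # not the authors' actual API.
    import time

    VLM_HZ = 10                          # latent decoding rate (~10 Hz)
    LMO_HZ = 50                          # low-level locomotion policy rate
    STEPS_PER_CHUNK = LMO_HZ // VLM_HZ   # 5 LMO steps per decoded action chunk

    def run_loop(robot, vlm, decoder, lmo, instruction):
        while not robot.task_done():
            # High level: encode the egocentric image and language instruction
            # into latent action tokens.
            tokens = vlm.encode(robot.get_egocentric_image(), instruction)
            # Decode latents into (i) dual-arm joint actions and
            # (ii) a locomotion command.
            arm_actions, loco_cmd = decoder.decode(tokens)
            # Low level: LMO tracks the locomotion command at 50 Hz while the
            # arms execute their decoded joint targets.
            for t in range(STEPS_PER_CHUNK):
                leg_targets = lmo.step(loco_cmd, robot.get_proprioception())
                arm_target = arm_actions[min(t, len(arm_actions) - 1)]
                robot.apply(arm_target, leg_targets)
                time.sleep(1.0 / LMO_HZ)

One appeal of this decoupling is that semantic decision-making can run at the slower VLM rate while balance-critical locomotion control stays at 50 Hz, which is what keeps the robot stable under disturbances.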

Real-world Demos

WholeBodyVLA Performance in Complex Tasks

Task 1: Bag Packing

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA under visual variation

Failure Cases of Baseline Methods

❌ Stumbles when stopping
❌ Loses balance and kicks the box

Task 2: Box Loading

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA with an unseen object

Failure Cases of Baseline Methods

❌ Stumbles when stopping
❌ Loses balance and deviates sharply from the intended direction

Task 3: Cart Pushing

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA with an unseen heavy load

Failure Cases of Baseline Methods

❌ Deviates from the intended direction
❌ Stops too late

Experiments on Robot Generalization and Capability Showcases

Adaptability & Scalability

Generalization Experiments

1. Object Generalization

Demonstrate WholeBodyVLA's robustness to variations in object appearance and position, scene layout, and table color.

2. Start-Pose Generalization

Showcase WholeBodyVLA's ability to compose forward walking, sidestepping, turning, and squatting to handle diverse start poses (X/Y offsets, orientations, and table heights).

X-Axis Distance

X-axis Distance Generalization Experiment 1
X-axis Distance Generalization Experiment 2 (w/ unseen table color)

Y-Axis Distance

Y-axis Distance Generalization Experiment 1
Y-axis Distance Generalization Experiment 2 (w/ unseen table color)

Orientation

Orientation Generalization Experiment 1
Orientation Generalization Experiment 2 (w/ unseen table color)

Height

Height Generalization Experiment

3. Terrain Generalization

Demonstrate WholeBodyVLA's ability to traverse uneven terrain.

Long-Horizon Bimanual Manipulation

Demonstrate WholeBodyVLA's competence on long-horizon sequences that involve loco-manipulation and whole-body coordinated actions.

Long-Horizon Bimanual Manipulation with Coordination

What's More

Showcase WholeBodyVLA's scalability to more complex everyday loco-manipulation tasks, such as wiping and vacuum cleaning.