HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Haozhuo Zhang1, Jingkai Sun1, Michele Caprio2, Jian Tang1, Shanghang Zhang3,
Qiang Zhang1,4*, Wei Pan2*
1X-Humanoid   2The University of Manchester   3Peking University   4HKUST (Guangzhou)
*Corresponding author

Abstract

We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric RGB camera observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms the prior state of the art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints.

Training Process

Multi-stage training pipeline of HumanoidVerse. Stage 1 trains a teacher model via RL for single-object rearrangement. Stage 2 extends this teacher to release the object and step back. Stage 3 trains a second teacher model to manipulate a second object, starting from the end state of Stage 2. In Stage 4, a student VLA model that takes real-time visual and language inputs is distilled from the two teachers: the first teacher supervises the initial rearrangement, and the second takes over after the release and retreat. Both are thereby unified into a single student model for full multi-object rearrangement.
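The teacher hand-off in Stage 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Phase`, `select_teacher`, `distillation_loss`, and `distill_step` are hypothetical, the teachers are stand-in callables, and the mean-squared-error objective on actions is an assumption made only for illustration. The sketch shows the one idea stated above, namely that the supervising teacher is chosen by whether the release and retreat has already happened.

```python
from enum import Enum
from typing import Callable, Sequence

Action = Sequence[float]
# A teacher maps some state representation to a target action.
# What the real teachers observe is not specified here (assumption).
Teacher = Callable[[Sequence[float]], Action]


class Phase(Enum):
    """Hypothetical phase flag for one two-object episode."""
    FIRST_OBJECT = 1   # initial rearrangement, up to release and retreat
    SECOND_OBJECT = 2  # everything after release and retreat


def select_teacher(phase: Phase, teacher_a: Teacher, teacher_b: Teacher) -> Teacher:
    """Pick the supervising teacher for the current phase."""
    return teacher_a if phase is Phase.FIRST_OBJECT else teacher_b


def distillation_loss(student_action: Action, teacher_action: Action) -> float:
    """Mean squared error between student and teacher actions (assumed objective)."""
    if len(student_action) != len(teacher_action):
        raise ValueError("student and teacher actions must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_action, teacher_action)]
    return sum(squared) / len(squared)


def distill_step(
    phase: Phase,
    teacher_state: Sequence[float],
    student_action: Action,
    teacher_a: Teacher,
    teacher_b: Teacher,
) -> float:
    """Compute the loss for one step against whichever teacher is active."""
    teacher = select_teacher(phase, teacher_a, teacher_b)
    return distillation_loss(student_action, teacher(teacher_state))
```

In an actual training loop the returned loss would be backpropagated through the student network; here it is a plain float so that the selection logic can be read in isolation.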

Experimental Results

Qualitative Comparison Between HumanVLA (Upper) and HumanoidVerse (Lower) [Bedroom]

"Move the pillow from the table to the center of the bed, then place the laptop on the desk next to the chair."

HumanVLA

HumanoidVerse

Qualitative Comparison Between HumanVLA (Upper) and HumanoidVerse (Lower) [Kitchen]

"Move the coffeemaker from the right side of the kitchen island to the right side of the countertop, then move the trashbin from the right side of the countertop to the left side of the countertop."

HumanVLA

HumanoidVerse

Qualitative Comparison Between HumanVLA (Upper) and HumanoidVerse (Lower) [Livingroom]

"Move the chair to the right side of the table, then move the yellow vase from the tv stand to the table."

HumanVLA

HumanoidVerse

Qualitative Comparison Between HumanVLA (Upper) and HumanoidVerse (Lower) [Warehouse]

"Lift the right box from the ground to the right of the bottom shelf, then move the other box on the ground to the left of the bottom shelf."

HumanVLA

HumanoidVerse

More Results of HumanoidVerse [Bedroom]

"Move the bedside table from its position to the front side of the bed,
then move the laptop from the bed to the desk."

"Move the laptop from the bed to the desk with an office chair,
then move the chair from the left side of the desk to the front side of the desk."

"Move the pillow from the office table to the left side of the bed,
then move the laptop from the bed to the desk."

More Results of HumanoidVerse [Kitchen]

"Move the coffeemaker from the kitchen island to the right side near the corner,
then move the trashbin from the right side of the countertop to the left side of the countertop."

"Move the trashbin from the right side of the leftmost counter to the floor next to the left side of the same counter,
then move the coffeemaker from the kitchen island to the right side of the countertop, near the sink."

More Results of HumanoidVerse [Livingroom]

"Move the chair from its current position to the right side of the coffee table,
then move the plant from the side table to the center of the wooden table."

"Move the chair to the right side of the table, closer to the center of the room,
then move the blue and white vase from the tv stand to the center table."

"Move the plant from the black table to the center coffee table,
then move the chair to the right side of the console table."

"Move the blue and white vase from the tv stand to the table,
then move the chair to the right side of the room next to the head of the table."

More Results of HumanoidVerse [Warehouse]

"Move the left box on the ground to the right of the left shelf,
then move the other box from the ground to the right of the right shelf."

"Move the right box to the left of the right shelf,
then lift the other box from the ground to the right of the right shelf."

"Move the left box to the left of the left shelf,
then move the other box to the left of the right shelf."

"Move the left box on the ground to the left of the bottom shelf,
then move the other box from the ground to the right of the bottom shelf."

"Move the right box to the right of the bottom shelf,
then move the other box on the ground to the left of the bottom shelf."

BibTeX

@misc{zhang2025humanoidverseversatilehumanoidvisionlanguage,
  title={HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement},
  author={Haozhuo Zhang and Jingkai Sun and Michele Caprio and Jian Tang and Shanghang Zhang and Qiang Zhang and Wei Pan},
  year={2025},
  eprint={2508.16943},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.16943},
}