PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control

1The University of Manchester   2X-Humanoid   3Peking University   4Northeastern University
*Corresponding authors


Abstract

We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states—such as 3D keypoints or joint angles—from a single RGB image, eliminating the need for multi-stage pipelines or auxiliary modalities. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics: by conditioning on sparse video keyframes generated by world models, it produces smooth and continuous long-horizon action sequences through an overlap-averaging strategy. This unified design enables scalable and efficient integration of perception and control. On the DREAM dataset, PoseDiff achieves state-of-the-art accuracy and real-time performance for pose estimation. On Libero-Object manipulation tasks, it substantially improves success rates over existing inverse dynamics modules, even under strict offline settings. Together, these results show that PoseDiff provides a scalable, accurate, and efficient bridge between perception, planning, and control in embodied AI.
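To make the estimation pathway concrete, the sketch below shows how a conditional diffusion model of this kind can be sampled at inference time: a noisy state vector (e.g., flattened 3D keypoints or joint angles) is iteratively denoised, conditioned on features extracted from a single RGB image. The DDPM-style schedule, the `denoiser` interface, and the state dimension are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of conditional diffusion sampling for single-image pose estimation.
# Assumptions (not from the paper): a DDPM-style linear beta schedule and a
# hypothetical `denoiser(x, t, image_feats)` that predicts the added noise.
import torch

@torch.no_grad()
def sample_pose(denoiser, image_feats, state_dim=21, steps=1000, device="cpu"):
    """Iteratively denoise a random vector into a robot state (e.g., 7 keypoints x 3)."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, state_dim, device=device)        # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, image_feats)          # noise prediction, image-conditioned
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                             # denoised 3D keypoints / joint angles
```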

Framework Comparison


Framework comparison between PoseDiff and existing robot pose estimation methods. HoRoPose requires depth prediction followed by keypoint estimation (44 ms per image); RoboPEPP relies on joint masking and an encoder-decoder predictor (23 ms); RoboKeyGen first estimates 2D keypoints and then lifts them to 3D via a diffusion model (54 ms). In contrast, PoseDiff directly maps a single RGB image to 3D robot keypoints in an end-to-end manner, achieving faster inference (14 ms).

Model Architecture


PoseDiff architecture. Visual features from a ResNet and timestep embeddings are fused via a condition encoder with FiLM modulation, guiding the denoising of noisy 3D keypoints into clean pose estimates.
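As a rough illustration of the FiLM-style conditioning described above, the block below fuses a visual feature vector with a timestep embedding and uses the result to scale and shift the denoiser's hidden features. All dimensions and layer choices are assumptions made for the sketch, not the released architecture.

```python
# Sketch of FiLM conditioning: a condition encoder fuses ResNet image features with a
# timestep embedding; the fused vector produces per-channel scale/shift parameters
# that modulate the denoiser's hidden features. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Fuses image features with a timestep embedding into one conditioning vector."""
    def __init__(self, img_dim=2048, t_dim=128, cond_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(img_dim + t_dim, cond_dim), nn.SiLU())

    def forward(self, img_feats, t_emb):
        return self.proj(torch.cat([img_feats, t_emb], dim=-1))

class FiLMBlock(nn.Module):
    """Applies feature-wise affine modulation (gamma * h + beta) to denoiser features."""
    def __init__(self, hidden_dim=256, cond_dim=512):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)   # predicts gamma and beta
        self.body = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.SiLU())

    def forward(self, h, cond):
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.body(gamma * h + beta)
```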

Inverse Dynamics Model


PoseDiff as an inverse dynamics model: (a) the world model generates sparse video frames from an initial image and language instruction; (b) PoseDiff fills in dense actions between frame pairs, averaging overlaps for smooth and consistent trajectories.
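To make the overlap-averaging step concrete, the sketch below shows one plausible way to stitch per-keyframe-pair action chunks into a single dense trajectory: chunks predicted for adjacent keyframe pairs overlap in time, and overlapping predictions are averaged. The chunk length, stride, and `predict_chunk` interface are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of overlap-averaging: each consecutive keyframe pair yields a dense action chunk;
# adjacent chunks overlap in time and their overlapping predictions are averaged.
# `predict_chunk`, chunk_len, stride, and action_dim are illustrative assumptions.
import numpy as np

def stitch_actions(keyframes, predict_chunk, chunk_len=16, stride=8, action_dim=7):
    num_pairs = len(keyframes) - 1
    total_len = stride * (num_pairs - 1) + chunk_len
    acc = np.zeros((total_len, action_dim))
    counts = np.zeros((total_len, 1))

    for i in range(num_pairs):
        chunk = predict_chunk(keyframes[i], keyframes[i + 1])   # (chunk_len, action_dim)
        start = i * stride
        acc[start:start + chunk_len] += chunk
        counts[start:start + chunk_len] += 1.0

    return acc / counts     # averaged wherever chunks overlap
```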

Experimental Results


Robot Pose Estimation

Comparison on Libero-Object ("Pick up the alphabet soup and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Dense Action Sequence
Generated by RoboPEPP

Dense Action Sequence
Generated by Seer

Visualization on Libero-Object ("Pick up the cream cheese and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Visualization on Libero-Object ("Pick up the salad dressing and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Visualization on Libero-Object ("Pick up the ketchup and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

BibTeX

@misc{zhang2025posediffunifieddiffusionmodel,
  title={PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control},
  author={Haozhuo Zhang and Michele Caprio and Jing Shao and Qiang Zhang and Jian Tang and Shanghang Zhang and Wei Pan},
  year={2025},
  eprint={2509.24591},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.24591},
}