PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control

1The University of Manchester   2X-Humanoid   3Peking University   4Northeastern University
*Corresponding authors


Abstract

We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states—such as 3D keypoints or joint angles—from a single RGB image, eliminating the need for multi-stage pipelines or auxiliary modalities. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics: by conditioning on sparse video keyframes generated by world models, it produces smooth and continuous long-horizon action sequences through an overlap-averaging strategy. This unified design enables scalable and efficient integration of perception and control. On the DREAM dataset, PoseDiff achieves state-of-the-art accuracy and real-time performance for pose estimation. On Libero-Object manipulation tasks, it substantially improves success rates over existing inverse dynamics modules, even under strict offline settings. Together, these results show that PoseDiff provides a scalable, accurate, and efficient bridge between perception, planning, and control in embodied AI.
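To make the estimation pathway concrete, the sketch below shows how a conditional diffusion model of this kind can be sampled at inference time: a noisy state vector (e.g., flattened 3D keypoints or joint angles) is iteratively denoised, conditioned on features extracted from a single RGB image. The DDPM-style schedule, the `denoiser` interface, and the state dimension are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of conditional diffusion sampling for single-image pose estimation.
# Assumptions (not from the paper): a DDPM-style linear beta schedule and a
# hypothetical `denoiser(x, t, image_feats)` that predicts the added noise.
import torch

@torch.no_grad()
def sample_pose(denoiser, image_feats, state_dim=21, steps=1000, device="cpu"):
    """Iteratively denoise a random vector into a robot state (e.g., 7 keypoints x 3)."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, state_dim, device=device)        # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, image_feats)          # noise prediction, image-conditioned
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                             # denoised 3D keypoints / joint angles
```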

Framework Comparison


Framework comparison between PoseDiff and existing robot pose estimation methods. HoRoPose requires depth prediction followed by keypoint estimation (44 ms per image); RoboPEPP relies on joint masking and an encoder-decoder predictor (23 ms); RoboKeyGen first estimates 2D keypoints and then lifts them to 3D via a diffusion model (54 ms). In contrast, PoseDiff directly maps a single RGB image to 3D robot keypoints in an end-to-end manner, achieving faster inference (14 ms).

Model Architecture


PoseDiff architecture. Visual features from a ResNet and timestep embeddings are fused via a condition encoder with FiLM modulation, guiding the denoising of noisy 3D keypoints into clean pose estimates.
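As a rough illustration of the FiLM-style conditioning described above, the block below fuses a visual feature vector with a timestep embedding and uses the result to scale and shift the denoiser's hidden features. All dimensions and layer choices are assumptions made for the sketch, not the released architecture.

```python
# Sketch of FiLM conditioning: a condition encoder fuses ResNet image features with a
# timestep embedding; the fused vector produces per-channel scale/shift parameters
# that modulate the denoiser's hidden features. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Fuses image features with a timestep embedding into one conditioning vector."""
    def __init__(self, img_dim=2048, t_dim=128, cond_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(img_dim + t_dim, cond_dim), nn.SiLU())

    def forward(self, img_feats, t_emb):
        return self.proj(torch.cat([img_feats, t_emb], dim=-1))

class FiLMBlock(nn.Module):
    """Applies feature-wise affine modulation (gamma * h + beta) to denoiser features."""
    def __init__(self, hidden_dim=256, cond_dim=512):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)   # predicts gamma and beta
        self.body = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.SiLU())

    def forward(self, h, cond):
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.body(gamma * h + beta)
```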

Inverse Dynamics Model


PoseDiff as an inverse dynamics model: (a) the world model generates sparse video frames from an initial image and language instruction; (b) PoseDiff fills in dense actions between frame pairs, averaging overlaps for smooth and consistent trajectories.
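To make the overlap-averaging step concrete, the sketch below shows one plausible way to stitch per-keyframe-pair action chunks into a single dense trajectory: chunks predicted for adjacent keyframe pairs overlap in time, and overlapping predictions are averaged. The chunk length, stride, and `predict_chunk` interface are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of overlap-averaging: each consecutive keyframe pair yields a dense action chunk;
# adjacent chunks overlap in time and their overlapping predictions are averaged.
# `predict_chunk`, chunk_len, stride, and action_dim are illustrative assumptions.
import numpy as np

def stitch_actions(keyframes, predict_chunk, chunk_len=16, stride=8, action_dim=7):
    num_pairs = len(keyframes) - 1
    total_len = stride * (num_pairs - 1) + chunk_len
    acc = np.zeros((total_len, action_dim))
    counts = np.zeros((total_len, 1))

    for i in range(num_pairs):
        chunk = predict_chunk(keyframes[i], keyframes[i + 1])   # (chunk_len, action_dim)
        start = i * stride
        acc[start:start + chunk_len] += chunk
        counts[start:start + chunk_len] += 1.0

    return acc / counts     # averaged wherever chunks overlap
```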

Experimental Results


Robot Pose Estimation

Comparison on Libero-Object ("Pick up the alphabet soup and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Dense Action Sequence
Generated by RoboPEPP

Dense Action Sequence
Generated by Seer

Visualization on Libero-Object ("Pick up the cream cheese and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Visualization on Libero-Object ("Pick up the salad dressing and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

Visualization on Libero-Object ("Pick up the ketchup and place it in the basket.")

Sparse Video Frames
Generated by World Model

Dense Action Sequence
Generated by PoseDiff

BibTeX

@misc{zhang2025posediffunifieddiffusionmodel,
  title={PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control},
  author={Haozhuo Zhang and Michele Caprio and Jing Shao and Qiang Zhang and Jian Tang and Shanghang Zhang and Wei Pan},
  year={2025},
  eprint={2509.24591},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.24591},
}