Physics-Grounded Human Motion Forecasting with 3D Scene-Aware Diffusion Models for Embodied AI

Authors

  • Otis Thornton, Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA.
  • Akshay Rao, Department of Computer Science, University of North Texas, Denton, TX, USA.

Keywords

human motion forecasting, embodied AI, diffusion models, physics grounding, 3D scene understanding, socio-technical systems

Abstract

Human motion forecasting is a core capability for embodied artificial intelligence (AI) systems that must operate safely and adaptively in dynamic human environments. Despite significant progress in sequence modeling and generative architectures, existing approaches often produce trajectories that are kinematically plausible yet physically inconsistent, violating basic constraints of gravity, contact, and object permanence. This paper presents a systems-level analysis of physics-grounded human motion forecasting with 3D scene-aware diffusion models. We argue that integrating differentiable physics simulators with large-scale diffusion backbones enables the generation of motions that are both visually coherent and mechanically feasible within a given spatial context. We examine three issues: the architectural trade-offs of coupling high-dimensional latent representations with explicit physics losses, the challenges of constructing large-scale annotated datasets that capture both motion and scene geometry, and the infrastructure requirements for real-time deployment in robotics and interactive applications. We also discuss the socio-technical implications of such systems, including fairness in motion prediction across diverse populations, robustness under distribution shift, and the governance frameworks needed to ensure responsible use in public-facing embodied agents. By situating technical advances within broader considerations of sustainability, interpretability, and ethical deployment, the paper offers a roadmap for future research that balances predictive fidelity with practical constraints. The analysis draws on recent advances in diffusion-based generative modeling, physics simulation, and scene understanding to propose a unified framework for embodied AI that respects the physical laws governing human movement.

Published

2026-05-21

How to Cite

Otis Thornton, & Akshay Rao. (2026). Physics-Grounded Human Motion Forecasting with 3D Scene-Aware Diffusion Models for Embodied AI. Computer Science and Engineering Transactions, 4(1). Retrieved from https://csetx.org/index.php/cset/article/view/144