Cross-Embodied Vision-Language-Action World Models for Autonomous Driving and Intelligent Robotics Transfer Learning
Keywords:
world models, vision-language-action, transfer learning, autonomous driving, intelligent robotics, socio-technical systems
Abstract
The convergence of vision-language-action models with world model architectures has opened a new frontier in autonomous systems, enabling agents to perceive, reason, and act across diverse embodiments. This paper introduces the concept of cross-embodied vision-language-action world models, a unified framework that facilitates transfer learning between autonomous driving platforms and general-purpose intelligent robots. We argue that such models can overcome the data scarcity that has traditionally constrained each embodiment by leveraging shared representations of spatial semantics, task goals, and causal dynamics. The paper examines architectural trade-offs between monolithic and modular world models, the infrastructure requirements for scaling cross-embodied training, and the governance challenges posed by deploying heterogeneous fleets of autonomous agents. We analyze how structural differences between autonomous driving and robotics in perception modalities, action spaces, and environmental contexts affect transfer efficiency and robustness. Through a detailed discussion of system-level design choices, including latent state compression, reward shaping, and sim-to-real continuity, we highlight the importance of balancing generalization capacity with task-specific precision. We critically assess policy implications for safety certification, fairness of behavior across socio-economic contexts, and the long-term sustainability of large-scale model training. We conclude by outlining future research directions for building adaptive, resilient, and ethically aligned cross-embodied intelligence.
Copyright (c) 2026 Computer Science and Engineering Transactions

This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



