Cross-Modal Scene Semantics and Graph Attention Networks for Human Motion Intention Prediction

Cody C. Hansen; Zhen Ding; Tejas Mishra

Authors

Cody C. Hansen, School of Information Technology, University of Cincinnati, Cincinnati, OH, USA.
Zhen Ding, Department of Computer Science, University of North Texas, Denton, TX, USA.
Tejas Mishra, Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA.

Keywords:

human motion prediction, cross-modal fusion, graph attention networks, scene semantics, autonomous systems, socio-technical infrastructure, fairness

Abstract

Human motion intention prediction is a fundamental capability for autonomous systems operating in shared environments, such as autonomous vehicles, service robots, and intelligent surveillance. Traditional trajectory forecasting approaches primarily rely on observed motion history and simple spatial interactions, often neglecting the rich semantic information embedded in the surrounding scene and the complex relational structure among multiple agents. This paper proposes a comprehensive framework that integrates cross-modal scene semantics with graph attention networks to predict human motion intentions. The architecture fuses visual, depth, and semantic segmentation streams to construct a high-dimensional scene representation, which is then processed through a graph attention network that models dynamic inter-agent and agent-scene relationships. We discuss the structural trade-offs inherent in designing such a system, including the balance between computational latency and prediction accuracy, the fusion strategies for heterogeneous sensor modalities, and the scalability of graph attention mechanisms to dense crowds. Deployment considerations such as real-time inference on edge devices, robustness to sensor degradation, and sustainability of training data pipelines are examined. Furthermore, we address governance and fairness implications, particularly regarding biases in scene semantics and the equitable treatment of diverse pedestrian populations. Through a systems-oriented analysis, this paper highlights how cross-modal scene understanding and relational graph modeling can together enhance the reliability and interpretability of motion intention prediction, while also outlining open challenges for large-scale deployment in socio-technical infrastructures.
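As a rough illustration of the pipeline outlined in the abstract, the sketch below fuses per-agent visual, depth, and semantic features and passes them through a single graph-attention head over agent and scene nodes. It is not the authors' implementation: the function names (`fuse`, `gat_head`), the concatenation-based fusion, the node labels, and all numeric values are assumptions chosen for illustration, and the weights are fixed by hand rather than learned.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def fuse(visual, depth, semantic):
    """Late fusion by concatenation (assumed; the simplest possible strategy)."""
    return visual + depth + semantic


def gat_head(features, neighbors, W, a):
    """One graph-attention head.

    features:  node id -> fused feature vector
    neighbors: node id -> list of node ids it attends to (including itself)
    W:         shared linear projection, one row per output dimension
    a:         attention vector of length 2 * output dimensions
    Returns (updated node features, attention weights per node).
    """
    projected = {i: matvec(W, h) for i, h in features.items()}
    updated, attention = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalised score: LeakyReLU(a . [W h_i || W h_j])
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighbourhood of node i
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        # Attention-weighted sum of the projected neighbour features
        dim = len(projected[i])
        updated[i] = [
            sum(al * projected[j][d] for al, j in zip(alphas, nbrs))
            for d in range(dim)
        ]
    return updated, attention


# Hypothetical toy scene: two pedestrians and one scene node (a crosswalk).
features = {
    "ped_a": fuse([0.9, 0.1], [0.5], [1.0, 0.0]),
    "ped_b": fuse([0.2, 0.8], [0.7], [0.0, 1.0]),
    "crosswalk": fuse([0.5, 0.5], [0.6], [1.0, 1.0]),
}
neighbors = {
    "ped_a": ["ped_a", "ped_b", "crosswalk"],
    "ped_b": ["ped_b", "ped_a", "crosswalk"],
    "crosswalk": ["crosswalk", "ped_a", "ped_b"],
}
W = [
    [0.5, -0.2, 0.1, 0.3, 0.0],
    [0.1, 0.4, -0.3, 0.0, 0.2],
]
a = [0.3, -0.1, 0.2, 0.4]

updated, attention = gat_head(features, neighbors, W, a)
```

In this sketch, `attention["ped_a"]` records how much weight the first pedestrian places on itself, the other pedestrian, and the scene node. Per-edge weights of this kind are one way a graph attention model can expose which relationships drove a prediction.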

