Machine Learning Identification of Regulatory Signatures in Oncogene-Driven Transcriptomic Remodeling

Mahesh Pillai; Varun R. Rao; Viktor Erickson; Kang Qiu

Authors

Mahesh Pillai, Department of Computer Science, University of New Hampshire, Durham, NH, USA.
Varun R. Rao, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
Viktor Erickson, Department of Computer Science, University of Central Florida, Orlando, FL, USA.
Kang Qiu, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA.

Keywords:

machine learning, regulatory signatures, transcriptomic remodeling, oncogene, gene regulatory networks, interpretability, fairness, computational infrastructure

Abstract

The advent of high-throughput transcriptomic profiling has generated vast repositories of gene expression data, yet the extraction of interpretable regulatory signatures that underlie oncogene-driven transcriptional remodeling remains a formidable challenge. Machine learning methods, particularly deep learning architectures, have demonstrated remarkable capacity to model the non-linear and combinatorial interactions that characterize gene regulatory networks. This paper presents a system-level examination of the design, deployment, and governance of machine learning frameworks for identifying regulatory signatures in cancer transcriptomes. We argue that the utility of these models is not solely a function of predictive accuracy but is critically shaped by structural trade-offs involving model interpretability, data heterogeneity, sample size, and computational infrastructure. Through a multi-dimensional analysis that spans architectural choices, training stability, feature selection, and cross-study generalization, we explore how different modeling paradigms capture distinct aspects of regulatory logic. The role of attention mechanisms, graph neural networks, and sparse regularization is assessed in the context of reconstructing transcription factor binding profiles and enhancer-promoter interactions. Infrastructure considerations such as distributed computing, reproducibility, and version control for large-scale RNA-seq data pipelines are discussed as essential components of robust translational research. Furthermore, we examine the ethical and policy implications of deploying such models in clinical decision-making, including fairness across ancestrally diverse populations, transparency in model interpretation, and the risk of reinforcing systemic biases embedded in publicly available genomic databases. By framing the problem within a broader socio-technical context, this work highlights the need for interdisciplinary stewardship of machine learning tools in oncogenomics.
