Approach and Techniques for Precise Prediction of N-Linked Glycosylation from Human Protein using Artificial Intelligence

© 2022 by IJETT Journal
Volume-70 Issue-12
Year of Publication : 2022
Author : Mubina Malik, Jaimin N Undavia
DOI : 10.14445/22315381/IJETT-V70I12P213

How to Cite?

Mubina Malik, Jaimin N Undavia, "Approach and Techniques for Precise Prediction of N-Linked Glycosylation from Human Protein using Artificial Intelligence," International Journal of Engineering Trends and Technology, vol. 70, no. 12, pp. 118-126, 2022. Crossref, https://doi.org/10.14445/22315381/IJETT-V70I12P213

Abstract
Glycosylation is the most common post-translational modification of proteins across all domains of life and plays a significant role in biological processes. Among its forms, N-linked glycosylation is the most crucial; it is closely associated with diseases such as cancer, diabetes, HIV infection, Alzheimer's disease, atherosclerosis, and liver cirrhosis. This article describes recent advances in the relevant biological knowledge for a computer science audience. Machine learning and deep learning techniques are key to predicting various protein modifications. A review of existing prediction models shows that several report high accuracy yet yield false positives, owing to limited biological knowledge, insufficiently updated datasets, and the techniques used. Aiming at precise prediction, this paper highlights the drawbacks of existing models and discusses the parameters and techniques needed to model a solution. In this study, three experimentally verified and up-to-date databases, UniProtKB, dbPTM, and N-GlycositeAtlas, were combined, and a window size of 21 was used. This window size is considered best for N-linked glycosylation. The CD-HIT algorithm was applied with a threshold of 0.9 to remove redundancy. After combining the datasets and removing redundancy, 11,254 unique proteins and 33,859 glycosites were obtained for further study. Similar sequence patterns at positions neighbouring the asparagine residue were identified for N-linked glycosylation. A protein sequence is composed of 20 amino acids, which must be converted into numerical form through encoding methods. Various encoding methods for N-linked glycosylation are discussed. Alongside biological features, amino acid encoding methods such as the substitution-matrix-based Position-Specific Scoring Matrix (PSSM) and the physicochemical-property encoding VHSE8 are vital for improving the accuracy of N-linked glycosylation prediction.
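
To make the window and encoding steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes each 21-residue window is centred on an asparagine (N) and that positions beyond the sequence termini are padded with a placeholder symbol. The names `asparagine_windows`, `one_hot`, and `PAD`, and the toy sequence, are hypothetical. One-hot encoding is used here only as a simple stand-in for numerical encoding; PSSM and VHSE8 require external alignment data or property tables that are not reproduced here.

```python
from typing import List, Tuple

WINDOW = 21                            # window size stated in the abstract
HALF = WINDOW // 2                     # 10 residues on each side of the centre
PAD = "-"                              # assumed padding symbol past the termini
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues


def asparagine_windows(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, 21-residue window) for every N in the sequence."""
    padded = PAD * HALF + sequence + PAD * HALF
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "N":
            # Residue i sits at index i + HALF in the padded string,
            # so the slice [i, i + WINDOW) is centred on it.
            windows.append((i + 1, padded[i:i + WINDOW]))
    return windows


def one_hot(window: str) -> List[List[int]]:
    """Encode a window as a len(window) x 20 binary matrix.

    Padding symbols (or any non-standard residue) become all-zero rows.
    """
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS]
            for residue in window]


if __name__ == "__main__":
    toy = "MKTNGSAVLLNPTQ"  # made-up sequence, for illustration only
    for position, window in asparagine_windows(toy):
        matrix = one_hot(window)
        print(position, window, f"{len(matrix)}x{len(matrix[0])}")
```

Each candidate site then becomes a fixed-size 21 x 20 matrix that a classifier could consume. In practice, the one-hot rows would be replaced or augmented by PSSM scores or VHSE8 descriptors, as the abstract suggests.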

Keywords
Artificial intelligence, Deep learning, Human protein, Machine learning, N-linked glycosylation.
