Potential Web Content Identification and Classification System using NLP and Machine Learning Techniques

T. B. Lalitha; P. S. Sreeja

doi:https://doi.org/10.14445/22315381/IJETT-V71I4P235

Research Article | Open Access

Volume 71 | Issue 4 | Year 2023 | Article Id. IJETT-V71I4P235 | DOI : https://doi.org/10.14445/22315381/IJETT-V71I4P235

Received	Revised	Accepted	Published
10 Jan 2023	17 Apr 2023	21 Apr 2023	25 Apr 2023

Citation :

T. B. Lalitha, P. S. Sreeja, "Potential Web Content Identification and Classification System using NLP and Machine Learning Techniques," International Journal of Engineering Trends and Technology (IJETT), vol. 71, no. 4, pp. 403-415, 2023. Crossref, https://doi.org/10.14445/22315381/IJETT-V71I4P235

Abstract

The volume of educational content on the World Wide Web is growing rapidly, leaving users with numerous options for e-Learning content across many areas of interest. This growth motivates web data mining and classification to identify the content most relevant to a user's interests and needs. Web mining is a technique for automatically discovering and extracting patterns from data on the WWW. This paper analyzes and classifies web content based on keyword inputs, producing a database that supports a new way of recommending content to users. The proposed work scrapes freely accessible unstructured text content returned by the search engine and preprocesses it into structured data using NLP methods. The structured data is then clustered with an unsupervised learning algorithm into three sets of high-impact, average-impact, and low-impact content, which are stored in the database for future recommendation of classified web pages to users.

Keywords

e-Learning, web content mining, unsupervised learning, NLP, k-means algorithm, classification, PageRank algorithm.
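
The pipeline described in the abstract (preprocess text, derive a feature, cluster into three impact groups with k-means) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which features are clustered, so the keyword-share score, the tiny stopword list, and the function names `preprocess`, `keyword_score`, `kmeans_1d`, and `classify_pages` are all assumptions made for illustration. Scraping is omitted; the input is assumed to be already-fetched page text.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "on", "with"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (stand-in for the NLP step)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keyword_score(tokens, keyword):
    """Hypothetical relevance feature: fraction of tokens equal to the keyword."""
    if not tokens:
        return 0.0
    return Counter(tokens)[keyword.lower()] / len(tokens)


def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's k-means on scalar values.

    Centroids start evenly spaced between min and max, so the result is
    deterministic and cluster 0 has the lowest centroid, cluster k-1 the highest.
    """
    if not values:
        return [], []
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment, centroids


def classify_pages(pages, keyword):
    """Label each page text as 'low', 'average', or 'high' impact for the keyword."""
    scores = [keyword_score(preprocess(p), keyword) for p in pages]
    assignment, _ = kmeans_1d(scores, k=3)
    labels = ["low", "average", "high"]
    return [labels[a] for a in assignment]


if __name__ == "__main__":
    pages = [
        "Python tutorial: learn Python basics with Python examples",
        "Python is a popular language for data science",
        "Cooking recipes for a quick dinner",
        "Python Python Python guide",
        "Travel tips and hotel reviews",
    ]
    for page, label in zip(pages, classify_pages(pages, "python")):
        print(f"{label:8s} {page}")
```

A single scalar feature keeps the sketch short; a fuller system would more plausibly cluster multi-dimensional vectors (for example, term weights combined with a page-ranking signal), and the resulting labels would then be written to the recommendation database.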
