Enhancing Dropout Prediction with RUSBoost and BalanceCascade Algorithms: Tackling Class Imbalance in Real-World Educational Data in South Korea
© 2024 by IJETT Journal
Volume-72 Issue-12
Year of Publication: 2024
Author: Haewon Byeon
DOI: 10.14445/22315381/IJETT-V72I12P119
How to Cite?
Haewon Byeon, "Enhancing Dropout Prediction with RUSBoost and BalanceCascade Algorithms: Tackling Class Imbalance in Real-World Educational Data in South Korea," International Journal of Engineering Trends and Technology, vol. 72, no. 12, pp. 215-226, 2024. Crossref, https://doi.org/10.14445/22315381/IJETT-V72I12P119
Abstract
Class imbalance presents a significant challenge in machine learning, especially in educational data analytics, where minority-class instances are often the ones that matter most. This study compares two advanced techniques for addressing class imbalance, RUSBoost and BalanceCascade, on real-world educational datasets from South Korea. We used data from the Korean Educational Longitudinal Study (KELS) covering 2013 to 2021, focusing on 4,385 first-year university students in 2021. The datasets were preprocessed and grouped into personal, family, and school factors. We implemented RUSBoost and BalanceCascade and evaluated each with four base learners: Random Forest, Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Logistic Regression. Models were assessed with multiple performance metrics: Area Under the Receiver Operating Characteristic Curve (A-ROC), Area Under the Precision-Recall Curve (A-PRC), the Kolmogorov-Smirnov (K-S) statistic, and the F-measure. RUSBoost demonstrated superior performance across most metrics and datasets, excelling in particular on A-ROC and the K-S statistic, and consistently outperformed BalanceCascade, demonstrating its robustness and efficiency. BalanceCascade, while competitive, performed slightly worse, especially on A-PRC and F-measure. The comparative analysis showed that RUSBoost's simpler, faster approach makes it the more practical choice for handling class imbalance in the educational sector. The findings suggest that RUSBoost is a highly effective method for improving classification performance on imbalanced datasets, and its simplicity and efficiency suit real-world applications such as educational data analytics. Future research should explore further enhancements to these techniques and their applicability in other domains. This study provides valuable insights into selecting appropriate methods for handling class imbalance, contributing to the development of fair and accurate predictive models.
Keywords
Class Imbalance, RUSBoost, BalanceCascade, Real-World Data, Machine Learning.
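To illustrate the resampling-plus-boosting scheme the abstract describes, below is a minimal pure-Python sketch of RUSBoost. It is not the implementation used in the study: decision stumps stand in for the four base learners, labels are assumed to be ±1, and each boosting round simply pairs uniform random undersampling of the majority class with a standard AdaBoost-style weight update.

```python
import math
import random

def undersample(idx_by_class, rng):
    """Random undersampling: keep all minority indices and draw an
    equal number of majority indices without replacement."""
    small, big = sorted(idx_by_class, key=len)
    return small + rng.sample(big, len(small))

def fit_stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, polarity)
    that minimises weighted 0/1 error. Labels are +1 / -1."""
    best = (float("inf"), 0, 0.0, 1)
    for f in range(len(X[0])):
        for thr in {row[f] for row in X}:
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] >= thr else -pol) != yi)
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best[1:]  # (feature, threshold, polarity)

def stump_out(stump, row):
    f, thr, pol = stump
    return pol if row[f] >= thr else -pol

def rusboost(X, y, rounds=10, seed=0):
    """RUSBoost sketch: each round trains the base learner on a
    balanced random subsample, but the boosting weights are updated
    and normalised over the FULL training set."""
    rng = random.Random(seed)
    n = len(y)
    w = [1.0 / n] * n
    pos = [i for i in range(n) if y[i] == 1]
    neg = [i for i in range(n) if y[i] == -1]
    ensemble = []
    for _ in range(rounds):
        keep = undersample((pos, neg), rng)
        stump = fit_stump([X[i] for i in keep], [y[i] for i in keep],
                          [w[i] for i in keep])
        # Weighted error on the full (imbalanced) training set.
        err = sum(w[i] for i in range(n) if stump_out(stump, X[i]) != y[i])
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        for i in range(n):
            w[i] *= math.exp(-alpha * y[i] * stump_out(stump, X[i]))
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, row):
    score = sum(a * stump_out(s, row) for a, s in ensemble)
    return 1 if score >= 0 else -1
```

Undersampling inside the boosting loop, rather than once up front, is what distinguishes RUSBoost from plain undersampling: every round sees a different balanced subsample, so majority-class information discarded in one round can still influence later rounds through the weight updates.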