Optimized Horizontal and Vertical Dimension Selection using Hybrid Sampling and Quadratic Discriminant Analysis for Predicting Software Faults

Yuvaraj K; Balaji N V

doi:https://doi.org/10.14445/22315381/IJETT-V73I6P127

Research Article | Open Access

Volume 73 | Issue 6 | Year 2025 | Article Id. IJETT-V73I6P127

Received: 14 Mar 2025 | Revised: 12 May 2025 | Accepted: 07 Jun 2025 | Published: 28 Jun 2025

Citation :

Yuvaraj K, Balaji N V, "Optimized Horizontal and Vertical Dimension Selection using Hybrid Sampling and Quadratic Discriminant Analysis for Predicting Software Faults," International Journal of Engineering Trends and Technology (IJETT), vol. 73, no. 6, pp. 318-335, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I6P127

Abstract

Software fault prediction is an important research area that aims to identify faulty software modules by analysing their measured attributes. Its goal is to ensure maximum software quality with minimum time, effort, cost, and testing resources. As in any data-driven application, the quality of the input data strongly influences the prediction results. Software defect datasets inherently suffer from several problems, such as class imbalance, irrelevant and redundant attributes, and noisy instances. Such poor-quality input degrades the performance of the underlying prediction model and produces erroneous predictions. This paper presents a data preprocessing methodology that addresses these problems by carefully selecting both the horizontal dimension (instances) and the vertical dimension (attributes) of the dataset to ensure the quality of the input data. To handle class imbalance in the horizontal dimension, hybrid sampling is applied, combining SMOTE for oversampling with random under-sampling. The edited k nearest neighbour rule is then used to remove noisy instances. In the vertical dimension, significant attributes are identified by applying quadratic discriminant analysis. Several datasets were used in the experimental study to assess the performance of the proposed preprocessing model. The findings show that the proposed model performs better because it maintains the quality of the preprocessed dataset. The comparative analysis indicates that the proposed model overcomes these difficulties and predicts software module defects more accurately, improving AUC values by 2.6% to 5.2%.
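The horizontal (instance-level) steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the order of the steps, the balancing target (both classes meet at the midpoint of their original sizes), the value k = 3, the decision to apply the edited nearest neighbour rule to both classes, and all function names are assumptions made for illustration. The sketch also assumes binary labels (0 = non-faulty, 1 = faulty) and at least two minority instances.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k, rng):
    """Synthesise n_new points by interpolating between minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        ranked = sorted(minority, key=lambda p: math.dist(base, p))
        # ranked[0] is the base point itself (distance 0), so skip it.
        neighbour = rng.choice(ranked[1:k + 1])
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, majority_label=0, k=3, seed=0):
    """Oversample the minority with SMOTE and randomly undersample the majority."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [x for x, label in zip(X, y) if label == majority_label]
    # Assumed target: both classes meet at the midpoint of their original sizes.
    target = (len(minority) + len(majority)) // 2
    minority = minority + smote(minority, target - len(minority), k, rng)
    majority = rng.sample(majority, target)
    labels = [minority_label] * target + [majority_label] * target
    return minority + majority, labels


def edited_nearest_neighbours(X, y, k=3):
    """Drop every instance whose label disagrees with the vote of its k nearest neighbours."""
    kept_X, kept_y = [], []
    for i, point in enumerate(X):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(point, X[j]))
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y[i]:
            kept_X.append(point)
            kept_y.append(y[i])
    return kept_X, kept_y


def hybrid_sample(X, y, k=3, seed=0):
    """Balance the classes first, then filter noisy instances (assumed order)."""
    X_bal, y_bal = balance(X, y, k=k, seed=seed)
    return edited_nearest_neighbours(X_bal, y_bal, k=k)


# Toy data (hypothetical): 40 non-faulty modules near (0, 0), 8 faulty near (3, 3).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_bal, y_bal = balance(X, y)
X_clean, y_clean = hybrid_sample(X, y)
print("original:", Counter(y))
print("balanced:", Counter(y_bal))
print("after noise filtering:", Counter(y_clean))
```

Because SMOTE interpolates between existing minority instances, every synthetic point stays inside the region already occupied by the minority class. The edited nearest neighbour pass afterwards can only remove instances, so the final class counts may be slightly uneven again.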

Keywords

Class imbalance, Edited k nearest neighbour rule, Quadratic discriminant analysis, Random sampling, Software defects, Software fault prediction.
