Random Permutation-based Hybrid Feature Selection for Software Bug Prediction using Bayesian Statistical Validation

International Journal of Engineering Trends and Technology (IJETT)
  
© 2022 by IJETT Journal
Volume-70 Issue-4
Year of Publication : 2022
Authors : Tamanna, Om Prakash Sangwan
DOI : 10.14445/22315381/IJETT-V70I4P216

MLA Style: Tamanna, and Om Prakash Sangwan. "Random Permutation-based Hybrid Feature Selection for Software Bug Prediction using Bayesian Statistical Validation." International Journal of Engineering Trends and Technology, vol. 70, no. 4, Apr. 2022, pp. 188-202. Crossref, https://doi.org/10.14445/22315381/IJETT-V70I4P216

APA Style: Tamanna, & Om Prakash Sangwan. (2022). Random Permutation-based Hybrid Feature Selection for Software Bug Prediction using Bayesian Statistical Validation. International Journal of Engineering Trends and Technology, 70(4), 188-202. https://doi.org/10.14445/22315381/IJETT-V70I4P216

Abstract
Software Fault Prediction (SFP) is a key practice in developing quality software. To meet rising user expectations, software is becoming more complex and its source code is growing as new functionality is added. A strategy like SFP can help detect faults beforehand and avoid software downtime. To reduce the cost of SFP, we propose a permutation-based hybrid feature selection model (PFS). This model helps remove irrelevant and redundant features without compromising classifier performance. PFS is compared with five supervised feature selection methods: Chi-squared, Correlation, Sequential Forward Feature Selection, Sequential Backward Feature Selection, and Mutual Information. A Random Forest (RF) classifier is employed, and experimental results (Accuracy, Precision, Recall, and AUC-ROC) are reported on twenty-four datasets from three public software repositories. A Bayesian statistical analysis of the AUC-ROC results shows that PFS outperforms the other techniques, with lower computational time and lower dimensionality.
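The abstract does not spell out the steps of PFS, so the following is only a minimal sketch of the general idea of permutation-based feature selection with a Random Forest, not the authors' method. It assumes scikit-learn's `permutation_importance` as a stand-in for the permutation step, AUC-ROC as the scoring metric, and synthetic data from `make_classification` in place of the bug-prediction datasets. The function name `select_by_permutation` and the `threshold` parameter are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def select_by_permutation(X, y, threshold=0.0, n_repeats=10, seed=0):
    """Keep features whose permutation lowers validation AUC-ROC.

    Hypothetical helper: an illustration of permutation importance,
    not the PFS procedure from the paper.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)

    # Shuffle one column at a time and measure the mean drop in AUC-ROC.
    result = permutation_importance(
        model, X_val, y_val,
        scoring="roc_auc", n_repeats=n_repeats, random_state=seed,
    )
    keep = np.flatnonzero(result.importances_mean > threshold)
    if keep.size == 0:
        # Fall back to the single most important feature.
        keep = np.array([int(np.argmax(result.importances_mean))])
    return keep, result.importances_mean


def auc_with_features(X, y, columns, seed=0):
    """Held-out AUC-ROC of a Random Forest trained on the given columns."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, columns], y, test_size=0.3, stratify=y, random_state=seed + 1
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


if __name__ == "__main__":
    # Synthetic stand-in data: 3 informative features, 9 noise features.
    X, y = make_classification(
        n_samples=300, n_features=12, n_informative=3,
        n_redundant=0, shuffle=False, random_state=0,
    )
    keep, importances = select_by_permutation(X, y)
    print("selected feature indices:", keep.tolist())
    print("AUC, all features:     ", auc_with_features(X, y, np.arange(X.shape[1])))
    print("AUC, selected features:", auc_with_features(X, y, keep))
```

Running the script prints the retained feature indices and the held-out AUC-ROC for the full and reduced feature sets, which mirrors the kind of comparison the abstract describes. A study such as this one would repeat that comparison over many datasets and feature selection methods before applying a Bayesian test to the AUC-ROC differences.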

Keywords
Feature selection, Bayesian signed-rank test, ROC-AUC, Fault prediction.
