Soft Computing based Dual Way Data Modification to Deal with Data Imbalance Problem: Applied to Churn Prediction in Credit Card Users

M.A.H. Farquad; Patlolla Venkat Reddy; Mohammad Sanaullah Qaseem; Syeda Husna Mehanoor

doi:https://doi.org/10.14445/22315381/IJETT-V71I11P204


Volume 71 | Issue 11 | Year 2023 | Article Id. IJETT-V71I11P204

Received: 03 Aug 2023 | Revised: 20 Sep 2023 | Accepted: 20 Oct 2023 | Published: 04 Nov 2023

Citation:

M.A.H. Farquad, Patlolla Venkat Reddy, Mohammad Sanaullah Qaseem, Syeda Husna Mehanoor, "Soft Computing based Dual Way Data Modification to Deal with Data Imbalance Problem: Applied to Churn Prediction in Credit Card Users," International Journal of Engineering Trends and Technology (IJETT), vol. 71, no. 11, pp. 33-44, 2023. https://doi.org/10.14445/22315381/IJETT-V71I11P204

Abstract

Data generated by industry is often imbalanced: it contains few or no samples for the customers who matter most to the business, and whom the business cannot afford to lose to competitors. This makes it very difficult to learn who is important and who is not. Soft computing algorithms also tend to produce sub-optimal solutions when trained on imbalanced data. This paper proposes a data modification procedure to deal with the data imbalance problem. The proposed approach consists of three major steps, viz. (i) feature ranking, (ii) support vector extraction and modification, and (iii) prediction. Feature ranking is employed first, and the top-ranked features are selected for further processing. Support vectors (SVs) are then extracted using a Support Vector Machine (SVM), and the target values of the extracted SVs are replaced with the predictions of trained SVM models, resulting in SV(P) data. In the prediction step, various classifiers are evaluated. The dataset analyzed in this study pertains to churn prediction for bank credit card customers, with only 6.76% of the samples representing churners (customers shifting their loyalty to competitors). Sensitivity has been accorded the highest priority when evaluating the classification algorithms. The soft computing techniques employed in this study yielded better sensitivity when trained on the proposed modified SVs(P) data than when trained on the other training data.
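To illustrate why sensitivity, rather than accuracy, is the priority at a 6.76% churn rate, the following is a minimal sketch in plain Python. The 10,000-customer sample, the label encoding (1 = churner, 0 = loyal) and the function names are assumptions made for illustration; only the 6.76% ratio comes from the abstract, and none of this is the authors' code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def sensitivity(y_true, y_pred, positive=1):
    """Sensitivity (recall on the positive class): TP / (TP + FN)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred, positive=1):
    """Overall fraction of correct predictions."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical sample: 10,000 customers, 676 of them churners (6.76%).
n_total = 10_000
n_churn = 676
y_true = [1] * n_churn + [0] * (n_total - n_churn)

# A degenerate classifier that labels every customer as loyal.
y_all_loyal = [0] * n_total

print(f"accuracy    = {accuracy(y_true, y_all_loyal):.4f}")
print(f"sensitivity = {sensitivity(y_true, y_all_loyal):.4f}")
```

Under these assumptions, the all-loyal classifier is right about every non-churner and so scores high on accuracy, yet it identifies no churners at all, which is the failure that a sensitivity-first evaluation is meant to expose.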

Keywords

Feature ranking, Data modification, Churn prediction, Class imbalance problem, Support Vector Machine.
