A Quality Improvement Framework using Conjoint Analysis with Minimum Redundancy Maximum Relevance for Big Data

© 2025 by IJETT Journal
Volume-73 Issue-7
Year of Publication : 2025
Author : Sindhu S, Veni S
DOI : 10.14445/22315381/IJETT-V73I7P132

How to Cite?
Sindhu S, Veni S, "A Quality Improvement Framework using Conjoint Analysis with Minimum Redundancy Maximum Relevance for Big Data," International Journal of Engineering Trends and Technology, vol. 73, no. 7, pp. 423-442, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I7P132

Abstract
In the digital era, the volume of data grows every second as people use Internet-connected devices for their everyday activities. Big data plays a substantial role in information retrieval and helps in predicting future trends. However, because big data is heterogeneous and very large, processing and analysing it to extract useful insights is a challenging task. The quality criteria selected for processing determine the quality of the outcomes. Conventional data mining-based preprocessing techniques alone may not be effective for big data, so a suitable feature selection model is required to enhance quality. This paper presents a framework for selecting an important feature subset that represents the entire dataset with increased quality. The model uses conjoint analysis with a minimum redundancy maximum relevance (mRMR) algorithm to select significant attributes and a q-gram-based filtering approach to remove redundant and irrelevant instances. According to the analysis, the proposed model improves data quality and yields superior outcomes using fewer variables and instances. Running on the Spark framework, the model produces better outcomes than other big data models already in use, with a maximum speed-up rate of 89.50 and a maximum accuracy improvement of 34.72%.
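
The two building blocks named in the abstract, mRMR attribute selection and q-gram-based instance filtering, can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation and it omits the conjoint analysis step. The function names, the difference form of the mRMR score, the Jaccard similarity measure, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_select(features, target, k):
    """Greedy mRMR, difference form (assumed): relevance minus mean redundancy.

    `features` maps a column name to a list of discrete values.
    Ties are broken by the insertion order of `features`.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def qgrams(text, q=2):
    """Set of lower-cased, '#'-padded q-grams of a string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(records, q=2, threshold=0.7):
    """Keep a record only if its q-gram similarity to every kept record is below threshold."""
    kept, kept_grams = [], []
    for rec in records:
        grams = qgrams(rec, q)
        if all(jaccard(grams, other) < threshold for other in kept_grams):
            kept.append(rec)
            kept_grams.append(grams)
    return kept
```

For example, given columns `a`, `a_copy` (identical to `a`), and an independent column `b`, `mrmr_select` with `k=2` skips `a_copy` because its redundancy with `a` cancels its relevance. Likewise, `drop_near_duplicates` treats "Jon Smith" as a near-duplicate of "John Smith" at the assumed 0.7 threshold. The pairwise comparison is quadratic in the number of records, so a distributed setting would need blocking or indexing.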

Keywords
Big data, Conjoint analysis, Data quality, Dimensionality reduction, Feature selection.
