A Novel Hybrid Features with Ensemble and Data Augmentation for Efficient and Resilient Malware Variant Detection

Azaabi Cletus; Alex Akwasi Opoku; Benjamin Asubam Weyori

doi:https://doi.org/10.14445/22315381/IJETT-V71I8P238

Research Article | Open Access

Volume 71 | Issue 8 | Year 2023 | Article Id. IJETT-V71I8P238

Received: 05 Apr 2023 | Revised: 16 May 2023 | Accepted: 14 Jul 2023 | Published: 15 Aug 2023

Citation

Azaabi Cletus, Alex Akwasi Opoku, Benjamin Asubam Weyori, "A Novel Hybrid Features with Ensemble and Data Augmentation for Efficient and Resilient Malware Variant Detection," International Journal of Engineering Trends and Technology (IJETT), vol. 71, no. 8, pp. 439-457, 2023. https://doi.org/10.14445/22315381/IJETT-V71I8P238

Abstract

Replacing signature-based detection systems with Machine Learning (ML) solutions has been widely explored and is now well established. However, several challenges remain: poor features for efficient classification, malware obfuscation, class imbalance that leads to the accuracy paradox, and reliance on conventional ML algorithms. This paper proposes a novel hybrid feature set combined with ensemble algorithms and a data augmentation technique for efficiently detecting obfuscated malware. An imbalanced malware dataset (11,678 malware and 3,963 benign samples) was obtained from VirusTotal.com and preprocessed. Features were obtained through dynamic disassembly of the malware dataset. We extracted only fine-grained API (application programming interface) call features and DLL (dynamic link library) features, using the IDA Pro and Volatility tools, respectively. We combined these into an integrated hybrid feature set and used it to train Random Forest (RF), Gradient Boosting (GB), and eXtreme Gradient Boosting (XGB) ensembles. Because the classes are imbalanced, we applied Adaptive Synthetic Sampling (ADASYN) to rebalance the dataset and improve performance. To address the accuracy paradox, we evaluated the models before and after applying ADASYN. Similarly, we tested the resilience of the models against malware obfuscation by measuring performance before and after obfuscating the malware dataset. The results show that ADASYN slightly reduced accuracy: RF from 99.94% to 99.86%, GB from 99.89% to 99.81%, and XGB from 99.95% to 99.87%. However, F1-score and AUC improved. F1-score rose from 83% to 94% for RF, from 72% to 83% for GB, and from 85% to 96% for XGB; AUC rose from 86.36% to 95.56% for RF, from 84.82% to 94.02% for GB, and from 89.39% to 98.59% for XGB. As for resilience, accuracy, F1-score, and AUC remained the same before and after malware obfuscation.
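The accuracy paradox mentioned above can be illustrated with the dataset's own class counts. The following is a minimal sketch, not the authors' evaluation code: the always-predict-malware baseline and the helper `binary_metrics` are hypothetical, introduced only to show why accuracy alone is misleading on imbalanced data and why F1-score is reported alongside it.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Class counts taken from the abstract; malware is the majority class.
N_MALWARE, N_BENIGN = 11_678, 3_963
y_true = ["malware"] * N_MALWARE + ["benign"] * N_BENIGN

# Hypothetical baseline (an assumption for illustration): always predict the majority class.
y_pred = ["malware"] * len(y_true)

overall = binary_metrics(y_true, y_pred, positive="malware")
minority = binary_metrics(y_true, y_pred, positive="benign")

print(f"accuracy          = {overall['accuracy']:.4f}")   # about 0.7466
print(f"benign recall     = {minority['recall']:.4f}")    # 0.0000
print(f"benign F1-score   = {minority['f1']:.4f}")        # 0.0000
```

A model that never recognises a benign sample still scores roughly 75% accuracy here, while its minority-class recall and F1-score are zero. This is why a small drop in accuracy after rebalancing can coexist with a large gain in F1-score.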
We concluded that the approach improved classification performance, as measured by F1-score and AUC, and demonstrated resilience against malware obfuscation. This result implies that, given the current exponential growth in malware volume, variety, and complexity, the proposed fine-grained hybrid features combined with ensemble techniques and ADASYN can improve malware classification and resilience against obfuscation. The approach therefore shows considerable potential for malware classification in general and obfuscated malware detection in particular.
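For readers unfamiliar with ADASYN, the rebalancing step can be summarised by the standard formulation of the algorithm. The block below is a sketch under stated assumptions: the abstract does not report the balance level or whether resampling was applied to the full dataset or only to a training split, so the worked figure assumes a balance level of 1 applied to the full class counts.

```latex
% m_l: majority-class count, m_s: minority-class count,
% \beta \in [0, 1]: desired balance level after resampling.
\begin{align}
  G &= (m_l - m_s)\,\beta
      && \text{total synthetic minority samples to generate} \\
  r_i &= \frac{\Delta_i}{K}
      && \Delta_i = \text{majority samples among the } K \text{ nearest neighbours of } x_i \\
  \hat{r}_i &= \frac{r_i}{\sum_{j=1}^{m_s} r_j}
      && \text{normalised difficulty of minority sample } x_i \\
  g_i &= \hat{r}_i\, G
      && \text{synthetic samples allotted to } x_i \\
  s &= x_i + \lambda\,(x_{zi} - x_i), \quad \lambda \in [0, 1]
      && x_{zi} = \text{random minority neighbour of } x_i
\end{align}

% Worked figure with the abstract's class counts, ASSUMING \beta = 1
% and resampling of the full dataset (neither is stated in the abstract):
\[
  G = (11678 - 3963) \times 1 = 7715
\]
```

Under those assumptions, about 7,715 synthetic benign samples would be generated, concentrated around the benign samples whose neighbourhoods are dominated by malware.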

Keywords

Data augmentation, Ensemble, Features, Hybrid features, Malware, Machine learning, Polymorphic malware, Signature-based detection.
