From Data to Defense: Leveraging PhishTank and Multi Source Hybrid Datasets for Intelligent Website-Level Phishing Protection

Jose C. Agoylo Jr.; Patrick D. Cerna

DOI: https://doi.org/10.14445/22315381/IJETT-V74I1P106

Research Article | Open Access

Volume 74 | Issue 1 | Year 2026 | Article ID IJETT-V74I1P106

Received: 18 Oct 2025 | Revised: 17 Dec 2025 | Accepted: 25 Dec 2025 | Published: 14 Jan 2026

Citation:

Jose C. Agoylo Jr., Patrick D. Cerna, "From Data to Defense: Leveraging PhishTank and Multi Source Hybrid Datasets for Intelligent Website-Level Phishing Protection," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 1, pp. 85-94, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I1P106

Abstract

Phishing remains one of the most common and harmful cybercrimes: attackers use forged web portals to lure users into disclosing sensitive information such as authentication credentials and financial details. Traditional blacklist-based defenses have proven insufficient over time, as attackers move to new domains or create new URLs to evade detection systems. This study therefore proposes an intelligent detection approach that combines datasets from PhishTank and Kaggle into a hybrid dataset to improve robustness and generalizability. Initially, 590,280 distinct URLs were collected, of which 159,289 were phishing and 430,991 were legitimate, and passed through a strict preprocessing pipeline. After cleaning and stratified balancing, a set of 100,000 URLs, half phishing and half legitimate, was used to train the models. Three standard machine learning algorithms were evaluated: Logistic Regression, Random Forest, and XGBoost. Their outputs were compared using standard measures: accuracy, precision, recall, F1-score, and ROC-AUC. All three classifiers showed strong detection performance. Specifically, Logistic Regression reached an accuracy of 91.7 percent, Random Forest 90.1 percent, and XGBoost the highest at 92.7 percent. XGBoost outperformed the other models on every evaluation metric, achieving an ROC-AUC of 0.982 and a significantly lower false-negative rate, which is a key property for catching previously unseen phishing attacks.
Although Logistic Regression was slightly less accurate than XGBoost, it offered better computational efficiency and faster inference, which makes it a good choice for lightweight, real-time deployments such as browser extensions. Random Forest behaved more predictably but had relatively lower precision and recall, which limits its use in time-sensitive detection. The findings highlight the critical role of hybrid datasets in phishing defense and show that machine learning frameworks offer a scalable, viable way to protect users intelligently at the website level.
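
As an illustration only, the following minimal Python sketch shows two steps named in the abstract: balancing a labeled URL set to equal class sizes, and computing accuracy, precision, recall, F1-score, and false-negative rate from binary predictions. It is not the authors' code. The function names `balance_by_class` and `binary_metrics`, the label convention (1 = phishing, 0 = legitimate), and the toy data are assumptions made for this example; ROC-AUC is omitted because it requires predicted scores rather than hard labels.

```python
import random


def balance_by_class(rows, per_class, seed=0):
    """Downsample every class to `per_class` rows.

    `rows` is a list of (url, label) pairs. Hypothetical helper,
    not taken from the paper.
    """
    rng = random.Random(seed)
    by_label = {}
    for url, label in rows:
        by_label.setdefault(label, []).append((url, label))

    balanced = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < per_class:
            raise ValueError(f"class {label} has only {len(items)} rows")
        # Sample without replacement so no URL is duplicated.
        balanced.extend(rng.sample(items, per_class))
    rng.shuffle(balanced)
    return balanced


def binary_metrics(y_true, y_pred):
    """Compute standard metrics, treating label 1 (phishing) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # False-negative rate: share of phishing URLs that were missed.
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate": fnr,
    }


if __name__ == "__main__":
    # Toy data: six legitimate and three phishing URLs (made up).
    toy = [(f"https://ok{i}.example", 0) for i in range(6)]
    toy += [(f"http://bad{i}.example/login", 1) for i in range(3)]
    print(balance_by_class(toy, per_class=3))
    print(binary_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 0, 0, 0, 1]))
```

With the figures reported in the abstract, the same idea would correspond to `per_class=50000` applied to the cleaned portion of the 159,289 phishing and 430,991 legitimate URLs. How the authors performed the stratified balancing in practice is not stated here.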

Keywords

Hybrid dataset, Machine learning, Phishing detection, URL classification, XGBoost.
