Research Article | Open Access
Volume 74 | Issue 1 | Year 2026 | Article Id. IJETT-V74I1P106 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I1P106
From Data to Defense: Leveraging PhishTank and Multi Source Hybrid Datasets for Intelligent Website-Level Phishing Protection
Jose C. Agoylo Jr., Patrick D. Cerna
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 18 Oct 2025 | 17 Dec 2025 | 25 Dec 2025 | 14 Jan 2026 |
Citation :
Jose C. Agoylo Jr., Patrick D. Cerna, "From Data to Defense: Leveraging PhishTank and Multi Source Hybrid Datasets for Intelligent Website-Level Phishing Protection," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 1, pp. 85-94, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I1P106
Abstract
Phishing remains one of the most common and harmful cybercrimes: attackers use forged web portals to lure users into sharing sensitive information such as authentication credentials and financial details. Traditional blacklist-based defenses have proven insufficient over time, as attackers move to new domains or create new URLs to evade detection systems. This study proposes an intelligent detection approach built on a hybrid dataset that combines PhishTank and Kaggle sources to improve robustness and generalizability. Initially, 590,280 unique URLs were collected, of which 159,289 were phishing and 430,991 were legitimate, and passed through a strict preprocessing pipeline. After cleaning and stratified balancing, a filtered set of 100,000 URLs, half phishing and half legitimate, was used to train the models. Three canonical machine learning algorithms were used: Logistic Regression, Random Forest, and XGBoost. Their performance was compared using standard measures: accuracy, precision, recall, F1-score, and ROC-AUC. All three classifiers detected phishing effectively. Specifically, Logistic Regression reached an accuracy of 91.7 per cent, Random Forest 90.1 per cent, and XGBoost the highest at 92.7 per cent. XGBoost outperformed the other models on every evaluation metric, scoring an ROC-AUC of 0.982 and markedly lowering the false-negative rate, a key property for catching previously unseen phishing attacks.
Although Logistic Regression was slightly less accurate, it offered better computational efficiency and fast inference, making it a good choice for lightweight, real-time deployment such as browser extensions. Random Forest behaved more predictably but had relatively lower precision and recall, which limits its use in time-sensitive detection. The findings highlight the critical role of hybrid datasets in phishing defense and indicate that machine-learning frameworks offer a scalable, viable way to protect users intelligently at the website level.
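The abstract names a class-balancing step and five evaluation measures but gives no implementation details. The following is a minimal Python sketch of how such steps could look, not the authors' code. The function names (`balanced_sample`, `binary_metrics`, `roc_auc`), the label convention (1 = phishing, 0 = legitimate), and the toy URLs and scores are all assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_sample(records, per_class, seed=0):
    """Draw `per_class` URLs per label, without replacement.

    `records` is a list of (url, label) pairs, label 1 = phishing and
    0 = legitimate (an assumed convention). Duplicate URLs are dropped.
    """
    rng = random.Random(seed)
    seen = set()
    by_label = {0: [], 1: []}
    for url, label in records:
        if url not in seen:
            seen.add(url)
            by_label[label].append(url)
    sample = []
    for label, urls in by_label.items():
        if len(urls) < per_class:
            raise ValueError(f"label {label} has only {len(urls)} URLs")
        sample.extend((u, label) for u in rng.sample(urls, per_class))
    rng.shuffle(sample)
    return sample


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and false-negative rate (phishing = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate": fn / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, scores):
    """ROC-AUC as the chance a phishing URL outscores a legitimate one.

    Ties count as half a win. This pairwise form is quadratic in the
    number of samples, so it suits small checks, not 100,000 URLs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: these URLs, labels and scores are invented for illustration.
toy = [(f"http://phish{i}.example/login", 1) for i in range(4)]
toy += [(f"http://site{i}.example/", 0) for i in range(9)]
train = balanced_sample(toy, per_class=3, seed=42)
print(Counter(label for _, label in train))
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The false-negative rate is reported separately because the abstract singles it out: a missed phishing URL reaches the user, whereas a false positive only blocks a legitimate page.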
Keywords
Hybrid dataset, Machine learning, Phishing detection, URL classification, XGBoost.