Research Article | Open Access
Volume 73 | Issue 12 | Year 2025 | Article Id. IJETT-V73I12P103 | DOI : https://doi.org/10.14445/22315381/IJETT-V73I12P103
A Longitudinal Study on the Evolution of YOLO Architectures: From YOLOv1 to YOLOv12
Rajaa Miftah, Abdessamad Belangour, Mostafa Hanoune, Sara Bouraya
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 05 Jul 2025 | 08 Nov 2025 | 17 Nov 2025 | 19 Dec 2025 |
Citation :
Rajaa Miftah, Abdessamad Belangour, Mostafa Hanoune, Sara Bouraya, "A Longitudinal Study on the Evolution of YOLO Architectures: From YOLOv1 to YOLOv12," International Journal of Engineering Trends and Technology (IJETT), vol. 73, no. 12, pp. 24-33, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I12P103
Abstract
Computer vision is a branch of artificial intelligence that enables machines to interpret visual data such as images and videos. Video tagging is a computer vision technique that automatically detects and labels objects, actions, or scenes across successive frames of a video, without human input. This capability supports applications such as surveillance, autonomous driving, content moderation, and large-scale video analysis. Among the available approaches, the YOLO (You Only Look Once) family of real-time object detectors offers a strong trade-off between speed and accuracy, which is essential for effective video annotation. This paper presents a longitudinal analysis of YOLO models from version 1 through version 12, focusing on how their design principles, backbone architectures, and feature fusion strategies have changed over time. Each successive iteration introduced changes aimed at improving detection precision, computational speed, and system stability. The study traces the history of detection techniques from simple grid-based schemes to current anchor-free and reparameterized models. It also examines how architectural innovations support better scalability and deployment across computing environments ranging from edge devices to cloud services. By highlighting these major improvements, the paper shows how YOLO developed and why it plays a central role in contemporary computer vision systems used for real-time video tagging.
Keywords
Computer vision, Video tagging, YOLO, One-stage detectors, Object detection.