International Journal of Engineering Trends and Technology

Research Article | Open Access
Volume 73 | Issue 12 | Year 2025 | Article Id. IJETT-V73I12P103 | DOI: https://doi.org/10.14445/22315381/IJETT-V73I12P103

A Longitudinal Study on the Evolution of YOLO Architectures: From YOLOv1 to YOLOv12


Rajaa Miftah, Abdessamad Belangour, Mostafa Hanoune, Sara Bouraya

Received: 05 Jul 2025 | Revised: 08 Nov 2025 | Accepted: 17 Nov 2025 | Published: 19 Dec 2025

Citation:

Rajaa Miftah, Abdessamad Belangour, Mostafa Hanoune, Sara Bouraya, "A Longitudinal Study on the Evolution of YOLO Architectures: From YOLOv1 to YOLOv12," International Journal of Engineering Trends and Technology (IJETT), vol. 73, no. 12, pp. 24-33, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I12P103

Abstract

Computer vision is a branch of artificial intelligence that enables machines to interpret visual data such as images and videos. Video tagging is a computer vision technique that detects and labels objects, actions, or scenes across successive frames of a video without human input. This capability supports applications such as surveillance, autonomous driving, content moderation, and large-scale video analysis. Among the available approaches, the YOLO (You Only Look Once) family of real-time object detectors offers a strong trade-off between speed and accuracy, which is essential for effective video annotation. This paper presents a longitudinal analysis of YOLO models from version 1 to version 12, focusing on how their design principles, backbone architectures, and feature fusion strategies have changed over time. Successive iterations introduced improvements in detection precision, computational speed, and system stability. The study traces the history of detection techniques from simple grid-based schemes to current anchor-free and reparameterized models. It also examines how architectural innovations improve scalability and deployment across computing environments ranging from edge devices to cloud services. By highlighting these major improvements, the paper shows how YOLO developed and why it holds its role in contemporary computer vision systems used for real-time video tagging.

Keywords

Computer vision, Video tagging, YOLO, One-stage detectors, Object detection.
