FSM-MR++ Enhanced MapReduce-Based Framework for Scalable Frequent Subgraph Mining in Large-Scale Graph Datasets

FSM-MR++ Enhanced MapReduce-Based Framework for Scalable Frequent Subgraph Mining in Large-Scale Graph Datasets

  IJETT-book-cover           
  
© 2025 by IJETT Journal
Volume-73 Issue-8
Year of Publication : 2025
Author : Naga Mallik Atcha, Jagannadha Rao D B, Vijayakumar Polepally
DOI : 10.14445/22315381/IJETT-V73I8P113

How to Cite?
Naga Mallik Atcha, Jagannadha Rao D B, Vijayakumar Polepally,"FSM-MR++ Enhanced MapReduce-Based Framework for Scalable Frequent Subgraph Mining in Large-Scale Graph Datasets", International Journal of Engineering Trends and Technology, vol. 73, no. 8, pp.147-169, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I8P113

Abstract
Frequent Subgraph Mining (FSM) is a key technique for identifying recurring structural patterns in large graph datasets. It has a broad spectrum of applications ranging from bioinformatics and cheminformatics to social network analysis. Nonetheless, state-of-the-art FSM algorithms, such as gSpan, still suffer from a scalability problem stemming from the exponential candidate generation and the in-memory constraint. Although existing distributed methods (such as FSM-MR) are MapReduce-friendly, they suffer from static reducer assignment, redundant subgraph recomputation, and inefficiency in pruning. Such constraints significantly influence the run-time efficiency, quality of patterns, and scalability on large-scale real-world big data. To address these challenges, an enhanced MapReduce-based pattern searching framework, FSM-MR++, is proposed for efficient and adaptive frequent subgraph discovery. The framework incorporates three innovations: a hybrid caching technique to prune repetitive subgraph generation, a dynamic reducer configuration scheme to prevent skewed task distribution, and a density-aware pruning strategy to abandon unpromising candidates early during mining. These primitives are combined with canonical Labeling and cost-aware partitioning to optimize parallelism and convergence. A quantitative evaluation is conducted to assess its performance on both synthetic and real datasets. It was observed that FSM-MR++ runs up to 30% faster than FSM-MR and is up to half the run-time of centralized gSpan, while still enabling high-quality, interpretable patterns. We conduct a series of ablation studies to validate the improvements introduced by each proposed enhancement, and scalability tests are used to evaluate the framework against growing data and cluster sizes. The framework’s efficacy in discovering target-specific substructures in a domain and in an interpretable manner is demonstrated using real-world, PubChemBioAssay, and DBLP datasets. In general, FSM-MR++ fills the gaps of current FSM methods, providing a scalable, efficient, and flexible solution for frequent subgraph mining in a distributed big data setting.

Keywords
- Frequent Subgraph Mining, MapReduce framework, Distributed graph Mining, hybrid caching, Dynamic reducer scaling.

References
[1] Da Yan et al., “PrefixFPM: A Parallel Framework for General-Purpose Mining of Frequent and Closed Patterns,” The VLDB Journal, vol. 31, no. 2, pp. 253-286, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[2] Da Yan et al., “Systems for Scalable Graph Analytics and Machine Learning: Trends and Methods,” KDD ‘24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, pp. 6627-6632, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[3] Maciej Besta, and Torsten Hoefler, “Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2584-2606, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[4] Walid Megherbi, Mohammed Haddad, and Hamida Seba, “DeepDense: Enabling Node Embedding for Dense Subgraph Mining,” Expert Systems with Applications, vol. 238, pp. 1-37, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[5] Mingji Yang et al., “Efficient Algorithms for Personalized PageRank Computation: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 9, pp. 4582-4602, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[6] Hans Vandierendonck, “Differentiating Set Intersections in Maximal Clique Enumeration by Function and Subproblem Size,” ICS ‘24: Proceedings of the 38th ACM International Conference on Supercomputing, Kyoto, Japan, pp. 150-163, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[7] Weizheng Lu et al., “Xorbits: Automating Operator Tiling for Distributed Data Science,” 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, Netherlands, pp. 5211-5223, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Chuchu Gao et al., “CSM-TopK: Continuous Subgraph Matching with TopK Density Constraints,” 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, Netherlands, pp. 3084-3097, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[9] Sivakumar Ponnusamy, and Pankaj Gupta, “Scalable Data Partitioning Techniques for Distributed Data Processing in Cloud Environments: A Review,” IEEE Access, vol. 12, pp. 26735-26746, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[10] Kai Yao, Lijun Chang, and Jeffrey Xu Yu, “Identifying Similar Bicliques in Bipartite Graphs,” The VLDB Journal, vol. 33, no. 3, pp. 703-726, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[11] Ziqiang Yu et al., “A Distributed Solution for Efficient K Shortest Paths Computation Over Dynamic Road Networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 2759-2773, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[12] Bo Wei et al., “Construct and Query A Fine-Grained Geospatial Knowledge Graph,” Data Science and Engineering, vol. 9, no. 2, pp. 152-176, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[13] Qian Zhang, “Big Health Data for Elderly Employees Job Performance of Soes: Visionary and Enticing Challenges,” Multimedia Tools and Applications, vol. 83, no. 2, pp. 4409-4442, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[14] Long Yuan et al., “Batch Hop-Constrained S-t Simple Path Query Processing in Large Graphs,” 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, Netherlands, pp. 2557-2569, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[15] Nabanita Das et al., “Integrating Sentiment Analysis with Graph Neural Networks for Enhanced Stock Prediction: A Comprehensive Survey,” Decision Analytics Journal, vol. 10, pp. 1-21, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[16] Quanquan C. Liu, and C. Seshadhri, “Brief Announcement: Improved Massively Parallel Triangle Counting in O (1) Rounds,” PODC '24: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, Nantes, France, pp. 519-522, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[17] Johannes Erbel, and Jens Grabowskim, “Scientific Workflow Execution in the Cloud using a Dynamic Runtime Model,” Software and Systems Modeling, vol. 23, no. 1, pp. 163-193, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[18] Mahnoor Imran Shafi et al., “A Review of Approaches for Rapid Data Clustering: Challenges, Opportunities and Future Directions,” IEEE Access, vol. 12, pp. 138086-138120, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[19] Devendra Dahiphale, “Mapreduce for Graphs Processing: New Big Data Algorithm for 2-Edge Connected Components and Future Ideas,” IEEE Access, vol. 11, pp. 54986-55001, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[20] Shubhangi Chaturvedi, Sri Khetwat Saritha, and Animesh Chaturvedi, “Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics,” 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), Bangalore, India, pp. 168-175, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[21] Yanyan Song et al., “Optimizing Subgraph Matching Over Distributed Knowledge Graphs Using Partial Evaluation,” World Wide Web, vol. 26, no. 2, pp. 751-771, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[22] Changxi Ma, Mingxi Zhao, and Yongpeng Zhao, “An Overview of Hadoop Applications in Transportation Big Data,” Journal of Traffic and Transportation Engineering (English Edition), vol. 10, no. 5, pp. 900-917, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[23] Bo Yan et al., “Graph Mining for Cybersecurity: A Survey,” ACM Transactions on Knowledge Discovery from Data, vol. 18, no. 2, pp. 1-50, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[24] Makhan Kumbhkar et al., “Dimensional Reduction Method based on Big Data Techniques for Large Scale Data,” 2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, pp. 1-7, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[25] Harif Asmaa, Namir Abdelwahid, and Marzak Abdelaziz, “Approach to Reduce the Communication Cost When Partitioning a Big Graph,” Procedia Computer Science, vol. 220, pp. 1051-1056, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[26] Ziwei Mo et al., “Distributed Truss Computation in Dynamic Graphs,” Tsinghua Science and Technology, vol. 28, no. 5, pp. 873-887, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[27] A. Srinivas Reddy et al., “Mining Subgraph Coverage Patterns from Graph Transactions,” International Journal of Data Science and Analytics, vol. 13, no. 2, pp. 105-121, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[28] Xiaozhou Liu et al., “Practical Survey on MapReduce Subgraph Enumeration Algorithms,” International Conference on Emerging Internetworking, Data & Web Technologies, Okayama, Japan, vol. 1, pp. 430-444, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[29] Andrea Pasini et al., “Semantic Image Collection Summarization with Frequent Subgraph Mining,” IEEE Access, vol. 10, pp. 131747-131764, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[30] Tewodros Alemu Ayall et al., “Graph Computing Systems and Partitioning Techniques: A Survey,” IEEE Access, vol. 10, pp. 118523-118550, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[31] Hanlin Zhang et al., “An Efficient Vertex-Driven Temporal Graph Model and Subgraph Clustering Method,” IEEE Access, vol. 10, pp. 100627-100645, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[32] Chenhao Ma et al., “Efficient Directed Densest Subgraph Discovery,” ACM SIGMOD Record, vol. 50, no. 1, pp. 33-40, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[33] Xin Wang, Yang Xu, and Huayi Zhan, “Extending Association Rules with Graph Patterns,” Expert Systems with Applications, vol. 141, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[34] Guanqi Hua et al., “D-colSimulation: A Distributed Approach for Frequent Graph Pattern Mining based on colSimulation in a Single Large Graph,” 2020 IEEE International Conference on Services Computing (SCC), Beijing, China, pp. 76-93, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[35] Carlos Eiras-Franco et al., “Fast Distributed kNN Graph Construction Using Auto-tuned Locality-sensitive Hashing,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 11, no. 6, pp. 1-18, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[36] Wentian Guo, Yuchen Li, and Kian-Lee Tan, “Exploiting Reuse for GPU Subgraph Enumeration,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 9, pp. 4231-4244, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[37] Shixuan Sun, and Qiong Luo, “In-Memory Subgraph Matching: An In-depth Study,” SIGMOD ‘20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland OR, USA, pp. 1083-1098, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[38] Noujan Pashanasangi, and C. Seshadhri, “Efficiently Counting Vertex Orbits of All 5-Vertex Subgraphs, by Evoke,” WSDM ‘20: Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, pp. 447-455, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[39] Ye Yuan et al., “Efficient Graph Query Processing over Geo-Distributed Datacenters,” SIGIR ‘20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, pp. 619-628, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[40] B.S.A.S. Rajita et al., “Spark-Based Parallel Method for Prediction of Events,” Arabian Journal for Science and Engineering, vol. 45, no. 4, pp. 3437-3453, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[41] Naga Mallik Atcha, Jagannadha Rao D B, and Vijayakumar Polepally, “Optimized Frequent Subgraph Mining Using Iterative MapReduce for Enhanced Scalability and Performance,” Journal of Theoretical and Applied Information Technology, vol. 103, no. 5, pp. 1757-1780, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[42] Yanli Wang et al., “PubChem’s BioAssay Database,” Nucleic Acids Research, vol. 40, no. D1, pp. D400-D412, 2011.
[CrossRef] [Google Scholar] [Publisher Link]
[43] Michael Ley, “The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives,” International Symposium on String Processing and Information Retrieval, Lisbon, Portugal, pp. 1-10, 2002.
[CrossRef] [Google Scholar] [Publisher Link]
[44] Chris Stark et al., “BioGRID: A General Repository for Interaction Datasets,” Nucleic Acids Research, vol. 34, pp. D535-D539, 2006.
[CrossRef] [Google Scholar] [Publisher Link]