Comparative Analysis of Web Scraping Tools for Low-Resource Language Text

Navroz Kaur Kahlon; Williamjeet Singh

doi:https://doi.org/10.14445/22315381/IJETT-V72I1P128

Research Article | Open Access

Volume 72 | Issue 1 | Year 2024 | Article Id. IJETT-V72I1P128 | DOI : https://doi.org/10.14445/22315381/IJETT-V72I1P128

Received	Revised	Accepted	Published
17 Aug 2023	18 Nov 2023	23 Dec 2023	07 Jan 2024

Citation :

Navroz Kaur Kahlon, Williamjeet Singh, "Comparative Analysis of Web Scraping Tools for Low-Resource Language Text," International Journal of Engineering Trends and Technology (IJETT), vol. 72, no. 1, pp. 284-299, 2024. Crossref, https://doi.org/10.14445/22315381/IJETT-V72I1P128

Abstract

Introduction: In recent years, the growth of information on the internet has increased the availability of data in multiple languages. Several web scraping methodologies and tools have been developed; however, scraping text in “low-resource” languages has received little attention. Objective: This paper presents a detailed comparison of several scraping tools applied to different Punjabi-language, text-based websites. Methods: Three Python-based tools and two desktop-based commercial tools are evaluated. The evaluation framework covers performance, ease of use, and reliability, and the tools are compared on parameters such as runtime, memory usage, GitHub metrics, and complexity metrics. Results: While all the tools are popular and viable for scraping web content, the Python-based tools perform better because they are customized to the current structure of the web page. Conclusion: The paper should be useful to readers from both programming and non-programming backgrounds, as the qualities of both types of tools are discussed in detail.
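
As an illustration of the performance parameters named in the abstract (runtime and memory usage), the following is a minimal sketch of how a single scraping step might be timed and its peak memory recorded. It is not the authors' evaluation code: the helper names `extract_paragraphs` and `profile`, the in-memory `SAMPLE_HTML` page, and the choice of the standard-library `time`, `tracemalloc`, and `html.parser` modules are all assumptions made so that the example runs without network access or third-party packages.

```python
import time
import tracemalloc
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a paragraph.
        if self.in_p and data.strip():
            self.chunks.append(data.strip())


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.chunks


def profile(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Assumed sample page with Punjabi (Gurmukhi) text; a real run would fetch a live site.
SAMPLE_HTML = (
    "<html><body><h1>ਖ਼ਬਰਾਂ</h1>"
    "<p>ਪੰਜਾਬੀ ਭਾਸ਼ਾ</p><p>ਸਤ ਸ੍ਰੀ ਅਕਾਲ</p></body></html>"
)

paragraphs, seconds, peak_bytes = profile(extract_paragraphs, SAMPLE_HTML)
print(paragraphs)
print(f"runtime: {seconds:.6f} s, peak memory: {peak_bytes} bytes")
```

Wrapping the measurement in one function means the same harness could be applied to each tool's extraction routine, so that runtime and memory figures are gathered in the same way for every tool being compared.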

Keywords

Desktop tools, Evaluation parameters, Punjabi, Python, Web scraping.
