Improve Data Text Quality by Applying Text Pre-Processing Method (Case Study)

© 2023 by IJETT Journal
Volume-71 Issue-1
Year of Publication : 2023
Author : Rizky Dwi Novyantika, Sani Muhamad Isa
DOI : 10.14445/22315381/IJETT-V71I1P209

How to Cite?

Rizky Dwi Novyantika, Sani Muhamad Isa, "Improve Data Text Quality by Applying Text Pre-Processing Method (Case Study)," International Journal of Engineering Trends and Technology, vol. 71, no. 1, pp. 94-108, 2023. Crossref, https://doi.org/10.14445/22315381/IJETT-V71I1P209

Abstract
Developing a business, especially a startup, requires attention to several important aspects. Several articles and journals identify location as one of them. PT. MDS is a startup that offers a Point of Sales (POS) system to MSMEs. To use this system, MSMEs must register by filling in personal data, including their business location (Province and City). This data is still typed manually, so the values stored in the database are unstructured. This study therefore aims to improve data quality by combining three methods: text pre-processing, Data Correction with the Cosine Similarity and Jaro-Winkler Distance algorithms, and Data Integration to complete missing data. Pre-processing alone improves about 81.50% of the Province data and 92.31% of the City data. The Cosine Similarity algorithm is fairly good at capturing and matching data at the word level, while Jaro-Winkler Distance is fairly good at the string level. Jaro-Winkler Distance is also easier to implement, because Cosine Similarity requires converting the data into a matrix before the algorithm can be applied. The study shows that combining the three methods improves the quality of Province and City data by up to 99.36% and 97.99%, respectively. The data integration process completes up to 97.38% of the missing data.
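The two correction algorithms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`jaro_winkler`, `cosine_similarity`, `best_match`), the three-entry `PROVINCES` list, and the misspelled input are all hypothetical, and word-count vectors stand in for the matrix conversion that the abstract mentions for Cosine Similarity. The Jaro-Winkler part uses the common defaults of a prefix scale of 0.1 and a prefix length capped at 4.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity over word-count vectors (word-level matching)."""
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


def best_match(value, candidates, scorer):
    """Return the (candidate, score) pair with the highest score."""
    return max(((c, scorer(value, c)) for c in candidates), key=lambda p: p[1])


# Hypothetical reference list and misspelled input, for illustration only.
PROVINCES = ["jawa barat", "jawa tengah", "jawa timur"]

if __name__ == "__main__":
    typed = "jawa barta"
    print(best_match(typed, PROVINCES, jaro_winkler))
    print([(p, cosine_similarity(typed, p)) for p in PROVINCES])
```

In this toy case the character-level Jaro-Winkler score singles out "jawa barat" for the misspelled input, while the word-level cosine score is the same for all three candidates because only the word "jawa" is shared. This is consistent with the abstract's distinction between string-level and word-level matching, though real data may behave differently.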

Keywords
Data quality, Data pre-processing, Cosine similarity, Jaro-Winkler distance, MSMEs.
