Improve Data Text Quality by Applying Text Pre-Processing Method (Case Study)

© 2023 by IJETT Journal
Volume-71 Issue-1
Year of Publication : 2023
Author : Rizky Dwi Novyantika, Sani Muhamad Isa
DOI : 10.14445/22315381/IJETT-V71I1P209

How to Cite?

Rizky Dwi Novyantika, Sani Muhamad Isa, "Improve Data Text Quality by Applying Text Pre-Processing Method (Case Study)," International Journal of Engineering Trends and Technology, vol. 71, no. 1, pp. 94-108, 2023.

To grow a business, especially a startup, several important aspects must be considered. Several articles and journals identify location as one of these aspects. PT. MDS is a startup that offers a Point of Sales (POS) system to MSMEs. To use this feature, MSMEs must register by filling in personal data, including their business location, such as Province and City. At this step, the data is still typed manually, so the data entered into the database is unstructured. Therefore, this study aims to improve data quality using three methods: pre-processing, Data Correction with the Cosine Similarity and Jaro-Winkler Distance algorithms, and Data Integration to complete the missing data. Pre-processing alone can improve about 81.50% of Province data and 92.31% of City data. The Cosine Similarity algorithm is quite good at capturing and matching data at the word level, while Jaro-Winkler Distance is quite good at the string level. The Jaro-Winkler Distance algorithm is easier to implement than Cosine Similarity because Cosine Similarity requires converting the data into a matrix before the algorithm can be applied. This study shows that combining the three methods can improve the quality of Province and City data by up to 99.36% and 97.99%, respectively. The Data Integration process alone successfully completes up to 97.38% of the missing data.
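The contrast between the two similarity measures can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the word-count vectorization, the 0.85 threshold, and the sample reference list are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity on word-count vectors (assumed vectorization)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: counts matching characters within a sliding window."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(m1, m2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def correct(value: str, references: list[str], threshold: float = 0.85):
    """Return the closest reference name, or None if nothing is close enough.

    The threshold is a hypothetical value, not one taken from the study.
    """
    cleaned = " ".join(value.lower().split())  # simple pre-processing step
    best = max(references, key=lambda r: jaro_winkler(cleaned, r.lower()))
    return best if jaro_winkler(cleaned, best.lower()) >= threshold else None


# Illustrative reference list (sample values, not the study's data).
PROVINCES = ["Jawa Barat", "Jawa Tengah", "Jawa Timur"]
print(correct("  jawa  barta ", PROVINCES))
print(cosine_similarity("barat jawa", "jawa barat"))
```

In this sketch, cosine similarity treats "barat jawa" and "jawa barat" as identical because it only compares word counts, while Jaro-Winkler tolerates character-level typos such as "barta" for "barat". Note that the cosine function first has to turn each string into a vector, which mirrors the extra conversion step the abstract mentions.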

Data quality, Data pre-processing, Cosine similarity, Jaro-Winkler distance, MSMEs.

