Machine Learning Approach by Document Clustering using Probability of Word Occurrences

Volume-67 Issue-7
Year of Publication : 2019
Authors : Aranga Arivarasan, Dr. M. Karthikeyan
DOI : 10.14445/22315381/IJETT-V67I7P220


Nowadays, with the rapid growth of the internet, data science, big data, and data mining, extracting hidden information from documents has become a challenging task. Unlike an image, a text document does not readily convey the context in which it was written. It is therefore necessary to choose appropriate features and similarity measures in order to categorize a document and extract its information. In the proposed work, the probability values of occurrences of similar words in a document are extracted to categorize the document by topic. The method also uses an elaborate preprocessing technique for dimensionality reduction and for removing unnecessary vectors from the text documents. Three similarity measures are used to evaluate the categorization. The final results show that the Spearman similarity yields a better result, with an accuracy of 95.7.
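
As a rough illustration of the idea summarized above, the following Python sketch turns each document into a table of word-occurrence probabilities and compares two documents with a Spearman rank correlation. It is not the authors' implementation: the tokenizer, the tiny stop-word list, and the function names word_probabilities and spearman_similarity are all assumptions made for this example, and the clustering step (K-Means) is not shown.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def word_probabilities(text):
    """Map each remaining word to its relative frequency in the text."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman_similarity(p, q):
    """Spearman correlation of two probability tables over their joint vocabulary."""
    vocab = sorted(set(p) | set(q))
    if not vocab:
        return 0.0
    x = [p.get(w, 0.0) for w in vocab]
    y = [q.get(w, 0.0) for w in vocab]
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    sports = word_probabilities("The team scored a goal and another goal in the match.")
    finance = word_probabilities("The bank raised the loan rate and the market fell.")
    print(spearman_similarity(sports, sports))
    print(spearman_similarity(sports, finance))
```

Spearman correlation is computed here as the Pearson correlation of tie-averaged ranks, so documents whose words have the same probability ordering score close to 1 even when the raw probabilities differ.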

Probability, Similarity Metrics, Pre-processing, Clustering, K-Means