Machine Learning Approach by Document Clustering using Probability of Word Occurrences

Aranga Arivarasan; Dr.M.Karthikeyan

doi:https://doi.org/10.14445/22315381/IJETT-V67I7P220

Volume 67 | Issue 7 | Year 2019 | Article Id. IJETT-V67I7P220 | DOI : https://doi.org/10.14445/22315381/IJETT-V67I7P220

Citation:

Aranga Arivarasan, Dr. M. Karthikeyan, "Machine Learning Approach by Document Clustering using Probability of Word Occurrences," International Journal of Engineering Trends and Technology (IJETT), vol. 67, no. 7, pp. 101-106, 2019. Crossref, https://doi.org/10.14445/22315381/IJETT-V67I7P220

Abstract

With the rapid growth of the internet, data science, big data, and data mining, extracting hidden information from documents has become a challenging task. Unlike an image, a text document does not readily reveal the context for which it was written. It is therefore necessary to choose appropriate features and similarity measures in order to categorize a document and extract its information. The proposed work extracts the probability values of occurrences of similar words in a document and uses them to assign the document to its topic. The method also applies an elaborate preprocessing technique for dimensionality reduction and for the removal of unnecessary vectors from the text documents. Three similarity measures are used to evaluate the categorization. The final results show that the Spearman similarity yields a better result, with an accuracy of 95.7.
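
As a rough illustration of the idea described in the abstract, the sketch below turns each document into a vector of word-occurrence probabilities and compares two such vectors with a Spearman rank correlation. The abstract does not give the exact probability definition, the preprocessing steps, or the other two similarity measures, so everything here is an assumption made for illustration: the stop-word list, the vocabulary, and the function names `word_probabilities`, `average_ranks`, `pearson`, and `spearman` are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only (not from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def word_probabilities(text, vocabulary):
    """Relative frequency of each vocabulary word in a document.

    Assumed preprocessing: lowercase, split on whitespace, keep purely
    alphabetic tokens, drop stop words.
    """
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman similarity: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    vocab = ["cat", "mat", "dog", "bone"]  # hypothetical vocabulary
    p1 = word_probabilities("the cat sat on the mat cat", vocab)
    p2 = word_probabilities("a dog and a bone and a dog", vocab)
    print(p1, p2, spearman(p1, p2))
```

In a clustering setting such as K-Means (listed in the keywords), a similarity like this would be used to decide which cluster centre a document vector is closest to; how the paper combines the two is not stated in the abstract.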

Keywords

Probability, Similarity Metrics, Pre-processing, Clustering, K-Means
