Soft Computing based Duplicate Text Identification in Online Community Websites

Basavesha D; Dr. Y S Nijagunarya

doi:https://doi.org/10.14445/22315381/IJETT-V68I7P201S

Research Article | Open Access

Volume 68 | Issue 7 | Year 2020 | Article Id. IJETT-V68I7P201S | DOI : https://doi.org/10.14445/22315381/IJETT-V68I7P201S

Citation :

Basavesha D, Dr. Y S Nijagunarya, "Soft Computing based Duplicate Text Identification in Online Community Websites," International Journal of Engineering Trends and Technology (IJETT), vol. 68, no. 7, pp. 1-7, 2020. Crossref, https://doi.org/10.14445/22315381/IJETT-V68I7P201S

Abstract

As the number of social media websites and applications grows, the volume and speed of data generation grow with it, and so does the likelihood of duplicates in the data. Duplicates reduce data quality and degrade the accuracy of final results. Identifying and removing them is therefore a necessary step in data preprocessing and data integration. In this paper we present an extensive review of the state-of-the-art literature on duplicate text identification. The survey covers work on duplicate data identification, duplicate text identification, and duplicate record identification. We discuss the generalized step-by-step procedure for duplicate text identification that most researchers follow. We describe word embedding techniques, similarity estimation techniques, and soft computing techniques such as neural networks, fuzzy logic, evolutionary algorithms, Bayesian networks, and support vector machines. We summarize the state-of-the-art works in three categories: duplicate question identification on Quora and Stack Overflow, text identification in documents, and record identification in small and large datasets. Finally, we discuss the metrics used to measure the performance of models developed for duplicate identification.
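As a rough illustration of the similarity estimation step mentioned above, the following minimal sketch compares two texts by the cosine similarity of their bag-of-words vectors. It is not the procedure of any surveyed work: the function names, the tokenization rule, and the 0.8 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_duplicate(q1: str, q2: str, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when similarity reaches the threshold.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return cosine_similarity(bag_of_words(q1), bag_of_words(q2)) >= threshold
```

A purely lexical measure like this misses paraphrases that share few words, which is one reason word embeddings and learned models are also considered for this task.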

Keywords

Duplicate text, soft computing, neural network, fuzzy logic, bag-of-words.
