A Survey on Improving the Clustering Performance in Text Mining for Efficient Information Retrieval

**Citation**

S. Saranya, R. Munieswari. "A Survey on Improving the Clustering Performance in Text Mining for Efficient Information Retrieval", International Journal of Engineering Trends and Technology (IJETT), V8(5), 249-256, February 2014. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.

**Abstract**

In recent years, the development of information systems in fields such as business, academia, and medicine has steadily increased the amount of stored data. A vast majority of this data is stored in documents that are virtually unstructured. Text mining helps people process large volumes of information by imposing structure on text. Clustering is a popular technique for automatically organizing a large collection of text. However, in real application domains, the experimenter often possesses background knowledge that can help in clustering the data. Traditional clustering techniques are poorly suited to multiple data types and cannot handle sparse, high-dimensional data. Co-clustering techniques overcome these limitations by clustering documents and words simultaneously, addressing both deficiencies. Semantic understanding has become an essential ingredient of information extraction, and it is incorporated by adopting constraints as a semi-supervised learning strategy. This survey reviews the constrained co-clustering strategies that researchers have adopted to boost clustering performance. Experimental results on the 20-Newsgroups dataset show that the proposed method is effective for clustering textual documents. Furthermore, the proposed algorithm consistently outperformed all existing constrained clustering and co-clustering methods under different conditions.

**Keywords**

Clustering Techniques, Co-Clustering, Constrained Clustering, Semi-Supervised Learning, Text Mining.