The Effective Role of Hubness in Clustering High-Dimensional Data

International Journal of Engineering Trends and Technology (IJETT)
© 2016 by IJETT Journal
Volume 33, Number 1
Year of Publication: 2016
Authors: Mrs. C. Navamani, Mr. N. Sakthivel


Mrs. C. Navamani, Mr. N. Sakthivel, "The Effective Role of Hubness in Clustering High-Dimensional Data", International Journal of Engineering Trends and Technology (IJETT), V33(1), 33-38, March 2016. ISSN: 2231-5381. Published by Seventh Sense Research Group.

High-dimensional data arise naturally in many domains and regularly present a great challenge to traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper, we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by restricting attention to a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, the tendency of high-dimensional data to contain points (hubs) that frequently occur in the k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster configurations. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise. The proposed methods are tailored mostly toward detecting approximately hyperspherical clusters and need to be extended to properly handle clusters of arbitrary shapes.
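To make the core idea concrete, the following is a minimal sketch (not the authors' exact algorithms) of the two ingredients described above: a hubness score N_k(x), counting how often each point appears in the k-nearest-neighbor lists of other points, and a simple clustering step that uses the highest-hubness points as cluster prototypes. The brute-force distance computation and the function names are illustrative assumptions.

```python
import numpy as np

def hubness_scores(X, k=5):
    """N_k(x): how many times each point occurs in the k-NN lists of others."""
    n = len(X)
    # pairwise squared Euclidean distances (brute force, fine for small n)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]   # k nearest neighbors of each point
    return np.bincount(knn.ravel(), minlength=n)

def hub_prototypes(X, n_clusters, k=5):
    """Take the highest-hubness points as prototypes and assign every
    point to its nearest prototype."""
    scores = hubness_scores(X, k)
    hubs = np.argsort(scores)[::-1][:n_clusters]
    d = ((X[:, None, :] - X[hubs][None, :, :]) ** 2).sum(-1)
    return hubs, d.argmin(axis=1)

# two well-separated synthetic clusters in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])
hubs, labels = hub_prototypes(X, n_clusters=2)
```

Because hubs tend to lie near cluster centers in high-dimensional data, they are cheap prototype candidates; a centroid-based method such as k-means could alternatively use them only as seeds and then iterate.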


