Content Oriented Automatic Text Categorization

N.Balasundaraganapathy; P.Yuvarasu; Mr.Prabhaka

doi:https://doi.org/10.14445/22315381/IJETT-V1I1P28

Volume 1 | Issue 1 | Year 2011 | Article Id. IJETT-V1I1P28 | DOI : https://doi.org/10.14445/22315381/IJETT-V1I1P28

Citation :

N.Balasundaraganapathy, P.Yuvarasu, Mr.Prabhaka, "Content Oriented Automatic Text Categorization," International Journal of Engineering Trends and Technology (IJETT), vol. 1, no. 1, pp. 128-130, 2011. Crossref, https://doi.org/10.14445/22315381/IJETT-V1I1P28

Abstract

This project implements a web spam classifier that, given a web page, analyzes its features and tries to determine whether the page is spam. The efficiency of the classifier is compared to the results of spam detection in text datasets using a Naïve Bayes classifier. Text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with NLP and clustering algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. The proposed tf.rf method shows a consistently better performance, while other supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method has not shown a uniformly good performance across different data sets.

Keywords

ATC, HTTP, QLA, ECML
