Comparing PMI-based to Cluster-based Arabic Single Document Summarization Approaches

International Journal of Engineering Trends and Technology (IJETT)
  
© 2014 by IJETT Journal
Volume-11 Number-8
Year of Publication : 2014
Authors : Madeeh Nayer El-Gedawy
DOI : 10.14445/22315381/IJETT-V11P274

Citation 

Madeeh Nayer El-Gedawy. "Comparing PMI-based to Cluster-based Arabic Single Document Summarization Approaches", International Journal of Engineering Trends and Technology (IJETT), V11(8), 379-383, May 2014. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.

Abstract

In this paper, two extractive techniques are applied to the Arabic single-document text summarization (SDS) problem: the first uses K-Means clustering, and the second uses mutual information (MI), which is widely used in text mining to measure the co-occurrence of two words. A successful Arabic document summarization algorithm should identify the noteworthy sentences in a document as accurately as possible. The terms used in the document (its distinct words) represent the document's identity, and a Term-Sentence Matrix (TSM) is used instead of a Bag of Words (BoW). In the first approach, the text themes are extracted using K-Means, and then one sentence per cluster is chosen for the summary using TFIDF weights. In the second approach, pointwise mutual information (PMI) is used to assign a weight to each cell of the TSM, and the resulting weighted matrix is used to extract a summary of the document. Experiments show that the cluster-based approach performs slightly better than the PMI-based one, but if the end user can tune the summary percentage to an appropriate level, the PMI-based approach becomes slightly better.
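
The PMI-based approach described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`build_tsm`, `ppmi_weights`, `summarize`), the whitespace tokenizer, the clipping of negative PMI values to zero, and the column-sum sentence score are all assumptions made for illustration. A real Arabic pipeline would first apply stop-word removal and stemming (for example with the Khoja stemmer named in the keywords).

```python
import math
from collections import Counter


def build_tsm(sentences):
    """Build a Term-Sentence Matrix: counts[i][j] = occurrences of terms[i] in sentence j.

    Assumption: plain whitespace tokenization stands in for real preprocessing.
    """
    tokenized = [s.lower().split() for s in sentences]
    terms = sorted({tok for toks in tokenized for tok in toks})
    index = {t: i for i, t in enumerate(terms)}
    counts = [[0] * len(sentences) for _ in terms]
    for j, toks in enumerate(tokenized):
        for tok, c in Counter(toks).items():
            counts[index[tok]][j] = c
    return terms, counts


def ppmi_weights(counts):
    """Replace each cell with PMI(term, sentence), clipped at zero (an assumption).

    PMI(t, s) = log2( p(t, s) / (p(t) * p(s)) ), with probabilities estimated
    from the matrix counts.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weights = []
    for i, row in enumerate(counts):
        w_row = []
        for j, c in enumerate(row):
            if c == 0:
                w_row.append(0.0)  # term absent from sentence: no association
            else:
                pmi = math.log2(c * total / (row_sums[i] * col_sums[j]))
                w_row.append(max(0.0, pmi))
        weights.append(w_row)
    return weights


def summarize(sentences, ratio=0.3):
    """Keep the top `ratio` share of sentences, ranked by summed cell weights.

    Assumption: a sentence's score is the sum of its column in the weighted TSM.
    Selected sentences are returned in their original document order.
    """
    _, counts = build_tsm(sentences)
    weights = ppmi_weights(counts)
    scores = [sum(col) for col in zip(*weights)]
    k = max(1, round(ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda j: scores[j], reverse=True)
    return [sentences[j] for j in sorted(ranked[:k])]
```

The `ratio` parameter plays the role of the user-tunable summary percentage mentioned in the abstract; for example, `summarize(sentences, ratio=0.5)` keeps roughly half of the sentences.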

Keywords
Text Summarization, PMI, K-Means, Khoja Stemmer, Similarity Measures, TFIDF, Pre-processing, Clusters, Sentence Ranking.