Comparing PMI-based to Cluster-based Arabic Single Document Summarization Approaches
Madeeh Nayer El-Gedawy. "Comparing PMI-based to Cluster-based Arabic Single Document Summarization Approaches", International Journal of Engineering Trends and Technology (IJETT), V11(8), 379-383, May 2014. ISSN: 2231-5381. Published by Seventh Sense Research Group.
In this paper, two extractive techniques are applied to the Arabic single-document text summarization (SDS) problem: the first uses K-Means clustering, and the second uses mutual information (MI), which is widely used in text mining to measure the co-occurrence of two words. A successful Arabic document summarization algorithm should identify the noteworthy sentences in a document as accurately as possible. The terms used in the document (its distinct words) represent the document's identity, and a Term-Sentence Matrix (TSM) is used instead of a Bag of Words (BoW) representation. In the first approach, the text themes are extracted using K-Means, and then one sentence per cluster is chosen for the summary using TFIDF weights. In the second approach, pointwise mutual information (PMI) is used to assign a weight to each cell of the TSM, and the resulting weighted matrix is used to extract a summary of the document. Experiments show that the cluster-based approach performs slightly better than the PMI-based one, but if the end user can tune the summary percentage to an appropriate level, the PMI-based approach performs slightly better.
Text Summarization, PMI, K-Means, Khoja Stemmer, Similarity Measures, TFIDF, Pre-processing, Clusters, Sentence Ranking.
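The PMI weighting of the Term-Sentence Matrix described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula or the sentence-selection rule, so everything below is an assumption: PMI is computed from token counts as log2(p(t,s) / (p(t) p(s))), negative values are clipped to zero (positive PMI), and each sentence is scored by the sum of its column. The function names `pmi_term_sentence_matrix` and `summarize`, and the `ratio` parameter standing in for the summary percentage, are hypothetical. Input sentences are assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter


def pmi_term_sentence_matrix(sentences):
    """Build a term-sentence matrix weighted by positive PMI.

    `sentences` is a list of token lists (assumed already pre-processed).
    Returns (terms, matrix) with matrix[i][j] the weight of term i in sentence j.
    """
    counts = [Counter(tokens) for tokens in sentences]
    total = sum(len(tokens) for tokens in sentences)  # all tokens in the document
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)
    terms = sorted(term_totals)

    matrix = []
    for term in terms:
        row = []
        for tokens, c in zip(sentences, counts):
            if c[term] == 0:
                row.append(0.0)  # term absent from this sentence
                continue
            p_ts = c[term] / total           # joint probability of term and sentence
            p_t = term_totals[term] / total  # marginal probability of the term
            p_s = len(tokens) / total        # marginal probability of the sentence
            # Assumption: clip negative PMI to zero (positive PMI).
            row.append(max(0.0, math.log2(p_ts / (p_t * p_s))))
        matrix.append(row)
    return terms, matrix


def summarize(sentences, ratio=0.3):
    """Return indices of the selected sentences, in document order.

    Assumption: a sentence's score is the sum of its PMI weights, and
    `ratio` is the fraction of sentences to keep (at least one).
    """
    terms, matrix = pmi_term_sentence_matrix(sentences)
    scores = [sum(matrix[i][j] for i in range(len(terms)))
              for j in range(len(sentences))]
    k = max(1, round(ratio * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)
```

In this sketch, raising `ratio` plays the role of the user-tunable summary percentage: a term that is concentrated in one sentence receives a high weight there, while a term spread evenly across the document contributes little to any sentence.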