Extraction of Core Contents from Web Pages

International Journal of Engineering Trends and Technology (IJETT)
  
© 2014 by IJETT Journal
Volume-8 Number-9                          
Year of Publication : 2014
Authors : Sandeep Sirsat
DOI : 10.14445/22315381/IJETT-V8P285

Citation 

Sandeep Sirsat. "Extraction of Core Contents from Web Pages", International Journal of Engineering Trends and Technology (IJETT), V8(9), 484-489, February 2014. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.

Abstract

Most of the information available on web pages consists of semi-structured text documents represented in XML, HTML, or XHTML, formats that lack a formal document structure. Such a document does not distinguish between the text and the schema that represents it, and the amount of structure used to represent the text depends on the purpose and size of the document. No semantics are applied to semi-structured documents. The core content of a document must therefore be extracted before its words or sentences can be analysed to generate useful knowledge. This paper discusses several techniques and approaches for extracting core content from semi-structured text documents, along with their merits and demerits.
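
As a purely illustrative sketch of what a tag-based approach to this problem can look like (it is not the method of any particular system discussed in the paper), the following Python code separates text from markup in an HTML page and keeps the container holding the most text. The tag sets `SKIP_TAGS` and `BLOCK_TAGS`, the class `CoreContentParser`, and the function `extract_core_content` are hypothetical names chosen for this example, and the "longest block wins" rule is an assumed heuristic.

```python
from html.parser import HTMLParser

# Assumptions for this sketch: tags treated as non-content, and tags
# treated as candidate content containers.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"div", "article", "section", "td"}


class CoreContentParser(HTMLParser):
    """Collects visible text separately for each block-level container."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped tag
        self.stack = []       # indices of the currently open blocks
        self.blocks = []      # one list of text fragments per block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0 and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is credited to the innermost open block only.
        text = data.strip()
        if text and self.skip_depth == 0 and self.stack:
            self.blocks[self.stack[-1]].append(text)


def extract_core_content(html: str) -> str:
    """Return the text of the block with the most characters (heuristic)."""
    parser = CoreContentParser()
    parser.feed(html)
    parser.close()
    candidates = [" ".join(parts) for parts in parser.blocks]
    return max(candidates, key=len, default="")


if __name__ == "__main__":
    page = (
        "<html><body><nav><div>Home | About</div></nav>"
        "<div id='main'><p>Core text of the article.</p>"
        "<p>Second paragraph.</p></div>"
        "<div>Ad</div><script>var x = 1;</script></body></html>"
    )
    print(extract_core_content(page))
```

A heuristic this simple fails on pages where navigation or comment sections carry more text than the article itself, which is the kind of weakness that motivates the tree-based and natural-language-based techniques named in the keywords.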

Keywords
Information Extraction, tag based, tree based, Natural Language Processing, Wrappers.