Published Date Extraction System A semi-supervised approach of extraction
Citation
Nitin Kumar, Abhishek Pradhan, "Published Date Extraction System A semi-supervised approach of extraction", International Journal of Engineering Trends and Technology (IJETT), V45(2), 87-92, March 2017. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.
Abstract
Extracting meaningful or relevant dates, such as the published date, from an unstructured document is a vital part of information extraction and data mining. Current approaches use DOM (Document Object Model) manipulation for HTML documents, or regular expressions and rules applied to metadata, which are not very accurate across different types of publication. Recent work in this area has focused mainly on web pages and HTML pages, with reasonably good accuracy. Our approach builds on that work for HTML and additionally covers PDF documents, blog articles, and websites. It supports several types of documents, including news articles, patents, scientific articles and journals in PDF format, blogs, and websites. It can also learn over time and feed what it learns back into the system as a trained model. Our algorithm comprises both supervised and unsupervised steps and uses natural language processing techniques.
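To make the baseline mentioned above concrete, the following is a minimal Python sketch of a metadata-rule plus regular-expression extractor for HTML. It is not the semi-supervised method of this paper. The metadata key list `META_KEYS`, the helper names, and the restriction to ISO-style `YYYY-MM-DD` dates are all assumptions made for illustration only.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Assumed (hypothetical) set of metadata keys that often carry a published date.
META_KEYS = {"article:published_time", "date", "pubdate", "dc.date"}

# Matches YYYY-MM-DD that is not embedded in a longer run of digits.
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags whose name/property is in META_KEYS."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("property") or attr.get("name") or "").lower()
        if key in META_KEYS and attr.get("content"):
            self.candidates.append(attr["content"])


def parse_iso(text):
    """Return the first YYYY-MM-DD match in text as a date, or None."""
    match = ISO_DATE.search(text)
    if not match:
        return None
    try:
        return date(*(int(part) for part in match.groups()))
    except ValueError:
        # The first match is not a real calendar date (e.g. month 13).
        return None


def extract_published_date(html):
    """Rule 1: trusted <meta> tags. Rule 2: crude regex over the raw document."""
    parser = MetaDateParser()
    parser.feed(html)
    for value in parser.candidates:
        found = parse_iso(value)
        if found:
            return found
    return parse_iso(html)
```

Calling `extract_published_date` on a page that has an `article:published_time` meta tag returns the date from that tag. On a page without such a tag it falls back to the first ISO-style date anywhere in the markup, which illustrates why rule-based baselines of this kind can be unreliable across publication types.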
References
[1] Chen, Z., Ma, J., Rui, H., & Ren, Z. (2010, January). Web Page Publication Date Extraction and Application. Journal of Computational Information Systems.
[2] Prokhorenkova, L. O., Prokhorenkov, P., Samosvat, E., & Serdyukov, P. (2016). Publication Date Prediction through Reverse Engineering of the Web. WSDM 2016.
[3] Garcia-Fernandez, A., Ligozat, A.-L., Dinarelli, M., & Bernhard, D. (2011). When was it Written? Automatically Determining Publication Dates.
[4] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. Research and Advanced Technology for Digital Libraries, 13th European Conference, ECDL 2009. Corfu, Greece.
[5] Lopez, P. (2010). Automatic Extraction and Resolution of Bibliographical References in Patent Documents. Advances in Multidisciplinary Retrieval, First Information Retrieval Facility Conference, IRFC 2010. Vienna, Austria.
[6] Mario Lipinski, K. Y. (2013). Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents. (pp. 385-386). ACM/IEEE-CS.
[7] Gamon, M. (2004). Linguistic correlates of style: authorship classification with deep linguistic analysis features. COLING '04: Proceedings of the 20th International Conference on Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.
Keywords
CRF modelling for Segment extraction, data mining, information extraction, published date