Speed-up Extension to Hadoop System
Citation
Sayali Ashok Shivarkar. "Speed-up Extension to Hadoop System," International Journal of Engineering Trends and Technology (IJETT), V12(2), 105-108, June 2014. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.
Abstract
For storing and analyzing very large online or streaming data, most organizations are moving toward Apache Hadoop and HDFS. Applications such as log processors and search engines use Hadoop MapReduce for computation and HDFS for storage. Hadoop is very popular for analyzing, storing, and processing very large data, but the system still needs many improvements. This paper addresses problems in data storage and data processing in order to improve Hadoop's processing speed and reduce task execution time. Hadoop applications require streaming access to data files, yet Hadoop's default placement policy does not consider any data characteristics when placing files. If related files are stored on the same set of nodes, efficiency can be increased and access latency reduced. Hadoop uses the MapReduce framework for large-scale distributed computing on unpredicted data sets. This process can perform duplicate computations, and there is no mechanism to identify them, which increases processing time. The proposed solution co-locates related files based on their content: a clustering-based locality-sensitive hashing algorithm tries to place related file streams on the same set of nodes without affecting Hadoop's default scalability and fault-tolerance properties. To avoid duplicate computation, a mechanism is developed that stores each executed task together with its result; before any task runs, it is compared against the stored tasks, and if a match is found the stored result is returned directly. Storing related files in the same cluster improves data locality, and avoiding repeated task execution reduces processing time; together these speed up Hadoop execution.
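The two mechanisms described in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation and does not touch HDFS or MapReduce. It assumes SimHash as the locality-sensitive hash (the abstract does not say which scheme is used), an arbitrary Hamming-distance threshold of 16 bits, and a cache key built from an operation name plus the input text. The names `simhash`, `assign_groups`, and `TaskResultCache` are hypothetical and chosen only for illustration.

```python
import hashlib


def simhash(text, bits=64):
    """SimHash fingerprint: texts sharing most tokens differ in few bits."""
    weights = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash of the token (Python's hash() is salted per run).
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def assign_groups(files, max_distance=16):
    """Incremental clustering: a file joins the first group whose
    representative fingerprint is within max_distance bits, otherwise it
    starts a new group. Each group stands for one set of nodes
    (assumption: the threshold of 16 is illustrative, not from the paper)."""
    groups = []  # list of (representative fingerprint, [file names])
    for name, content in files.items():
        fp = simhash(content)
        for rep, members in groups:
            if hamming(fp, rep) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((fp, [name]))
    return [members for _, members in groups]


class TaskResultCache:
    """Stores executed tasks with their results and returns the stored
    result when the same task (operation + input) is requested again."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def run(self, operation, input_data, func):
        raw = (operation + "\0" + input_data).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key in self._results:
            self.hits += 1  # duplicate task: skip execution
            return self._results[key]
        result = func(input_data)
        self._results[key] = result
        return result


if __name__ == "__main__":
    logs = {
        "a.log": "error disk full on node1",
        "b.log": "error disk full on node1",
    }
    print(assign_groups(logs))  # identical contents land in one group

    cache = TaskResultCache()
    count_words = lambda text: len(text.split())
    cache.run("wordcount", "a b c", count_words)
    cache.run("wordcount", "a b c", count_words)  # served from the cache
    print(cache.hits)
```

In this sketch a group stands in for a set of nodes that would hold the related files, and the cache lookup happens before execution, mirroring the compare-then-reuse step in the abstract. A real system would also need to invalidate cached results when the input data changes, which is left out here.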
Keywords
Hadoop, HDFS, MapReduce, Hashing Algorithm