Big Data Preprocessing: Needs and Methods

Sandeep Dalal; Vandna Dahiya

doi:https://doi.org/10.14445/22315381/IJETT-V68I10P217

Volume 68 | Issue 10 | Year 2020 | Article Id. IJETT-V68I10P217 | DOI : https://doi.org/10.14445/22315381/IJETT-V68I10P217

Citation:

Sandeep Dalal, Vandna Dahiya, "Big Data Preprocessing: Needs and Methods," International Journal of Engineering Trends and Technology (IJETT), vol. 68, no. 10, pp. 100-104, 2020. Crossref, https://doi.org/10.14445/22315381/IJETT-V68I10P217

Abstract

Big data is an assemblage of large and complex data that is difficult to process with traditional DBMS tools. Its scale, diversity, and complexity demand new analytics techniques to extract the useful, hidden value it contains. Data must be prepared before mining begins, because real-world data is sometimes unsuitable for mining in its raw form, and poor-quality data leads to poor results. This paper presents the needs, problems, and solutions for preprocessing big data.

Keywords

Big data, Discretization, MapReduce, Preprocessing.
