Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop
Hadoop, an open-source implementation of the MapReduce framework, is widely used for processing massive-scale data in parallel. Because Hadoop stores its data in a distributed file system, HDFS, it often suffers from the data locality problem: when a processing node does not hold a required data block in its local storage, the block must be copied to that node, and this copying degrades performance. In this paper, we present an Adaptive Data Replication scheme based on Access count Prediction (ADRAP) for the Hadoop framework to address the data locality problem. The proposed scheme predicts the next access count of each data file by applying Lagrange's interpolation to its previous access counts. Using the predicted access count, the scheme selectively decides whether to generate a new replica or to use the already loaded data as a cache, thereby optimizing the replication factor. Furthermore, we provide a replica placement algorithm that improves data locality effectively. Performance evaluations show that our adaptive data replication scheme reduces the task completion time in the map phase by 9.6% on average, compared to the default data replication setting in Hadoop. With regard to data locality, our scheme increases node locality by 6.1% and decreases rack and rack-off locality by 45.6% and 56.5%, respectively.
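The prediction step described above can be illustrated with a minimal sketch. It evaluates the Lagrange interpolating polynomial through past (time, access count) samples at the next time step and then applies a simple threshold rule. The function names `lagrange_predict` and `choose_action`, the `threshold` parameter, and the rule itself are assumptions made for illustration; the abstract does not specify the paper's actual decision criteria or sampling window.

```python
def lagrange_predict(times, counts, t_next):
    """Evaluate the Lagrange polynomial through (times[i], counts[i]) at t_next.

    Hypothetical helper: the choice of sample points is an assumption.
    """
    if len(times) != len(counts) or not times:
        raise ValueError("times and counts must be non-empty and equal in length")
    if len(set(times)) != len(times):
        raise ValueError("sample times must be distinct")

    total = 0.0
    for i, (t_i, c_i) in enumerate(zip(times, counts)):
        # Build the i-th Lagrange basis polynomial evaluated at t_next.
        term = float(c_i)
        for j, t_j in enumerate(times):
            if j != i:
                term *= (t_next - t_j) / (t_i - t_j)
        total += term
    # An access count cannot be negative, so clamp the extrapolated value.
    return max(total, 0.0)


def choose_action(predicted_count, threshold):
    """Assumed rule: replicate when the predicted demand exceeds a threshold,
    otherwise reuse the already loaded data as a cache."""
    return "replicate" if predicted_count > threshold else "cache"


if __name__ == "__main__":
    # Access counts observed in three past intervals (made-up sample values).
    predicted = lagrange_predict([1, 2, 3], [2, 4, 6], 4)
    print(predicted, choose_action(predicted, threshold=5))
```

Polynomial extrapolation can swing sharply when many samples are used, so a sketch like this would normally be fed only the few most recent access counts.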