A Performance Analysis of Hadoop in Heterogeneous Cloud Computing Environments
Abstract: Cloud computing services, such as Amazon's Elastic Compute Cloud (EC2), are seeing increasingly wide adoption; they bring virtualization technology into traditional data centers so that computing resources can be allocated on demand. Meanwhile, Hadoop, an open-source implementation of Google's MapReduce, a distributed parallel computing model for large-scale data, is attracting growing research and use in both academia and industry. An open research question is how to combine the heterogeneous underlying infrastructure of a cloud platform with Hadoop's upper-layer computing model effectively, that is, how to use the elastic resources of the cloud to take full advantage of Hadoop's scalability, fault tolerance, and ability to run on commodity hardware. In this paper, we carry out a series of experiments to evaluate and analyze the performance of Hadoop on our heterogeneous cloud computing testbed. We show that, because of the high I/O overhead of virtual machines, the performance of Hadoop in this environment drops sharply compared with a traditional cluster in which every node is a physical machine. Our work can serve as a basis for research on improving the performance of Hadoop in heterogeneous cloud computing environments.