基于最小生成树的大规模数据分类模型及其MapReduce实现

黄 鑫; 罗 军

doi:10.12146/j.issn.2095-3135.201303011

基于最小生成树的大规模数据分类模型及其MapReduce实现

黄鑫,
罗军

A Classification Model for Massive Data Based on Minimum Spanning Tree with MapReduce Implementation

Huang xin,
Luo Jun

摘要

摘要: 数据的快速增长, 为我们提供了更多的信息, 然而, 也对传统信息获取技术提出了挑战。这篇论文提出了MCMM算法, 它是基于MapReduce的大规模数据分类模型的最小生成树（MST）的算法。它可以看做是介于传统的KNN方法和基于聚类分类方法之间的模型, 旨在克服这两种方法的不足并能处理大规模的数据。在这一模型中, 训练集作为有权重的无向完全图来处理。顶点是对象, 两点之间边的权重是对象间的距离。这一距离, 不同于欧几里得距离, 它是一个特定的距离度量。这样, 可以找到图中最小生成树集, 其中, 图中每棵树代表一个类。为了降低时间复杂度, 提取了每棵树中最具代表性的点来代表该树。这些压缩了的点集, 可以通过计算无标签对象和它们之间的距离, 来进行分类。MCMM模型基于MapReduce实现并且部署在Hadoop平台。该模型可扩展处理大规模的数据, 是因为Hadoop支持数据密集分布应用, 并且这些应用可以和数以千计的节点和数据一起运作。另外, MapReduce 和Hadoop能在由商品机组成的集群上很好的运行。MCMM模型使用云平台并且通过使用MapReduce 和Hadoop进行云计算是有益处的。实验采用的数据集包括从UCI数据库得到的真实数据和一些模拟数据, 实验使用了4000个集群。实验表明, MCMM模型在精确度和扩展性上优于KNN和其他一些经常使用的基础分类方法。

Abstract: The Rapid growth of data has provided us with more information, yet challenges the traditional techniques to extract the useful knowledge. In this paper, MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with MapReduce implementation is proposed. It can be viewed as an intermediate model between the traditional K nearest neighbor method and cluster based classification method, aiming to overcome their disadvantages and cope with large amount of data. In this model, we treat the training set as weighted undirected complete graph. The vertices are objects and the weight of an edge between two objects is their distance, which could be a certain distance metric other than Euclidean distance. Then we find a minimum spanning tree forest of the graph, in which each tree represents a class. In order to reduce the computing time, we extract the most representative points of each tree to represent that tree. The shrunk point sets can be used for classification by computing the distances from unlabeled objects to them.MCMM model is implemented on Hadoop platform, using its MapReduce programming framework. Since Hadoop supports data intensive distributed applications and enables applications to work with thousands of nodes and petabytes of data, MCMM model is scalable to deal with massive data. In addition, MapReduce and Hadoop work well on cluster composed of commodity machines. Therefore there is no special need for particular hardware or architecture. This is actually the feature of cloud computing. MCMM model is used on cloud platform and could benefit from cloud computing by using Hadoop and MapReduce. Experiments had been carried out on several data sets including real world data from UCI repository and synthetic data, using Downing 4000 cluster, installed with Hadoop. The results show that MCMM model outperforms KNN and some other classification methods on a general basis with respect to accuracy and scalability.

HTML全文

参考文献(0)

施引文献

资源附件(0)