Abstract: The rapid growth of data has provided us with more information, yet it challenges the traditional techniques for extracting useful knowledge. In this paper we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification methods, aiming to overcome their disadvantages and to cope with large amounts of data. In this model, we treat the training set as a weighted, undirected, complete graph. The vertices are objects, and the weight of an edge between two objects is their distance, which may be measured by any suitable distance metric, not only Euclidean distance. We then find a minimum spanning tree forest of the graph, in which each tree represents a class. To reduce computing time, we extract the most representative points of each tree to represent that tree. The resulting reduced point sets can be used for classification by computing the distances from unlabeled objects to them. The MCMM model is implemented on the Hadoop platform using its MapReduce programming framework. Since Hadoop supports data-intensive distributed applications and enables applications to work with thousands of nodes and petabytes of data, the MCMM model scales to massive data. In addition, MapReduce and Hadoop work well on clusters of commodity machines, so no particular hardware or architecture is required. This is a characteristic feature of cloud computing: by using Hadoop and MapReduce, the MCMM model can run on a cloud platform and benefit from it. Experiments were carried out on several data sets, including real-world data from the UCI repository and synthetic data, using a Downing 4000 cluster with Hadoop installed. The results show that the MCMM model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
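The pipeline described above (per-class spanning trees, representative points, nearest-representative classification) can be illustrated with a small single-machine sketch. It is not the paper's MapReduce implementation. It assumes one tree is built per class label, and it uses a hypothetical selection rule, keeping the `k` vertices of highest MST degree, because the abstract does not state how representative points are chosen. The names `mst_edges`, `representatives`, `fit`, and `predict` are illustrative only.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Default metric; any other distance function can be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mst_edges(points, dist):
    """Prim's algorithm on the complete graph over `points`.

    Returns the n-1 tree edges as index pairs (i, j).
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def representatives(points, dist, k):
    """Assumed rule (not from the paper): keep the k highest-degree MST vertices."""
    degree = [0] * len(points)
    for i, j in mst_edges(points, dist):
        degree[i] += 1
        degree[j] += 1
    order = sorted(range(len(points)), key=lambda i: -degree[i])
    return [points[i] for i in order[:k]]


def fit(X, y, dist=euclidean, k=2):
    """Build one tree per class and reduce it to at most k representative points."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    return {label: representatives(pts, dist, k) for label, pts in by_class.items()}


def predict(model, x, dist=euclidean):
    """Assign the class whose nearest representative is closest to x."""
    return min(model, key=lambda label: min(dist(x, r) for r in model[label]))
```

A usage example on toy data: `model = fit([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], ["a", "a", "a", "b", "b", "b"], k=1)` followed by `predict(model, (1, 1))`. Because classification only compares the query against the reduced point sets, its cost depends on the number of representatives rather than on the full training set size.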