A Classification Model for Massive Data Based on Minimum Spanning Tree with MapReduce Implementation

doi:10.12146/j.issn.2095-3135.201303011

Abstract:

The rapid growth of data provides more information, yet it challenges traditional techniques for extracting useful knowledge. This paper proposes MCMM, a minimum spanning tree (MST) based classification model for massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K-nearest-neighbor (KNN) method and cluster-based classification methods, aiming to overcome their disadvantages and to cope with large amounts of data. In this model, the training set is treated as a weighted, undirected, complete graph. The vertices are objects, and the weight of an edge between two objects is their distance, which may be measured by a metric other than Euclidean distance. We then find a minimum spanning tree forest of the graph, in which each tree represents a class. To reduce computing time, we extract the most representative points of each tree to represent that tree. The reduced point sets can then be used for classification by computing the distances from unlabeled objects to them. MCMM is implemented on the Hadoop platform using its MapReduce programming framework. Because Hadoop supports data-intensive distributed applications and enables them to work with thousands of nodes and petabytes of data, MCMM scales to massive data. In addition, MapReduce and Hadoop work well on clusters of commodity machines, so no special hardware or architecture is required. This is a characteristic of cloud computing, and MCMM can run on a cloud platform and benefit from it through Hadoop and MapReduce. Experiments were carried out on several data sets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster with Hadoop installed. The results show that MCMM generally outperforms KNN and some other classification methods in accuracy and scalability.
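The pipeline described in the abstract (build a tree per class, keep a few representative points, classify by distance to them) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation and omits MapReduce entirely. Two points are assumptions, since the abstract does not specify them: the forest is obtained here by building one MST per class label, and the "most representative points" are chosen by the hypothetical rule of keeping the `k` vertices with the highest MST degree. The names `prim_mst`, `representatives`, `fit`, and `classify` are illustrative only.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Default metric; any other distance function can be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prim_mst(points, dist=euclidean):
    """Return the MST edges (i, j) of the complete graph on `points`.

    Uses the O(n^2) array form of Prim's algorithm, which suits a
    complete graph because every pair of vertices is connected.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n
    edges = []
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def representatives(points, k, dist=euclidean):
    """ASSUMPTION: keep the k points with the highest degree in the MST.

    The abstract does not say how representative points are selected.
    """
    degree = [0] * len(points)
    for i, j in prim_mst(points, dist):
        degree[i] += 1
        degree[j] += 1
    order = sorted(range(len(points)), key=lambda i: -degree[i])
    return [points[i] for i in order[:k]]


def fit(train, k=2, dist=euclidean):
    """ASSUMPTION: one tree per class label. Returns label -> representatives."""
    by_label = defaultdict(list)
    for x, y in train:
        by_label[y].append(x)
    return {y: representatives(pts, k, dist) for y, pts in by_label.items()}


def classify(model, x, dist=euclidean):
    """Assign x the label of its nearest representative point."""
    return min(model, key=lambda y: min(dist(x, r) for r in model[y]))


# Usage on a toy training set of (point, label) pairs.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b")]
model = fit(train, k=2)
label = classify(model, (0.5, 0.5))
```

Because classification only touches the reduced point sets, its cost depends on `k` times the number of classes rather than on the size of the training set, which is the saving the abstract attributes to extracting representative points.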

Get Citation

Huang Xin, Luo Jun. A Classification Model for Massive Data Based on Minimum Spanning Tree with MapReduce Implementation[J]. Journal of Integration Technology, 2013, 2(2): 69-82.

History

Online: January 7, 2015