A Clustering-Based Enhanced Classification Algorithm for Imbalanced Data

Fund Project:

Guangdong Universities Outstanding Young Innovative Talent Training Project (2013LYM_0097); Research on the Foshan Intelligent Education Evaluation Index System (DX20120220); Foshan University university-level research project.

    Abstract:

    Imbalanced data are common in real-world applications, and classifying them is an active research topic in machine learning. Traditional classification algorithms recognize the minority class poorly on imbalanced datasets; to address this, the paper proposes a clustering-based improved AdaBoost classification algorithm. The algorithm first performs clustering-based undersampling: it runs K-means clustering on the majority class samples, extracts the cluster centroids, and combines as many centroids as there are minority class samples with all minority class samples to form a new balanced training set. Because too few minority class samples would make this training set too small and lower classification accuracy, SMOTE (Synthetic Minority Oversampling Technique) oversampling is combined with the clustering-based undersampling. Next, drawing on cost-sensitive learning, the classification error function of the AdaBoost base classifier is modified to assign asymmetric misclassification losses to samples of different classes. Experimental results show that the algorithm makes the training samples more representative and improves the classification accuracy of the minority class while maintaining overall classification performance.
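The two main steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the plain Lloyd's K-means, the label convention (0 = majority, 1 = minority), and the cost values are all assumptions made for the example, and the abstract does not state the exact form of the modified error function. The SMOTE step is omitted.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centroids(points, k, iterations=50, seed=0):
    """Plain Lloyd's K-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct sample positions (requires k <= len(points)).
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Replace the majority class with as many centroids as there are minority samples."""
    centroids = kmeans_centroids(majority, k=len(minority), seed=seed)
    X = centroids + [list(p) for p in minority]
    y = [0] * len(centroids) + [1] * len(minority)  # 0 = majority, 1 = minority
    return X, y


def cost_weighted_error(y_true, y_pred, weights, minority_cost=2.0, majority_cost=1.0):
    """Weighted error where misclassifying a minority sample costs more.

    Assumed form: the per-class costs are hypothetical, and the result is
    normalised by the total cost-weighted mass so it stays in [0, 1].
    """
    def cost(label):
        return minority_cost if label == 1 else majority_cost

    total = sum(w * cost(t) for t, w in zip(y_true, weights))
    wrong = sum(w * cost(t) for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total
```

Here `cluster_undersample` returns a training set with equal numbers of majority centroids and minority samples, and `cost_weighted_error` could stand in for the usual weighted error when computing a base classifier's vote in an AdaBoost-style loop. With `minority_cost` greater than `majority_cost`, a base classifier that misses minority samples is penalised more heavily than one that misses the same weight of majority samples.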

Cite this article

Citation format
HU Xiaosheng, ZHANG Runjing, ZHONG Yong. A Clustering-Based Enhanced Classification Algorithm for Imbalanced Data[J]. Journal of Integration Technology, 2014, 3(2): 35-41

History
  • Published online: 2014-04-01