
Journal of Shandong University (Natural Science) ›› 2023, Vol. 58 ›› Issue (12): 41-51. doi: 10.6040/j.issn.1671-9352.4.2022.3492


Three-way over-sampling method for imbalanced data classification

Yu FANG*, Huyu ZHENG, Xuemei CAO

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Received: 2022-08-02 Online: 2023-12-20 Published: 2023-12-19
  • Contact: Yu FANG E-mail: fangyu@swpu.edu.cn
  • About the author: Yu FANG (born 1983), male, associate professor. Research interests: rough sets, three-way decisions, granular computing, and cost-sensitive learning. E-mail: fangyu@swpu.edu.cn
  • Funding:
    National Natural Science Foundation of China (62006200); Central Government Guided Local Science and Technology Development Special Project (2021ZYD0003); Second Batch of 2021 Industry-University Cooperative Education Projects; Southwest Petroleum University 2021 First-Class Undergraduate Course Cultivation and Construction Project (X2021YLKC035); Southwest Petroleum University Postgraduate English-Taught Course Construction Project (2020QY04)

Abstract:

This paper combines three-way decisions with the synthetic minority over-sampling technique (SMOTE) to propose a new sampling method, three-way over-sampling (3WOS). 3WOS constructs a three-way decision model over all samples and selects the samples in the boundary region of the model as key samples for SMOTE over-sampling. This effectively alleviates the problems of sample aggregation and separation and improves classifier performance to a certain extent. First, the method applies three-way decisions and support vector data description (SVDD) to the minority-class samples to divide all samples into three regions. Second, it finds the k nearest minority-class neighbors of every key sample and uses linear interpolation to synthesize new samples for each key sample, which together form the new minority-class samples. Finally, the updated sample set is used to train the classifier. Experimental results show that, on the base classifiers, 3WOS achieves better classification accuracy, F-measure, and G-mean, and a lower cost, than the other methods.

Key words: imbalanced data, three-way decision, support vector data description, SMOTE, classification
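The three steps in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: a fixed sphere around the minority-class centroid stands in for the trained SVDD hypersphere, the thresholds `r_inner` and `r_outer` are hypothetical, and the key samples are assumed to be the minority-class samples that fall in the boundary region. The function names `trisect` and `smote_on_keys` are also made up for this sketch.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def trisect(samples, minority, r_inner, r_outer):
    """Split samples into positive / boundary / negative regions.

    Assumption: a sphere centred on the minority centroid replaces the
    learned SVDD hypersphere; r_inner and r_outer are hypothetical thresholds.
    """
    dim = len(minority[0])
    center = [sum(p[i] for p in minority) / len(minority) for i in range(dim)]
    regions = {"positive": [], "boundary": [], "negative": []}
    for s in samples:
        d = distance(s, center)
        if d <= r_inner:
            regions["positive"].append(s)
        elif d <= r_outer:
            regions["boundary"].append(s)
        else:
            regions["negative"].append(s)
    return regions


def smote_on_keys(key_samples, minority, k=3, per_key=2, seed=0):
    """SMOTE-style linear interpolation applied only to the key samples."""
    rng = random.Random(seed)
    synthetic = []
    for x in key_samples:
        # k nearest minority-class neighbours of the key sample (excluding itself)
        neighbours = sorted((m for m in minority if m != x),
                            key=lambda m: distance(x, m))[:k]
        if not neighbours:
            continue
        for _ in range(per_key):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation weight in [0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Toy usage: minority samples in the boundary region become key samples.
minority = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9], [0.1, 0.3]]
keys = trisect(minority, minority, r_inner=0.5, r_outer=0.8)["boundary"]
augmented_minority = minority + smote_on_keys(keys, minority)
print(len(keys), len(augmented_minority))
```

The augmented minority set would then be merged with the majority-class samples to train the classifier. Restricting interpolation to boundary-region samples is what distinguishes this from plain SMOTE, which interpolates from every minority sample.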

CLC number: TP181

Fig. 1 Overall framework of the 3WS model

Fig. 2 Overall framework of the 3WOS model

Table 1 Summary of the datasets

| Dataset | Attributes | Samples | Imbalance ratio |
|---|---|---|---|
| Optical Digits (OD) | 64 | 5620 | 9.1:1 |
| Satimage (Sat) | 36 | 6435 | 9.3:1 |
| Pen Digits (PD) | 16 | 10992 | 9.4:1 |
| Sick Euthyroid (SE) | 25 | 3163 | 9.8:1 |
| Spectrometer (Spe) | 93 | 531 | 11:1 |
| Scene (Sce) | 294 | 2407 | 13:1 |
| Arrhythmia (Arr) | 279 | 452 | 17:1 |
| Wine Quality (WQ) | 11 | 4898 | 26:1 |
| Ozone Level (OL) | 72 | 2536 | 34:1 |
| Mammography (Mam) | 6 | 11183 | 42:1 |
| Abalone19 (A19) | 8 | 4177 | 130:1 |
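For readers unfamiliar with the imbalance-ratio column in Table 1, the helper below shows one common way to compute it, assuming the ratio is the majority-class count divided by the minority-class count. The page does not state the exact definition used, and the function name `imbalance_ratio` and the toy labels are illustrative only.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (largest class count) / (smallest class count)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Toy labels: 91 majority samples and 10 minority samples.
labels = ["neg"] * 91 + ["pos"] * 10
print(f"{imbalance_ratio(labels):.1f}:1")
```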

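Table 2 Confusion matrix

| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |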
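The evaluation measures named in the abstract can be derived from the four confusion-matrix entries. The sketch below uses the standard textbook definitions (F-measure as the harmonic mean of precision and recall, G-mean as the geometric mean of the true positive and true negative rates). The paper's exact formulas are not shown on this page, so treat these as assumed definitions; the function name and the counts in the usage line are illustrative.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Accuracy, F-measure and G-mean from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "f_measure": f_measure,
        "g_mean": math.sqrt(recall * specificity),
    }


# Illustrative counts only; not taken from the paper.
print(classification_metrics(tp=8, fn=2, fp=10, tn=80))
```

On imbalanced data, accuracy alone can look high while the minority class is mostly missed, which is why G-mean and F-measure are reported alongside it.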

Table 3 Cost matrix

| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | C(+, +) | C(-, +) |
| Negative | C(+, -) | C(-, -) |
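A total misclassification cost can be obtained by weighting each confusion-matrix count with the matching entry of the cost matrix, reading C(predicted, actual) as in Table 3. The sketch below assumes that simple weighted sum; the cost values in the usage example are hypothetical, since the values used in the paper are not listed on this page.

```python
def total_cost(tp, fn, fp, tn, cost):
    """Weighted sum of confusion-matrix counts.

    `cost` maps (predicted, actual) pairs to a cost, mirroring C(predicted, actual).
    """
    return (tp * cost[("+", "+")]
            + fn * cost[("-", "+")]   # positive sample predicted negative
            + fp * cost[("+", "-")]   # negative sample predicted positive
            + tn * cost[("-", "-")])


# Hypothetical costs: missing a positive is ten times worse than a false alarm.
cost = {("+", "+"): 0, ("-", "+"): 10, ("+", "-"): 1, ("-", "-"): 0}
print(total_cost(tp=8, fn=2, fp=10, tn=80, cost=cost))  # 30
```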

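Fig. 3 Comparison of classification performance under different parameters on the Arrhythmia dataset

Fig. 4 Comparison of classification performance under different parameters on the Satimage dataset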

Table 4 Comparison of accuracy across algorithms

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9707 | 0.8608 | 0.9478 | 0.9761 | 0.8694 | 0.9544 | 0.9715 | 0.8501 | 0.9627 | 0.9767 | 0.8745 | 0.9615 |
| Sat | 0.8369 | 0.7071 | 0.7298 | 0.8407 | 0.7114 | 0.7337 | 0.8359 | 0.7158 | 0.7505 | 0.8357 | 0.7495 | 0.8388 |
| PD | 0.9845 | 0.6784 | 0.9293 | 0.9874 | 0.6806 | 0.9332 | 0.9812 | 0.7092 | 0.9013 | 0.9881 | 0.9571 | 0.9757 |
| SE | 0.7932 | 0.6687 | 0.8907 | 0.7858 | 0.6968 | 0.9110 | 0.7621 | 0.6946 | 0.9012 | 0.8928 | 0.8013 | 0.9436 |
| Spe | 0.8787 | 0.8084 | 0.8897 | 0.8945 | 0.7209 | 0.9053 | 0.8778 | 0.8083 | 0.8988 | 0.9041 | 0.8376 | 0.9194 |
| Sce | 0.7536 | 0.6795 | 0.8301 | 0.7822 | 0.6815 | 0.8424 | 0.7970 | 0.6991 | 0.8528 | 0.7986 | 0.7021 | 0.8583 |
| Arr | 0.6381 | 0.7056 | 0.9415 | 0.6757 | 0.7328 | 0.9447 | 0.6583 | 0.7596 | 0.9480 | 0.7096 | 0.7920 | 0.9639 |
| WQ | 0.7334 | 0.7109 | 0.7181 | 0.7442 | 0.7237 | 0.7333 | 0.7415 | 0.7454 | 0.7344 | 0.7508 | 0.7392 | 0.7392 |
| OL | 0.8496 | 0.7635 | 0.8935 | 0.8706 | 0.8187 | 0.9073 | 0.9309 | 0.8678 | 0.9492 | 0.8814 | 0.8108 | 0.9117 |
| Mam | 0.8926 | 0.8001 | 0.8760 | 0.9011 | 0.8053 | 0.8831 | 0.9125 | 0.7135 | 0.8978 | 0.9156 | 0.8159 | 0.8991 |
| A19 | 0.7681 | 0.6236 | 0.8029 | 0.7812 | 0.6286 | 0.8092 | 0.9261 | 0.8108 | 0.9418 | 0.7763 | 0.6322 | 0.8207 |
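To read the comparison tables systematically, a small helper can report which sampling method scores highest for a given dataset and classifier. The sketch below is only a reading aid with a made-up function name; the two example rows are the SVM accuracy values for OD and OL copied from Table 4.

```python
def best_method(scores):
    """Return the method name(s) with the highest score; ties are all returned."""
    top = max(scores.values())
    return sorted(name for name, value in scores.items() if value == top)


# SVM accuracy on two datasets, copied from Table 4.
od_svm = {"ROS": 0.9707, "SMOTE": 0.9761, "BSMO": 0.9715, "3WOS": 0.9767}
ol_svm = {"ROS": 0.8496, "SMOTE": 0.8706, "BSMO": 0.9309, "3WOS": 0.8814}
print(best_method(od_svm))  # ['3WOS']
print(best_method(ol_svm))  # ['BSMO']
```

As the second row shows, 3WOS does not lead on every dataset; BSMO scores higher on OL, which is worth keeping in mind when interpreting the overall claim in the abstract.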

Table 5 Comparison of G-mean across algorithms

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9707 | 0.8600 | 0.9294 | 0.9762 | 0.8687 | 0.9332 | 0.9712 | 0.8496 | 0.8992 | 0.9766 | 0.8736 | 0.9757 |
| Sat | 0.8287 | 0.6709 | 0.7114 | 0.8338 | 0.6732 | 0.7168 | 0.8195 | 0.6819 | 0.7384 | 0.8251 | 0.7093 | 0.8303 |
| PD | 0.9844 | 0.6739 | 0.9294 | 0.9874 | 0.6747 | 0.9332 | 0.9810 | 0.7086 | 0.8992 | 0.9881 | 0.9571 | 0.9757 |
| SE | 0.7813 | 0.6243 | 0.8896 | 0.7706 | 0.6550 | 0.9104 | 0.7370 | 0.6492 | 0.8993 | 0.8912 | 0.7923 | 0.9438 |
| Spe | 0.8798 | 0.8051 | 0.8879 | 0.8918 | 0.6997 | 0.9041 | 0.8765 | 0.7878 | 0.8944 | 0.8993 | 0.8287 | 0.9191 |
| Sce | 0.7450 | 0.6795 | 0.8290 | 0.7755 | 0.6898 | 0.8406 | 0.7960 | 0.6974 | 0.8510 | 0.7951 | 0.6977 | 0.8566 |
| Arr | 0.6277 | 0.6710 | 0.9421 | 0.6450 | 0.7081 | 0.9441 | 0.6273 | 0.7306 | 0.9471 | 0.6672 | 0.7631 | 0.9629 |
| WQ | 0.7290 | 0.7053 | 0.7171 | 0.7415 | 0.7193 | 0.7325 | 0.7348 | 0.7415 | 0.7324 | 0.7478 | 0.7382 | 0.7388 |
| OL | 0.8482 | 0.7590 | 0.8930 | 0.8681 | 0.8127 | 0.9067 | 0.9293 | 0.8586 | 0.9482 | 0.8801 | 0.8036 | 0.9105 |
| Mam | 0.8916 | 0.7986 | 0.8758 | 0.9006 | 0.8036 | 0.8829 | 0.9122 | 0.6945 | 0.8970 | 0.9153 | 0.8136 | 0.8990 |
| A19 | 0.7584 | 0.5451 | 0.8026 | 0.7619 | 0.5445 | 0.8072 | 0.9240 | 0.7898 | 0.9405 | 0.7543 | 0.5571 | 0.8183 |

Table 6 Comparison of F-measure across algorithms

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9711 | 0.8659 | 0.9478 | 0.9760 | 0.8739 | 0.9548 | 0.9712 | 0.8542 | 0.9635 | 0.9771 | 0.8793 | 0.9619 |
| Sat | 0.8539 | 0.7606 | 0.7677 | 0.8562 | 0.7654 | 0.7698 | 0.8591 | 0.7667 | 0.7798 | 0.8549 | 0.7983 | 0.8560 |
| PD | 0.9848 | 0.6508 | 0.9298 | 0.9875 | 0.6747 | 0.9331 | 0.9817 | 0.7174 | 0.9070 | 0.9883 | 0.9572 | 0.9755 |
| SE | 0.8182 | 0.7328 | 0.8953 | 0.8144 | 0.7549 | 0.9140 | 0.8008 | 0.7550 | 0.9068 | 0.8858 | 0.8226 | 0.9447 |
| Spe | 0.8794 | 0.8249 | 0.8960 | 0.8945 | 0.7636 | 0.9060 | 0.8778 | 0.8379 | 0.9070 | 0.9041 | 0.8553 | 0.9224 |
| Sce | 0.7787 | 0.6814 | 0.8369 | 0.8026 | 0.7121 | 0.8505 | 0.8049 | 0.6975 | 0.8606 | 0.8130 | 0.7237 | 0.8655 |
| Arr | 0.6701 | 0.7583 | 0.9414 | 0.7288 | 0.7750 | 0.9460 | 0.7095 | 0.8010 | 0.9499 | 0.7662 | 0.8284 | 0.9655 |
| WQ | 0.7101 | 0.6828 | 0.7069 | 0.7267 | 0.6999 | 0.7237 | 0.7127 | 0.7244 | 0.7193 | 0.7331 | 0.7314 | 0.7321 |
| OL | 0.8569 | 0.7815 | 0.8968 | 0.8786 | 0.8348 | 0.9103 | 0.9345 | 0.8827 | 0.9513 | 0.8869 | 0.8291 | 0.9158 |
| Mam | 0.8878 | 0.8060 | 0.8741 | 0.8980 | 0.8139 | 0.8806 | 0.9141 | 0.7531 | 0.9016 | 0.9138 | 0.8249 | 0.8980 |
| A19 | 0.7921 | 0.7111 | 0.8073 | 0.8134 | 0.7173 | 0.8197 | 0.9305 | 0.8401 | 0.9447 | 0.8110 | 0.7177 | 0.8314 |

Fig. 5 Comparison of misclassification costs across algorithms
