
Journal of Shandong University (Natural Science) ›› 2023, Vol. 58 ›› Issue (12): 41-51. doi: 10.6040/j.issn.1671-9352.4.2022.3492


Three-way over-sampling method for imbalanced data classification

Yu FANG*, Huyu ZHENG, Xuemei CAO

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Received: 2022-08-02  Online: 2023-12-20  Published: 2023-12-19
  • Contact: Yu FANG, E-mail: fangyu@swpu.edu.cn
  • Author biography: FANG Yu (1983— ), male, associate professor; his research interests include rough sets, three-way decision, granular computing, and cost-sensitive learning. E-mail: fangyu@swpu.edu.cn
  • Supported by: National Natural Science Foundation of China (62006200); Central Government Guiding Local Science and Technology Development Project (2021ZYD0003); the Second Batch of Industry-University Collaborative Education Projects of 2021; Southwest Petroleum University First-class Undergraduate Course Cultivation Project of 2021 (X2021YLKC035); Southwest Petroleum University Graduate English-medium Course Construction Project (2020QY04)

Abstract:

This paper combines three-way decisions with the synthetic minority over-sampling technique (SMOTE) to propose a new sampling method, three-way over-sampling (3WOS). 3WOS builds a three-way decisions model over all samples and selects the samples in the boundary region of that model as key samples for SMOTE over-sampling, which effectively alleviates the sample aggregation and separation problems and improves classifier performance to a certain extent. The method first applies three-way decisions and support vector data description (SVDD) to the minority class to partition all samples into three regions. Second, it finds the k nearest minority-class neighbors of each key sample and synthesizes a new sample for each key sample by linear interpolation, forming new minority-class samples. Finally, the updated sample set is used to train the classifier. Experimental results show that 3WOS achieves higher classification accuracy, F-measure and G-mean, and lower misclassification cost, on the base classifiers than the compared methods.

Key words: imbalanced data, three-way decision, support vector data description, SMOTE, classification
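The pipeline summarized in the abstract (three-way partition of the samples, then SMOTE-style linear interpolation on the boundary-region "key" samples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: real 3WOS fits an SVDD hypersphere, whereas this sketch approximates the sphere by the minority-class centroid and the mean distance to it, and the function name and the two region thresholds are hypothetical.

```python
import numpy as np

def three_way_oversample(X_min, k=5, inner=0.8, outer=1.2, rng=None):
    """Simplified sketch of 3WOS applied to the minority class X_min.

    The enclosing hypersphere (SVDD in the paper) is approximated here by
    the minority centroid and the mean distance to it. Two thresholds on
    that radius induce a three-way partition:
      - inside  (dist < inner * r): positive region, clearly minority;
      - boundary (inner * r <= dist <= outer * r): key samples for SMOTE;
      - outside (dist > outer * r): negative region, likely noise/overlap.
    One synthetic sample is generated per key sample by linear
    interpolation toward one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(rng)
    center = X_min.mean(axis=0)
    dist = np.linalg.norm(X_min - center, axis=1)
    r = dist.mean()
    key = X_min[(dist >= inner * r) & (dist <= outer * r)]  # boundary region
    synthetic = []
    for x in key:
        # k nearest minority neighbors of x (index 0 is x itself)
        d = np.linalg.norm(X_min - x, axis=1)
        nn = X_min[np.argsort(d)[1:k + 1]]
        neighbor = nn[rng.integers(len(nn))]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(x + gap * (neighbor - x))  # SMOTE linear interpolation
    return np.array(synthetic).reshape(-1, X_min.shape[1])
```

A classifier would then be trained on the original samples together with the returned synthetic minority samples.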

CLC number: TP181

Figure 1

Overall framework of the 3WS model

Figure 2

Overall framework of the 3WOS model

Table 1

Summary description of the datasets

Dataset  Attributes  Samples  Imbalance ratio
Optical Digits (OD)  64  5 620  9.1:1
Satimage (Sat)  36  6 435  9.3:1
Pen Digits (PD)  16  10 992  9.4:1
Sick Euthyroid (SE)  25  3 163  9.8:1
Spectrometer (Spe)  93  531  11:1
Scene (Sce)  294  2 407  13:1
Arrhythmia (Arr)  279  452  17:1
Wine Quality (WQ)  11  4 898  26:1
Ozone Level (OL)  72  2 536  34:1
Mammography (Mam)  6  11 183  42:1
Abalone19 (A19)  8  4 177  130:1

Table 2

Confusion matrix

Actual class  Predicted positive  Predicted negative
Positive  TP  FN
Negative  FP  TN
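The measures reported in Tables 4-6 (accuracy, F-measure, G-mean) all derive from the four confusion-matrix counts above; a small sketch using their standard definitions (function name hypothetical):

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Accuracy, F-measure and G-mean from Table-2 style counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    return accuracy, f_measure, g_mean
```

G-mean balances performance on both classes, which is why it is preferred over plain accuracy for imbalanced data.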

Table 3

Cost matrix

Actual class  Predicted positive  Predicted negative
Positive  C(+, +)  C(-, +)
Negative  C(+, -)  C(-, -)
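Figure 5 compares total misclassification cost. With the C(predicted, actual) notation of Table 3, that cost is typically the cost-weighted sum over the confusion-matrix counts; a sketch with a hypothetical function name:

```python
def decision_cost(confusion, cost):
    """Total decision cost.

    confusion: counts keyed (predicted, actual), e.g.
      {('+', '+'): TP, ('-', '+'): FN, ('+', '-'): FP, ('-', '-'): TN}
    cost: same keys in the C(predicted, actual) notation of Table 3,
      e.g. cost[('-', '+')] is the penalty for missing a positive.
    """
    return sum(confusion[k] * cost[k] for k in confusion)
```

Correct decisions are usually assigned zero cost, i.e. C(+, +) = C(-, -) = 0, so the total reduces to FN and FP penalties.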

Figure 3

Comparison of classification performance under different parameters on the Arrhythmia dataset

Figure 4

Comparison of classification performance under different parameters on the Satimage dataset

Table 4

Accuracy comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.970 7 0.860 8 0.947 8 0.976 1 0.869 4 0.954 4 0.971 5 0.850 1 0.962 7 0.976 7 0.874 5 0.961 5
Sat 0.836 9 0.707 1 0.729 8 0.840 7 0.711 4 0.733 7 0.835 9 0.715 8 0.750 5 0.835 7 0.749 5 0.838 8
PD 0.984 5 0.678 4 0.929 3 0.987 4 0.680 6 0.933 2 0.981 2 0.709 2 0.901 3 0.988 1 0.957 1 0.975 7
SE 0.793 2 0.668 7 0.890 7 0.785 8 0.696 8 0.911 0 0.762 1 0.694 6 0.901 2 0.892 8 0.801 3 0.943 6
Spe 0.878 7 0.808 4 0.889 7 0.894 5 0.720 9 0.905 3 0.877 8 0.808 3 0.898 8 0.904 1 0.837 6 0.919 4
Sce 0.753 6 0.679 5 0.830 1 0.782 2 0.681 5 0.842 4 0.797 0 0.699 1 0.852 8 0.798 6 0.702 1 0.858 3
Arr 0.638 1 0.705 6 0.941 5 0.675 7 0.732 8 0.944 7 0.658 3 0.759 6 0.948 0 0.709 6 0.792 0 0.963 9
WQ 0.733 4 0.710 9 0.718 1 0.744 2 0.723 7 0.733 3 0.741 5 0.745 4 0.734 4 0.750 8 0.739 2 0.739 2
OL 0.849 6 0.763 5 0.893 5 0.870 6 0.818 7 0.907 3 0.930 9 0.867 8 0.949 2 0.881 4 0.810 8 0.911 7
Mam 0.892 6 0.800 1 0.876 0 0.901 1 0.805 3 0.883 1 0.912 5 0.713 5 0.897 8 0.915 6 0.815 9 0.899 1
A19 0.768 1 0.623 6 0.802 9 0.781 2 0.628 6 0.809 2 0.926 1 0.810 8 0.941 8 0.776 3 0.632 2 0.820 7

Table 5

G-mean comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.970 7 0.860 0 0.929 4 0.976 2 0.868 7 0.933 2 0.971 2 0.849 6 0.899 2 0.976 6 0.873 6 0.975 7
Sat 0.828 7 0.670 9 0.711 4 0.833 8 0.673 2 0.716 8 0.819 5 0.681 9 0.738 4 0.825 1 0.709 3 0.830 3
PD 0.984 4 0.673 9 0.929 4 0.987 4 0.674 7 0.933 2 0.981 0 0.708 6 0.899 2 0.988 1 0.957 1 0.975 7
SE 0.781 3 0.624 3 0.889 6 0.770 6 0.655 0 0.910 4 0.737 0 0.649 2 0.899 3 0.891 2 0.792 3 0.943 8
Spe 0.879 8 0.805 1 0.887 9 0.891 8 0.699 7 0.904 1 0.876 5 0.787 8 0.894 4 0.899 3 0.828 7 0.919 1
Sce 0.745 0 0.679 5 0.829 0 0.775 5 0.689 8 0.840 6 0.796 0 0.697 4 0.851 0 0.795 1 0.697 7 0.856 6
Arr 0.627 7 0.671 0 0.942 1 0.645 0 0.708 1 0.944 1 0.627 3 0.730 6 0.947 1 0.667 2 0.763 1 0.962 9
WQ 0.729 0 0.705 3 0.717 1 0.741 5 0.719 3 0.732 5 0.734 8 0.741 5 0.732 4 0.747 8 0.738 2 0.738 8
OL 0.848 2 0.759 0 0.893 0 0.868 1 0.812 7 0.906 7 0.929 3 0.858 6 0.948 2 0.880 1 0.803 6 0.910 5
Mam 0.891 6 0.798 6 0.875 8 0.900 6 0.803 6 0.882 9 0.912 2 0.694 5 0.897 0 0.915 3 0.813 6 0.899 0
A19 0.758 4 0.545 1 0.802 6 0.761 9 0.544 5 0.807 2 0.924 0 0.789 8 0.940 5 0.754 3 0.557 1 0.818 3

Table 6

F-measure comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.971 1 0.865 9 0.947 8 0.976 0 0.873 9 0.954 8 0.971 2 0.854 2 0.963 5 0.977 1 0.879 3 0.961 9
Sat 0.853 9 0.760 6 0.767 7 0.856 2 0.765 4 0.769 8 0.859 1 0.766 7 0.779 8 0.854 9 0.798 3 0.856 0
PD 0.984 8 0.650 8 0.929 8 0.987 5 0.674 7 0.933 1 0.981 7 0.717 4 0.907 0 0.988 3 0.957 2 0.975 5
SE 0.818 2 0.732 8 0.895 3 0.814 4 0.754 9 0.914 0 0.800 8 0.755 0 0.906 8 0.885 8 0.822 6 0.944 7
Spe 0.879 4 0.824 9 0.896 0 0.894 5 0.763 6 0.906 0 0.877 8 0.837 9 0.907 0 0.904 1 0.855 3 0.922 4
Sce 0.778 7 0.681 4 0.836 9 0.802 6 0.712 1 0.850 5 0.804 9 0.697 5 0.860 6 0.813 0 0.723 7 0.865 5
Arr 0.670 1 0.758 3 0.941 4 0.728 8 0.775 0 0.946 0 0.709 5 0.801 0 0.949 9 0.766 2 0.828 4 0.965 5
WQ 0.710 1 0.682 8 0.706 9 0.726 7 0.699 9 0.723 7 0.712 7 0.724 4 0.719 3 0.733 1 0.731 4 0.732 1
OL 0.856 9 0.781 5 0.896 8 0.878 6 0.834 8 0.910 3 0.934 5 0.882 7 0.951 3 0.886 9 0.829 1 0.915 8
Mam 0.887 8 0.806 0 0.874 1 0.898 0 0.813 9 0.880 6 0.914 1 0.753 1 0.901 6 0.913 8 0.824 9 0.898 0
A19 0.792 1 0.711 1 0.807 3 0.813 4 0.717 3 0.819 7 0.930 5 0.840 1 0.944 7 0.811 0 0.717 7 0.831 4

Figure 5

Misclassification cost comparison of the algorithms
