
Journal of Shandong University (Natural Science) ›› 2023, Vol. 58 ›› Issue (12): 41-51. doi: 10.6040/j.issn.1671-9352.4.2022.3492


Three-way over-sampling method for imbalanced data classification

Yu FANG*, Huyu ZHENG, Xuemei CAO

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Received: 2022-08-02  Online: 2023-12-20  Published: 2023-12-19
  • Contact: Yu FANG, E-mail: fangyu@swpu.edu.cn
  • Author biography: FANG Yu (1983— ), male, associate professor; his research interests include rough sets, three-way decision, granular computing, and cost-sensitive learning. E-mail: fangyu@swpu.edu.cn
  • Supported by: National Natural Science Foundation of China (62006200); Central Government Guiding Local Science and Technology Development Project (2021ZYD0003); the Second Batch of Industry-University Collaborative Education Projects of 2021; Southwest Petroleum University First-class Undergraduate Course Cultivation Project of 2021 (X2021YLKC035); Southwest Petroleum University Graduate English-medium Course Construction Project (2020QY04)

Abstract:

This paper combines three-way decisions with the synthetic minority over-sampling technique (SMOTE) to propose a new sampling method, three-way over-sampling (3WOS). 3WOS builds a three-way decisions model over all samples and selects the samples in the boundary region of that model as key samples for SMOTE over-sampling, which effectively alleviates the sample aggregation and separation problems and improves classifier performance to a certain extent. The method first applies three-way decisions and support vector data description (SVDD) to the minority class to partition all samples into three regions. Second, it finds the k nearest minority-class neighbors of each key sample and synthesizes a new sample for each key sample by linear interpolation, forming new minority-class samples. Finally, the updated sample set is used to train the classifier. Experimental results show that 3WOS achieves higher classification accuracy, F-measure and G-mean, and lower misclassification cost, on the base classifiers than the compared methods.

Key words: imbalanced data, three-way decision, support vector data description, SMOTE, classification
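The pipeline summarized in the abstract (three-way partition of the samples, then SMOTE-style linear interpolation on the boundary-region "key" samples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: real 3WOS fits an SVDD hypersphere, whereas this sketch approximates the sphere by the minority-class centroid and the mean distance to it, and the function name and the two region thresholds are hypothetical.

```python
import numpy as np

def three_way_oversample(X_min, k=5, inner=0.8, outer=1.2, rng=None):
    """Simplified sketch of 3WOS applied to the minority class X_min.

    The enclosing hypersphere (SVDD in the paper) is approximated here by
    the minority centroid and the mean distance to it. Two thresholds on
    that radius induce a three-way partition:
      - inside  (dist < inner * r): positive region, clearly minority;
      - boundary (inner * r <= dist <= outer * r): key samples for SMOTE;
      - outside (dist > outer * r): negative region, likely noise/overlap.
    One synthetic sample is generated per key sample by linear
    interpolation toward one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(rng)
    center = X_min.mean(axis=0)
    dist = np.linalg.norm(X_min - center, axis=1)
    r = dist.mean()
    key = X_min[(dist >= inner * r) & (dist <= outer * r)]  # boundary region
    synthetic = []
    for x in key:
        # k nearest minority neighbors of x (index 0 is x itself)
        d = np.linalg.norm(X_min - x, axis=1)
        nn = X_min[np.argsort(d)[1:k + 1]]
        neighbor = nn[rng.integers(len(nn))]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(x + gap * (neighbor - x))  # SMOTE linear interpolation
    return np.array(synthetic).reshape(-1, X_min.shape[1])
```

A classifier would then be trained on the original samples together with the returned synthetic minority samples.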

CLC number: TP181

Figure 1

Overall framework of the 3WS model

Figure 2

Overall framework of the 3WOS model

Table 1

Summary description of the datasets

Dataset  Attributes  Samples  Imbalance ratio
Optical Digits (OD)  64  5 620  9.1:1
Satimage (Sat)  36  6 435  9.3:1
Pen Digits (PD)  16  10 992  9.4:1
Sick Euthyroid (SE)  25  3 163  9.8:1
Spectrometer (Spe)  93  531  11:1
Scene (Sce)  294  2 407  13:1
Arrhythmia (Arr)  279  452  17:1
Wine Quality (WQ)  11  4 898  26:1
Ozone Level (OL)  72  2 536  34:1
Mammography (Mam)  6  11 183  42:1
Abalone19 (A19)  8  4 177  130:1

Table 2

Confusion matrix

Actual class  Predicted positive  Predicted negative
Positive  TP  FN
Negative  FP  TN
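The measures reported in Tables 4-6 (accuracy, F-measure, G-mean) all derive from the four confusion-matrix counts above; a small sketch using their standard definitions (function name hypothetical):

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Accuracy, F-measure and G-mean from Table-2 style counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    return accuracy, f_measure, g_mean
```

G-mean balances performance on both classes, which is why it is preferred over plain accuracy for imbalanced data.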

Table 3

Cost matrix

Actual class  Predicted positive  Predicted negative
Positive  C(+, +)  C(-, +)
Negative  C(+, -)  C(-, -)
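Figure 5 compares total misclassification cost. With the C(predicted, actual) notation of Table 3, that cost is typically the cost-weighted sum over the confusion-matrix counts; a sketch with a hypothetical function name:

```python
def decision_cost(confusion, cost):
    """Total decision cost.

    confusion: counts keyed (predicted, actual), e.g.
      {('+', '+'): TP, ('-', '+'): FN, ('+', '-'): FP, ('-', '-'): TN}
    cost: same keys in the C(predicted, actual) notation of Table 3,
      e.g. cost[('-', '+')] is the penalty for missing a positive.
    """
    return sum(confusion[k] * cost[k] for k in confusion)
```

Correct decisions are usually assigned zero cost, i.e. C(+, +) = C(-, -) = 0, so the total reduces to FN and FP penalties.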

Figure 3

Comparison of classification performance under different parameters on the Arrhythmia dataset

Figure 4

Comparison of classification performance under different parameters on the Satimage dataset

Table 4

Accuracy comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.970 7 0.860 8 0.947 8 0.976 1 0.869 4 0.954 4 0.971 5 0.850 1 0.962 7 0.976 7 0.874 5 0.961 5
Sat 0.836 9 0.707 1 0.729 8 0.840 7 0.711 4 0.733 7 0.835 9 0.715 8 0.750 5 0.835 7 0.749 5 0.838 8
PD 0.984 5 0.678 4 0.929 3 0.987 4 0.680 6 0.933 2 0.981 2 0.709 2 0.901 3 0.988 1 0.957 1 0.975 7
SE 0.793 2 0.668 7 0.890 7 0.785 8 0.696 8 0.911 0 0.762 1 0.694 6 0.901 2 0.892 8 0.801 3 0.943 6
Spe 0.878 7 0.808 4 0.889 7 0.894 5 0.720 9 0.905 3 0.877 8 0.808 3 0.898 8 0.904 1 0.837 6 0.919 4
Sce 0.753 6 0.679 5 0.830 1 0.782 2 0.681 5 0.842 4 0.797 0 0.699 1 0.852 8 0.798 6 0.702 1 0.858 3
Arr 0.638 1 0.705 6 0.941 5 0.675 7 0.732 8 0.944 7 0.658 3 0.759 6 0.948 0 0.709 6 0.792 0 0.963 9
WQ 0.733 4 0.710 9 0.718 1 0.744 2 0.723 7 0.733 3 0.741 5 0.745 4 0.734 4 0.750 8 0.739 2 0.739 2
OL 0.849 6 0.763 5 0.893 5 0.870 6 0.818 7 0.907 3 0.930 9 0.867 8 0.949 2 0.881 4 0.810 8 0.911 7
Mam 0.892 6 0.800 1 0.876 0 0.901 1 0.805 3 0.883 1 0.912 5 0.713 5 0.897 8 0.915 6 0.815 9 0.899 1
A19 0.768 1 0.623 6 0.802 9 0.781 2 0.628 6 0.809 2 0.926 1 0.810 8 0.941 8 0.776 3 0.632 2 0.820 7

Table 5

G-mean comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.970 7 0.860 0 0.929 4 0.976 2 0.868 7 0.933 2 0.971 2 0.849 6 0.899 2 0.976 6 0.873 6 0.975 7
Sat 0.828 7 0.670 9 0.711 4 0.833 8 0.673 2 0.716 8 0.819 5 0.681 9 0.738 4 0.825 1 0.709 3 0.830 3
PD 0.984 4 0.673 9 0.929 4 0.987 4 0.674 7 0.933 2 0.981 0 0.708 6 0.899 2 0.988 1 0.957 1 0.975 7
SE 0.781 3 0.624 3 0.889 6 0.770 6 0.655 0 0.910 4 0.737 0 0.649 2 0.899 3 0.891 2 0.792 3 0.943 8
Spe 0.879 8 0.805 1 0.887 9 0.891 8 0.699 7 0.904 1 0.876 5 0.787 8 0.894 4 0.899 3 0.828 7 0.919 1
Sce 0.745 0 0.679 5 0.829 0 0.775 5 0.689 8 0.840 6 0.796 0 0.697 4 0.851 0 0.795 1 0.697 7 0.856 6
Arr 0.627 7 0.671 0 0.942 1 0.645 0 0.708 1 0.944 1 0.627 3 0.730 6 0.947 1 0.667 2 0.763 1 0.962 9
WQ 0.729 0 0.705 3 0.717 1 0.741 5 0.719 3 0.732 5 0.734 8 0.741 5 0.732 4 0.747 8 0.738 2 0.738 8
OL 0.848 2 0.759 0 0.893 0 0.868 1 0.812 7 0.906 7 0.929 3 0.858 6 0.948 2 0.880 1 0.803 6 0.910 5
Mam 0.891 6 0.798 6 0.875 8 0.900 6 0.803 6 0.882 9 0.912 2 0.694 5 0.897 0 0.915 3 0.813 6 0.899 0
A19 0.758 4 0.545 1 0.802 6 0.761 9 0.544 5 0.807 2 0.924 0 0.789 8 0.940 5 0.754 3 0.557 1 0.818 3

Table 6

F-measure comparison of the algorithms

Dataset ROS SMOTE BSMO 3WOS
SVM NB LR SVM NB LR SVM NB LR SVM NB LR
OD 0.971 1 0.865 9 0.947 8 0.976 0 0.873 9 0.954 8 0.971 2 0.854 2 0.963 5 0.977 1 0.879 3 0.961 9
Sat 0.853 9 0.760 6 0.767 7 0.856 2 0.765 4 0.769 8 0.859 1 0.766 7 0.779 8 0.854 9 0.798 3 0.856 0
PD 0.984 8 0.650 8 0.929 8 0.987 5 0.674 7 0.933 1 0.981 7 0.717 4 0.907 0 0.988 3 0.957 2 0.975 5
SE 0.818 2 0.732 8 0.895 3 0.814 4 0.754 9 0.914 0 0.800 8 0.755 0 0.906 8 0.885 8 0.822 6 0.944 7
Spe 0.879 4 0.824 9 0.896 0 0.894 5 0.763 6 0.906 0 0.877 8 0.837 9 0.907 0 0.904 1 0.855 3 0.922 4
Sce 0.778 7 0.681 4 0.836 9 0.802 6 0.712 1 0.850 5 0.804 9 0.697 5 0.860 6 0.813 0 0.723 7 0.865 5
Arr 0.670 1 0.758 3 0.941 4 0.728 8 0.775 0 0.946 0 0.709 5 0.801 0 0.949 9 0.766 2 0.828 4 0.965 5
WQ 0.710 1 0.682 8 0.706 9 0.726 7 0.699 9 0.723 7 0.712 7 0.724 4 0.719 3 0.733 1 0.731 4 0.732 1
OL 0.856 9 0.781 5 0.896 8 0.878 6 0.834 8 0.910 3 0.934 5 0.882 7 0.951 3 0.886 9 0.829 1 0.915 8
Mam 0.887 8 0.806 0 0.874 1 0.898 0 0.813 9 0.880 6 0.914 1 0.753 1 0.901 6 0.913 8 0.824 9 0.898 0
A19 0.792 1 0.711 1 0.807 3 0.813 4 0.717 3 0.819 7 0.930 5 0.840 1 0.944 7 0.811 0 0.717 7 0.831 4

Figure 5

Misclassification cost comparison of the algorithms
