Three-way over-sampling method for imbalanced data classification

doi:10.6040/j.issn.1671-9352.4.2022.3492

Abstract

This paper proposes a new sampling method, three-way over-sampling (3WOS), which combines three-way decisions with the synthetic minority over-sampling technique (SMOTE). 3WOS constructs a three-way decision model over all samples and selects the samples in the boundary region of that model as key samples for SMOTE oversampling. This effectively alleviates the problems of sample aggregation and separation and improves classifier performance to a certain extent. First, the method applies three-way decisions and support vector data description to the minority class samples to divide all samples into three parts. Second, it finds the k nearest minority class neighbors of every key sample and uses linear interpolation to synthesize new samples for each key sample, which together form the new minority class samples. Finally, the updated sample set is used to train the classifier. Experimental results show that, on the base classifiers, 3WOS achieves better classification accuracy, F-measure, and G-mean, and lower cost, than the other methods.

Key words: imbalanced data, three-way decision, support vector data description, SMOTE, classification
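
The interpolation step described in the abstract can be illustrated with a minimal sketch. It assumes the key (boundary region) samples have already been identified, so the three-way decision and support vector data description stage is not reproduced here. The function name `oversample_key_samples` and the parameters `key_idx`, `n_per_key`, and `seed` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def oversample_key_samples(minority, key_idx, k=5, n_per_key=1, seed=0):
    """SMOTE-style interpolation restricted to given key samples.

    minority  : array-like of shape (n_minority, n_features)
    key_idx   : indices (into `minority`) of the key samples; assumed to be
                supplied by an earlier boundary-region selection step
    k         : number of nearest minority neighbours to consider
    n_per_key : synthetic samples generated per key sample (hypothetical knob)
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples

    synthetic = []
    for i in key_idx:
        # Euclidean distance from key sample i to every minority sample
        dist = np.linalg.norm(minority - minority[i], axis=1)
        dist[i] = np.inf  # exclude the key sample itself
        neighbours = np.argsort(dist)[:k]
        for _ in range(n_per_key):
            j = rng.choice(neighbours)  # pick one of the k nearest neighbours
            gap = rng.random()          # interpolation weight in [0, 1)
            synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic).reshape(-1, minority.shape[1])
```

Each synthetic point lies on the line segment between a key sample and one of its minority class neighbours, so restricting `key_idx` to boundary samples concentrates the new points near the class boundary rather than spreading them over the whole minority class.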

CLC number: TP181

FANG Yu, ZHENG Huyu, CAO Xuemei. Three-way over-sampling method for imbalanced data classification[J]. Journal of Shandong University (Natural Science), 2023, 58(12): 41-51.

Dataset	Number of attributes	Number of samples	Imbalance ratio
Optical Digits (OD)	64	5 620	9.1∶1
SatImage (Sat)	36	6 435	9.3∶1
Pen Digits (PD)	16	10 992	9.4∶1
Sick Euthyroid (SE)	25	3 163	9.8∶1
Spectrometer (Spe)	93	531	11∶1
Scene (Sce)	294	2 407	13∶1
Arrhythmia (Arr)	279	452	17∶1
Wine Quality (WQ)	11	4 898	26∶1
Ozone Level (OL)	72	2 536	34∶1
Mammography (Mam)	6	11 183	42∶1
Abalone19 (A19)	8	4 177	130∶1
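
To make the imbalance ratios above concrete, the following sketch recovers approximate class sizes from a sample count and a majority-to-minority ratio. The helper name `class_counts` is an assumption introduced here, and because the tabulated ratios are rounded, the results are estimates only.

```python
def class_counts(n_samples, ratio):
    """Estimate (majority, minority) sizes from a total and a ratio r:1.

    With majority = r * minority, the total is (r + 1) * minority,
    so minority is roughly n_samples / (r + 1).
    """
    minority = round(n_samples / (ratio + 1))
    return n_samples - minority, minority


# Abalone19: 4 177 samples at roughly 130:1
print(class_counts(4177, 130))
```

At a ratio of about 130∶1, only a few dozen of the 4 177 Abalone19 samples belong to the minority class, which is why the choice of which minority samples to oversample matters.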

Actual class	Predicted positive	Predicted negative
Positive	TP	FN
Negative	FP	TN
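
The evaluation measures named in the abstract can be computed from the four confusion matrix counts. The sketch below uses the commonly used definitions of accuracy, F-measure (with equal weight on precision and recall), and G-mean; the paper's exact formulas are not shown in this section, so these definitions and the function name `imbalance_metrics` are assumptions.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Accuracy, F-measure and G-mean from confusion matrix counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one actual negative, and one predicted positive).
    """
    recall = tp / (tp + fn)          # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical counts, chosen only for illustration
print(imbalance_metrics(tp=40, fn=10, fp=20, tn=930))
```

With these hypothetical counts the accuracy is 0.97 while the F-measure is noticeably lower, which illustrates why accuracy alone can be misleading on imbalanced data.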

Actual class	Predicted positive	Predicted negative
Positive	C(+, +)	C(-, +)
Negative	C(+, -)	C(-, -)
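
A total misclassification cost can be obtained by weighting each confusion matrix count with the matching cost matrix entry. The sketch below reads C(i, j) as the cost of predicting class i when the actual class is j, which is inferred from the table layout; the cost values and the function name `total_cost` are hypothetical.

```python
def total_cost(tp, fn, fp, tn, cost):
    """Sum of confusion matrix counts weighted by cost[(predicted, actual)]."""
    return (
        tp * cost[("+", "+")]    # actual positive, predicted positive
        + fn * cost[("-", "+")]  # actual positive, predicted negative
        + fp * cost[("+", "-")]  # actual negative, predicted positive
        + tn * cost[("-", "-")]  # actual negative, predicted negative
    )


# Hypothetical costs: missing a positive is ten times worse than a false alarm
costs = {("+", "+"): 0, ("-", "+"): 10, ("+", "-"): 1, ("-", "-"): 0}
print(total_cost(tp=40, fn=10, fp=20, tn=930, cost=costs))  # 10*10 + 20*1 = 120
```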

Dataset	ROS			SMOTE			BSMO			3WOS
Dataset	SVM	NB	LR	SVM	NB	LR	SVM	NB	LR	SVM	NB	LR
OD	0.970 7	0.860 8	0.947 8	0.976 1	0.869 4	0.954 4	0.971 5	0.850 1	0.962 7	0.976 7	0.874 5	0.961 5
Sat	0.836 9	0.707 1	0.729 8	0.840 7	0.711 4	0.733 7	0.835 9	0.715 8	0.750 5	0.835 7	0.749 5	0.838 8
PD	0.984 5	0.678 4	0.929 3	0.987 4	0.680 6	0.933 2	0.981 2	0.709 2	0.901 3	0.988 1	0.957 1	0.975 7
SE	0.793 2	0.668 7	0.890 7	0.785 8	0.696 8	0.911 0	0.762 1	0.694 6	0.901 2	0.892 8	0.801 3	0.943 6
Spe	0.878 7	0.808 4	0.889 7	0.894 5	0.720 9	0.905 3	0.877 8	0.808 3	0.898 8	0.904 1	0.837 6	0.919 4
Sce	0.753 6	0.679 5	0.830 1	0.782 2	0.681 5	0.842 4	0.797 0	0.699 1	0.852 8	0.798 6	0.702 1	0.858 3
Arr	0.638 1	0.705 6	0.941 5	0.675 7	0.732 8	0.944 7	0.658 3	0.759 6	0.948 0	0.709 6	0.792 0	0.963 9
WQ	0.733 4	0.710 9	0.718 1	0.744 2	0.723 7	0.733 3	0.741 5	0.745 4	0.734 4	0.750 8	0.739 2	0.739 2
OL	0.849 6	0.763 5	0.893 5	0.870 6	0.818 7	0.907 3	0.930 9	0.867 8	0.949 2	0.881 4	0.810 8	0.911 7
Mam	0.892 6	0.800 1	0.876 0	0.901 1	0.805 3	0.883 1	0.912 5	0.713 5	0.897 8	0.915 6	0.815 9	0.899 1
A19	0.768 1	0.623 6	0.802 9	0.781 2	0.628 6	0.809 2	0.926 1	0.810 8	0.941 8	0.776 3	0.632 2	0.820 7

Dataset	ROS			SMOTE			BSMO			3WOS
Dataset	SVM	NB	LR	SVM	NB	LR	SVM	NB	LR	SVM	NB	LR
OD	0.970 7	0.860 0	0.929 4	0.976 2	0.868 7	0.933 2	0.971 2	0.849 6	0.899 2	0.976 6	0.873 6	0.975 7
Sat	0.828 7	0.670 9	0.711 4	0.833 8	0.673 2	0.716 8	0.819 5	0.681 9	0.738 4	0.825 1	0.709 3	0.830 3
PD	0.984 4	0.673 9	0.929 4	0.987 4	0.674 7	0.933 2	0.981 0	0.708 6	0.899 2	0.988 1	0.957 1	0.975 7
SE	0.781 3	0.624 3	0.889 6	0.770 6	0.655 0	0.910 4	0.737 0	0.649 2	0.899 3	0.891 2	0.792 3	0.943 8
Spe	0.879 8	0.805 1	0.887 9	0.891 8	0.699 7	0.904 1	0.876 5	0.787 8	0.894 4	0.899 3	0.828 7	0.919 1
Sce	0.745 0	0.679 5	0.829 0	0.775 5	0.689 8	0.840 6	0.796 0	0.697 4	0.851 0	0.795 1	0.697 7	0.856 6
Arr	0.627 7	0.671 0	0.942 1	0.645 0	0.708 1	0.944 1	0.627 3	0.730 6	0.947 1	0.667 2	0.763 1	0.962 9
WQ	0.729 0	0.705 3	0.717 1	0.741 5	0.719 3	0.732 5	0.734 8	0.741 5	0.732 4	0.747 8	0.738 2	0.738 8
OL	0.848 2	0.759 0	0.893 0	0.868 1	0.812 7	0.906 7	0.929 3	0.858 6	0.948 2	0.880 1	0.803 6	0.910 5
Mam	0.891 6	0.798 6	0.875 8	0.900 6	0.803 6	0.882 9	0.912 2	0.694 5	0.897 0	0.915 3	0.813 6	0.899 0
A19	0.758 4	0.545 1	0.802 6	0.761 9	0.544 5	0.807 2	0.924 0	0.789 8	0.940 5	0.754 3	0.557 1	0.818 3