Journal of Shandong University (Natural Science), 2023, Vol. 58, Issue (12): 41-51. doi: 10.6040/j.issn.1671-9352.4.2022.3492
Yu FANG*, Huyu ZHENG, Xuemei CAO
Abstract:
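Combining three-way decisions with the synthetic minority over-sampling technique (SMOTE), this paper proposes a new sampling method, three-way over-sampling (3WOS). 3WOS builds a three-way decision model over all samples and selects the samples in the model's boundary region as key samples for SMOTE over-sampling. This effectively alleviates the sample aggregation and separation problems and improves classifier performance to some extent. The method first applies three-way decisions and support vector data description (SVDD) to the minority-class samples, dividing all sample data into three regions. Second, it finds the k nearest minority-class neighbors of every key sample and synthesizes new samples for each key sample by linear interpolation, which yields the new minority-class sample set. Finally, the updated sample set is used to train the classifier. Experimental results show that, on the base classifiers, 3WOS achieves better classification accuracy, F-measure, and G-mean, and lower cost values, than the other methods.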
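The second step of the abstract (k nearest minority-class neighbors plus linear interpolation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `smote_on_key_samples`, its parameters, and the toy data are assumptions made for illustration, and the selection of key samples by three-way decisions and SVDD is not reproduced here. The hypothetical `key_idx` simply stands in for the indices of the boundary-region samples.

```python
import numpy as np


def smote_on_key_samples(minority, key_idx, k=5, n_per_key=1, seed=0):
    """SMOTE-style interpolation started only from the given key samples.

    minority  : array of shape (n, d), the minority-class samples
    key_idx   : indices into `minority` (hypothetical boundary-region samples)
    k         : number of nearest minority-class neighbors to consider
    n_per_key : synthetic samples generated per key sample
    """
    minority = np.asarray(minority, dtype=float)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k_eff = min(k, len(minority) - 1)
    synthetic = []
    for i in key_idx:
        # Euclidean distance from key sample i to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        dist[i] = np.inf  # a sample is not its own neighbor
        neighbors = np.argsort(dist)[:k_eff]
        for _ in range(n_per_key):
            j = rng.choice(neighbors)  # pick one of the k nearest neighbors
            gap = rng.random()         # interpolation weight in [0, 1)
            synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)


# Toy data: four minority points on the unit square plus one distant point.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
key_idx = [0, 3]  # assumed key samples; 3WOS would take these from the boundary region
new_samples = smote_on_key_samples(minority, key_idx, k=2, n_per_key=3)
augmented_minority = np.vstack([minority, new_samples])
```

Each synthetic point lies on the segment between a key sample and one of its nearest minority neighbors, so restricting `key_idx` to boundary-region samples concentrates the new points where the abstract says they matter. The augmented minority set would then be merged with the majority class before training the classifier.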