JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2023, Vol. 58 ›› Issue (12): 41-51. doi: 10.6040/j.issn.1671-9352.4.2022.3492


Three-way over-sampling method for imbalanced data classification

Yu FANG*, Huyu ZHENG, Xuemei CAO

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Received: 2022-08-02  Online: 2023-12-20  Published: 2023-12-19
  • Contact: Yu FANG, E-mail: fangyu@swpu.edu.cn

Abstract:

This paper proposes a new sampling method that combines three-way decisions with SMOTE, referred to as three-way over-sampling (3WOS). 3WOS constructs a three-way decision model over all samples and selects the samples in the model's boundary region as key samples for SMOTE over-sampling. This alleviates the problems of sample aggregation and separation and improves classifier performance to some extent. First, the method divides all samples into three parts according to three-way decisions and support vector data description (SVDD). Second, it finds the k nearest minority-class neighbors of each key sample and uses linear interpolation to synthesize new minority samples for each key sample. Finally, the updated sample set is used to train the classifier. Experimental results show that, compared with other methods, 3WOS achieves better classification accuracy, F-measure, and G-mean, and a lower misclassification cost, on the base classifiers.
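The three steps in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation: the abstract does not give the SVDD formulation or the decision thresholds, so a plain Euclidean ball around the minority centroid stands in for the SVDD hypersphere, the split is applied to minority samples only, and `three_way_split`, `oversample_key_samples`, `r_inner`, `r_outer`, and `n_per_key` are hypothetical names and parameters. Only the interpolation step follows the standard SMOTE rule.

```python
import math
import random


def three_way_split(samples, center, r_inner, r_outer):
    """Split samples into (positive, boundary, negative) regions by distance.

    Assumption: a Euclidean ball stands in for the SVDD hypersphere, and
    r_inner < r_outer are hypothetical thresholds, not the paper's.
    """
    pos, bnd, neg = [], [], []
    for x in samples:
        d = math.dist(x, center)
        if d <= r_inner:
            pos.append(x)
        elif d <= r_outer:
            bnd.append(x)
        else:
            neg.append(x)
    return pos, bnd, neg


def k_nearest(x, pool, k):
    """Return the k points of pool closest to x, excluding x itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def oversample_key_samples(key_samples, minority, k, n_per_key, rng):
    """SMOTE-style interpolation: x_new = x + gap * (neighbor - x), gap in [0, 1)."""
    synthetic = []
    for x in key_samples:
        neighbors = k_nearest(x, minority, k)
        if not neighbors:
            continue
        for _ in range(n_per_key):
            nb = rng.choice(neighbors)
            gap = rng.random()
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class; the boundary samples act as the key samples.
minority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
center = tuple(sum(c) / len(minority) for c in zip(*minority))
pos, bnd, neg = three_way_split(minority, center, r_inner=0.5, r_outer=1.6)
new_points = oversample_key_samples(bnd, minority, k=2, n_per_key=3,
                                    rng=random.Random(0))
```

In this toy setting only the two boundary samples are interpolated, so synthetic points concentrate near the region where the class description is least certain rather than being spread over every minority sample as in plain SMOTE.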

Key words: imbalanced data, three-way decision, support vector data description, SMOTE, classification

CLC Number: 

  • TP181

Fig.1

Overall framework of the 3WS model

Fig.2

Overall framework of the 3WOS model

Table 1

Summary of the datasets

| Dataset | Number of attributes | Number of samples | Imbalance ratio |
|---|---|---|---|
| Optical Digits (OD) | 64 | 5620 | 9.1:1 |
| Satimage (Sat) | 36 | 6435 | 9.3:1 |
| Pen Digits (PD) | 16 | 10992 | 9.4:1 |
| Sick Euthyroid (SE) | 25 | 3163 | 9.8:1 |
| Spectrometer (Spe) | 93 | 531 | 11:1 |
| Scene (Sce) | 294 | 2407 | 13:1 |
| Arrhythmia (Arr) | 279 | 452 | 17:1 |
| Wine Quality (WQ) | 11 | 4898 | 26:1 |
| Ozone Level (OL) | 72 | 2536 | 34:1 |
| Mammography (Mam) | 6 | 11183 | 42:1 |
| Abalone19 (A19) | 8 | 4177 | 130:1 |

Table 2

Confusion matrix

| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
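The evaluation measures named in the abstract can be computed directly from these four counts. The sketch below uses the standard textbook definitions (F-measure with equal weight on precision and recall, G-mean as the geometric mean of the true positive and true negative rates); this page does not reproduce the paper's formulas, so the exact variant used there is an assumption, and `classification_metrics` is a hypothetical helper name.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, F-measure and G-mean from confusion-matrix counts."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f_measure": f_measure, "g_mean": g_mean}


# Made-up counts for an imbalanced problem (50 positives, 150 negatives).
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=130)
```

G-mean drops to zero when either class is entirely misclassified, which is why it is commonly reported alongside accuracy for imbalanced data.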

Table 3

Cost matrix

| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | C(+, +) | C(-, +) |
| Negative | C(+, -) | C(-, -) |
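One common way to turn such a cost matrix into a single number is to weight each confusion-matrix cell by its cost and sum. The sketch below assumes that C(i, j) denotes the cost of predicting class i when the actual class is j, which is inferred from the table layout, and the cost values are made up for illustration; the paper's actual costs are not given on this page.

```python
def total_misclassification_cost(tp, fn, fp, tn, cost):
    """Sum cell counts weighted by cost[(predicted, actual)]."""
    return (tp * cost[("+", "+")] + fn * cost[("-", "+")]
            + fp * cost[("+", "-")] + tn * cost[("-", "-")])


# Hypothetical costs: missing a positive is 10x worse than a false alarm.
cost = {("+", "+"): 0.0, ("-", "+"): 10.0, ("+", "-"): 1.0, ("-", "-"): 0.0}
total = total_misclassification_cost(tp=40, fn=10, fp=20, tn=130, cost=cost)
```

With these made-up values, the 10 false negatives contribute 100 and the 20 false positives contribute 20.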

Fig.3

Comparison of classification performance with different parameters on the Arrhythmia dataset

Fig.4

Comparison of classification performance with different parameters on the Satimage dataset

Table 4

Accuracy comparison of different algorithms (unit: %)

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9707 | 0.8608 | 0.9478 | 0.9761 | 0.8694 | 0.9544 | 0.9715 | 0.8501 | 0.9627 | 0.9767 | 0.8745 | 0.9615 |
| Sat | 0.8369 | 0.7071 | 0.7298 | 0.8407 | 0.7114 | 0.7337 | 0.8359 | 0.7158 | 0.7505 | 0.8357 | 0.7495 | 0.8388 |
| PD | 0.9845 | 0.6784 | 0.9293 | 0.9874 | 0.6806 | 0.9332 | 0.9812 | 0.7092 | 0.9013 | 0.9881 | 0.9571 | 0.9757 |
| SE | 0.7932 | 0.6687 | 0.8907 | 0.7858 | 0.6968 | 0.9110 | 0.7621 | 0.6946 | 0.9012 | 0.8928 | 0.8013 | 0.9436 |
| Spe | 0.8787 | 0.8084 | 0.8897 | 0.8945 | 0.7209 | 0.9053 | 0.8778 | 0.8083 | 0.8988 | 0.9041 | 0.8376 | 0.9194 |
| Sce | 0.7536 | 0.6795 | 0.8301 | 0.7822 | 0.6815 | 0.8424 | 0.7970 | 0.6991 | 0.8528 | 0.7986 | 0.7021 | 0.8583 |
| Arr | 0.6381 | 0.7056 | 0.9415 | 0.6757 | 0.7328 | 0.9447 | 0.6583 | 0.7596 | 0.9480 | 0.7096 | 0.7920 | 0.9639 |
| WQ | 0.7334 | 0.7109 | 0.7181 | 0.7442 | 0.7237 | 0.7333 | 0.7415 | 0.7454 | 0.7344 | 0.7508 | 0.7392 | 0.7392 |
| OL | 0.8496 | 0.7635 | 0.8935 | 0.8706 | 0.8187 | 0.9073 | 0.9309 | 0.8678 | 0.9492 | 0.8814 | 0.8108 | 0.9117 |
| Mam | 0.8926 | 0.8001 | 0.8760 | 0.9011 | 0.8053 | 0.8831 | 0.9125 | 0.7135 | 0.8978 | 0.9156 | 0.8159 | 0.8991 |
| A19 | 0.7681 | 0.6236 | 0.8029 | 0.7812 | 0.6286 | 0.8092 | 0.9261 | 0.8108 | 0.9418 | 0.7763 | 0.6322 | 0.8207 |

Table 5

G-mean comparison of different algorithms

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9707 | 0.8600 | 0.9294 | 0.9762 | 0.8687 | 0.9332 | 0.9712 | 0.8496 | 0.8992 | 0.9766 | 0.8736 | 0.9757 |
| Sat | 0.8287 | 0.6709 | 0.7114 | 0.8338 | 0.6732 | 0.7168 | 0.8195 | 0.6819 | 0.7384 | 0.8251 | 0.7093 | 0.8303 |
| PD | 0.9844 | 0.6739 | 0.9294 | 0.9874 | 0.6747 | 0.9332 | 0.9810 | 0.7086 | 0.8992 | 0.9881 | 0.9571 | 0.9757 |
| SE | 0.7813 | 0.6243 | 0.8896 | 0.7706 | 0.6550 | 0.9104 | 0.7370 | 0.6492 | 0.8993 | 0.8912 | 0.7923 | 0.9438 |
| Spe | 0.8798 | 0.8051 | 0.8879 | 0.8918 | 0.6997 | 0.9041 | 0.8765 | 0.7878 | 0.8944 | 0.8993 | 0.8287 | 0.9191 |
| Sce | 0.7450 | 0.6795 | 0.8290 | 0.7755 | 0.6898 | 0.8406 | 0.7960 | 0.6974 | 0.8510 | 0.7951 | 0.6977 | 0.8566 |
| Arr | 0.6277 | 0.6710 | 0.9421 | 0.6450 | 0.7081 | 0.9441 | 0.6273 | 0.7306 | 0.9471 | 0.6672 | 0.7631 | 0.9629 |
| WQ | 0.7290 | 0.7053 | 0.7171 | 0.7415 | 0.7193 | 0.7325 | 0.7348 | 0.7415 | 0.7324 | 0.7478 | 0.7382 | 0.7388 |
| OL | 0.8482 | 0.7590 | 0.8930 | 0.8681 | 0.8127 | 0.9067 | 0.9293 | 0.8586 | 0.9482 | 0.8801 | 0.8036 | 0.9105 |
| Mam | 0.8916 | 0.7986 | 0.8758 | 0.9006 | 0.8036 | 0.8829 | 0.9122 | 0.6945 | 0.8970 | 0.9153 | 0.8136 | 0.8990 |
| A19 | 0.7584 | 0.5451 | 0.8026 | 0.7619 | 0.5445 | 0.8072 | 0.9240 | 0.7898 | 0.9405 | 0.7543 | 0.5571 | 0.8183 |

Table 6

F-measure comparison of different algorithms

| Dataset | ROS/SVM | ROS/NB | ROS/LR | SMOTE/SVM | SMOTE/NB | SMOTE/LR | BSMO/SVM | BSMO/NB | BSMO/LR | 3WOS/SVM | 3WOS/NB | 3WOS/LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OD | 0.9711 | 0.8659 | 0.9478 | 0.9760 | 0.8739 | 0.9548 | 0.9712 | 0.8542 | 0.9635 | 0.9771 | 0.8793 | 0.9619 |
| Sat | 0.8539 | 0.7606 | 0.7677 | 0.8562 | 0.7654 | 0.7698 | 0.8591 | 0.7667 | 0.7798 | 0.8549 | 0.7983 | 0.8560 |
| PD | 0.9848 | 0.6508 | 0.9298 | 0.9875 | 0.6747 | 0.9331 | 0.9817 | 0.7174 | 0.9070 | 0.9883 | 0.9572 | 0.9755 |
| SE | 0.8182 | 0.7328 | 0.8953 | 0.8144 | 0.7549 | 0.9140 | 0.8008 | 0.7550 | 0.9068 | 0.8858 | 0.8226 | 0.9447 |
| Spe | 0.8794 | 0.8249 | 0.8960 | 0.8945 | 0.7636 | 0.9060 | 0.8778 | 0.8379 | 0.9070 | 0.9041 | 0.8553 | 0.9224 |
| Sce | 0.7787 | 0.6814 | 0.8369 | 0.8026 | 0.7121 | 0.8505 | 0.8049 | 0.6975 | 0.8606 | 0.8130 | 0.7237 | 0.8655 |
| Arr | 0.6701 | 0.7583 | 0.9414 | 0.7288 | 0.7750 | 0.9460 | 0.7095 | 0.8010 | 0.9499 | 0.7662 | 0.8284 | 0.9655 |
| WQ | 0.7101 | 0.6828 | 0.7069 | 0.7267 | 0.6999 | 0.7237 | 0.7127 | 0.7244 | 0.7193 | 0.7331 | 0.7314 | 0.7321 |
| OL | 0.8569 | 0.7815 | 0.8968 | 0.8786 | 0.8348 | 0.9103 | 0.9345 | 0.8827 | 0.9513 | 0.8869 | 0.8291 | 0.9158 |
| Mam | 0.8878 | 0.8060 | 0.8741 | 0.8980 | 0.8139 | 0.8806 | 0.9141 | 0.7531 | 0.9016 | 0.9138 | 0.8249 | 0.8980 |
| A19 | 0.7921 | 0.7111 | 0.8073 | 0.8134 | 0.7173 | 0.8197 | 0.9305 | 0.8401 | 0.9447 | 0.8110 | 0.7177 | 0.8314 |

Fig.5

Comparison of misclassification cost for different algorithms
