
Journal of Shandong University (Natural Science) ›› 2024, Vol. 59 ›› Issue (3): 107-117. doi: 10.6040/j.issn.1671-9352.2.2023.027


• About the first author: WEN Xin (1994—), female, PhD candidate; research interest: multi-label learning. E-mail: 1368661957@qq.com
• Supported by: National Natural Science Foundation of China (62072294)

The ML-KNN method based on attribute weighting

Xin WEN1, Deyu LI1,2,*

1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China
    2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, Shanxi, China
  • Received:2023-05-29 Online:2024-03-20 Published:2024-03-06
  • Contact: Deyu LI E-mail:1368661957@qq.com;lidysxu@163.com


Abstract:

An ML-KNN method based on attribute weighting is proposed. First, samples in the non-positive region of each label's decision class are identified with a variable-precision neighborhood rough set model, and heterogeneous sample pairs are constructed from them. Then, the significance of each attribute for classification is evaluated by its ability to discern these heterogeneous sample pairs. Finally, weighted distances between samples are computed to obtain each sample's nearest-neighbor distribution, and multi-label classification is performed by maximizing the posterior probability. Experimental results on ten public multi-label data sets verify the effectiveness of the proposed method.
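The three steps above (weighted distance → nearest-neighbor distribution → per-label MAP decision) can be sketched in code. The snippet below is an illustrative reconstruction, not the authors' implementation: the attribute weights are taken as given (in the paper they come from the discernibility of heterogeneous sample pairs), and the full ML-KNN frequency-array likelihood estimation is replaced by a Laplace-smoothed neighbor-count stand-in.

```python
import numpy as np

def weighted_knn_indices(X, x, weights, k):
    """Indices of the k nearest training samples under a weighted Euclidean distance."""
    d = np.sqrt(((X - x) ** 2 * weights).sum(axis=1))
    return np.argsort(d)[:k]

def ml_knn_predict(X, Y, x, weights, k=3, s=1.0):
    """Predict a binary label vector for x by MAP over neighbor label counts.

    X: (n, m) training samples; Y: (n, q) binary label matrix; s: smoothing.
    """
    n, q = Y.shape
    # Laplace-smoothed prior P(H1) for each label, as in ML-KNN.
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)
    nn = weighted_knn_indices(X, x, weights, k)
    c = Y[nn].sum(axis=0)  # number of neighbors carrying each label
    # Stand-in likelihoods: full ML-KNN estimates P(c|H1), P(c|H0) from the
    # whole training set; here smoothed neighbor fractions are used instead.
    like1 = (s + c) / (s * (k + 1) + k)
    like0 = (s + k - c) / (s * (k + 1) + k)
    return (prior1 * like1 > (1 - prior1) * like0).astype(int)
```

With uniform weights this reduces to plain nearest-neighbor voting; informative weights let discriminative attributes dominate the distance, which is the point of the attribute-weighting step.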

Key words: multi-label classification, attribute significance, neighborhood rough set, uncertainty of classification, heterogeneous sample pair

CLC number: TP391

Table 1

Description of the data sets

Number  Data set         Samples  Attributes  Labels  Domain
1       GpositivePseAAC  519      440         4       biology
2       Emotions         593      72          6       music
3       Medical          978      1449        45      text
4       Water-quality    1060     16          14      chemistry
5       Image            2000     294         5       image
6       Scene            2407     294         6       image
7       Yeast            2417     103         14      biology
8       Business         5000     438         30      text
9       Yelp             10810    671         5       text
10      Mediamill        43907    120         101     video

Table 2

Value range of the neighborhood parameter δ

Data set δ
GpositivePseAAC 4.00~4.35
Emotions 1.30~1.55
Medical 2.80~3.05
Water-quality 1.50~1.75
Image 4.30~4.60
Scene 2.75~3.00
Yeast 1.25~1.50
Business 1.70~1.95
Yelp 6.00~6.25
Mediamill 2.00~2.35
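The δ values above set the neighborhood radius used by the rough set construction. Below is a minimal sketch of a δ-neighborhood and a variable-precision positive region, assuming a Euclidean metric, a single decision vector `y`, and a consistency threshold `beta` (all illustrative choices; the paper's exact definitions may differ). Samples falling outside the positive region are the ones used to build heterogeneous pairs.

```python
import numpy as np

def delta_neighborhood(X, i, delta):
    """Indices of samples within Euclidean distance delta of sample i."""
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    return np.where(d <= delta)[0]

def vp_positive_region(X, y, delta, beta=0.8):
    """Variable-precision positive region at consistency level beta.

    A sample belongs to the positive region if at least a fraction beta of
    its delta-neighbors share its class; the remaining samples are the
    'non-positive-region' samples.
    """
    pos = []
    for i in range(len(X)):
        nbr = delta_neighborhood(X, i, delta)
        if (y[nbr] == y[i]).mean() >= beta:
            pos.append(i)
    return np.array(pos, dtype=int)
```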

Figure 1

Classification performance of NRS_MLKNN with different parameter values on the GpositivePseAAC data set

Table 3

Classification performance of the seven algorithms on the GpositivePseAAC data set

Method       HL↓            RL↓            OE↓            CV↓            AP↑
MLRS         0.1638±0.0303  0.1606±0.0241  0.3237±0.0467  0.4875±0.0631  0.8137±0.0261
LPLC         0.1646±0.0242  0.1620±0.0367  0.2852±0.0594  0.4644±0.1071  0.8248±0.0365
ML-KNN       0.1551±0.0267  0.1572±0.0296  0.3102±0.0596  0.4799±0.0844  0.8195±0.0336
Stacked_KNN  0.1483±0.0352  0.1591±0.0379  0.3140±0.0613  0.4875±0.1136  0.8175±0.0378
LAMLKNN      0.1541±0.0290  0.1493±0.0257  0.2929±0.0557  0.4547±0.0737  0.8295±0.0289
ML_RKNN      0.2481±0.0285  0.5833±0.0772  0.2333±0.0441  0.9772±0.1476  0.6757±0.0454
NRS_MLKNN    0.1474±0.0294  0.1469±0.0302  0.2890±0.0644  0.4490±0.0874  0.8316±0.0353
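In Tables 3-12, HL, RL, OE, CV and AP abbreviate Hamming loss, ranking loss, one-error, coverage and average precision; ↓ marks metrics where smaller is better, ↑ where larger is better. Two of these standard metrics can be sketched as follows (textbook definitions from the multi-label learning literature, not code from the paper):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified sample-label pairs (lower is better)."""
    return float((Y_true != Y_pred).mean())

def average_precision(Y_true, scores):
    """Mean over samples of the precision at each relevant label's rank (higher is better)."""
    ap = []
    for y, s in zip(Y_true, scores):
        order = np.argsort(-s)                  # labels by decreasing score
        rank = np.empty(len(s), dtype=int)
        rank[order] = np.arange(1, len(s) + 1)  # 1-based rank of each label
        rel = np.flatnonzero(y == 1)
        if len(rel) == 0:
            continue
        # For each relevant label: relevant labels ranked at or above it / its rank.
        prec = [(rank[rel] <= rank[l]).sum() / rank[l] for l in rel]
        ap.append(np.mean(prec))
    return float(np.mean(ap))
```

A perfect ranking (all relevant labels scored above all irrelevant ones) gives an average precision of 1.0, matching the upper bound of the AP columns.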

Table 4

Classification performance of the seven algorithms on the Emotions data set

Method       HL↓            RL↓            OE↓            CV↓            AP↑
MLRS         0.1930±0.0160  0.1696±0.0239  0.2631±0.0557  1.8042±0.1467  0.8014±0.0274
LPLC         0.2025±0.0226  0.1595±0.0261  0.2731±0.0408  1.7684±0.1607  0.8023±0.0235
ML-KNN       0.1925±0.0172  0.1621±0.0173  0.2666±0.0305  1.7975±0.0885  0.7996±0.0155
Stacked_KNN  0.1986±0.0241  0.1727±0.0268  0.2682±0.0545  1.8482±0.1554  0.7935±0.0319
LAMLKNN      0.1950±0.0148  0.1595±0.0233  0.2832±0.0574  1.7620±0.1342  0.8003±0.0265
ML_RKNN      0.3232±0.0324  0.3399±0.0594  0.3794±0.0526  2.6643±0.3346  0.6865±0.0404
NRS_MLKNN    0.1950±0.0121  0.1592±0.0123  0.2632±0.0363  1.7773±0.0815  0.8035±0.0147

Table 5

Classification performance of the seven algorithms on the Medical data set

Method       HL↓            RL↓            OE↓            CV↓             AP↑
MLRS         0.0187±0.0023  0.1043±0.0247  0.3385±0.0441  3.6157±1.0781   0.7456±0.0339
LPLC         0.0188±0.0020  0.0795±0.0159  0.2833±0.0367  4.4015±1.0922   0.7578±0.0378
ML-KNN       0.0156±0.0021  0.0420±0.0114  0.2496±0.0417  2.7451±0.8187   0.8083±0.0305
Stacked_KNN  0.0150±0.0021  0.0576±0.0155  0.2485±0.0373  3.5514±1.0529   0.7910±0.0272
LAMLKNN      0.0159±0.0021  0.0374±0.0105  0.2445±0.0422  2.2252±0.6975   0.8165±0.0294
ML_RKNN      0.0522±0.0067  0.4310±0.0483  0.2730±0.0333  13.5643±2.0040  0.5224±0.0339
NRS_MLKNN    0.0140±0.0022  0.0424±0.0112  0.2209±0.0320  2.7955±0.7872   0.8204±0.0240

Table 6

Classification performance of the seven algorithms on the Water-quality data set

Method       HL↓            RL↓            OE↓            CV↓             AP↑
MLRS         0.3408±0.0098  0.2978±0.0161  0.3369±0.0504  9.1745±0.2060   0.6457±0.0226
LPLC         0.3163±0.0091  0.2634±0.0158  0.2846±0.0424  8.8896±0.2201   0.6845±0.0192
ML-KNN       0.2920±0.0112  0.2594±0.0135  0.2932±0.0524  8.7764±0.2412   0.6898±0.0202
Stacked_KNN  0.2971±0.0093  0.2667±0.0167  0.3197±0.0476  8.8377±0.1937   0.6775±0.0201
LAMLKNN      0.2947±0.0089  0.2618±0.0140  0.2790±0.0337  8.8538±0.2787   0.6883±0.0189
ML_RKNN      0.4044±0.0205  0.3853±0.0172  0.4232±0.0414  10.3179±0.2412  0.5900±0.0182
NRS_MLKNN    0.2904±0.0097  0.2597±0.0164  0.2799±0.0462  8.7764±0.2569   0.6915±0.0225

Table 7

Classification performance of the seven algorithms on the Image data set

Method       HL↓            RL↓            OE↓            CV↓            AP↑
MLRS         0.1754±0.0147  0.1868±0.0196  0.3335±0.0341  0.9780±0.1038  0.7862±0.0204
LPLC         0.1784±0.0142  0.1968±0.0207  0.3300±0.0287  0.9995±0.0993  0.7808±0.0171
ML-KNN       0.1701±0.0141  0.1765±0.0202  0.3195±0.0332  0.9780±0.1034  0.7900±0.0203
Stacked_KNN  0.1765±0.0162  0.1880±0.0232  0.3330±0.0300  1.0180±0.1157  0.7806±0.0221
LAMLKNN      0.1708±0.0153  0.1772±0.0204  0.3210±0.0323  0.9830±0.1128  0.7885±0.0208
ML_RKNN      0.2871±0.0139  0.3174±0.0259  0.3780±0.0303  1.3465±0.0964  0.7167±0.0203
NRS_MLKNN    0.1717±0.0157  0.1747±0.0216  0.3200±0.0361  0.9685±0.1121  0.7915±0.0219

Table 8

Classification performance of the seven algorithms on the Scene data set

Method       HL↓            RL↓            OE↓            CV↓            AP↑
MLRS         0.0923±0.0061  0.0992±0.0115  0.2530±0.0171  0.5393±0.0644  0.8486±0.0117
LPLC         0.0965±0.0065  0.0908±0.0106  0.2505±0.0218  0.5198±0.0626  0.8470±0.0131
ML-KNN       0.0852±0.0082  0.0768±0.0091  0.2260±0.0159  0.4707±0.0593  0.8665±0.0099
Stacked_KNN  0.0879±0.0055  0.0853±0.0086  0.2322±0.0138  0.5156±0.0568  0.8591±0.0086
LAMLKNN      0.0855±0.0067  0.0740±0.0087  0.2252±0.0118  0.4558±0.0526  0.8678±0.0084
ML_RKNN      0.1649±0.0089  0.2547±0.0313  0.2862±0.0304  1.1088±0.1082  0.7611±0.0209
NRS_MLKNN    0.0847±0.0064  0.0754±0.0094  0.2202±0.0154  0.4637±0.0618  0.8695±0.0104

Table 9

Classification performance of the seven algorithms on the Yeast data set

Method       HL↓            RL↓            OE↓            CV↓            AP↑
MLRS         0.2044±0.0090  0.1815±0.0090  0.2400±0.0206  6.3916±0.2132  0.7482±0.0137
LPLC         0.2040±0.0125  0.1689±0.0097  0.2292±0.0295  6.3116±0.1871  0.7624±0.0184
ML-KNN       0.1927±0.0066  0.1643±0.0087  0.2305±0.0260  6.2024±0.1689  0.7658±0.0137
Stacked_KNN  0.1985±0.0099  0.1793±0.0092  0.2549±0.0303  6.5090±0.1257  0.7491±0.0171
LAMLKNN      0.1938±0.0070  0.1651±0.0084  0.2259±0.0221  6.2227±0.1532  0.7651±0.0131
ML_RKNN      0.3759±0.0182  0.3813±0.0211  0.4674±0.0347  9.0802±0.2188  0.5751±0.0211
NRS_MLKNN    0.1928±0.0066  0.1634±0.0083  0.2276±0.0218  6.1962±0.1645  0.7677±0.0133

Table 10

Classification performance of the seven algorithms on the Business data set

Method       HL↓            RL↓            OE↓            CV↓             AP↑
MLRS         0.0277±0.0017  0.1177±0.0136  0.1256±0.0196  4.6306±0.4437   0.8544±0.0162
LPLC         0.0267±0.0019  0.0612±0.0047  0.1246±0.0210  3.3930±0.2359   0.8626±0.0159
ML-KNN       0.0269±0.0017  0.0400±0.0046  0.1194±0.0171  2.2552±0.1714   0.8791±0.0131
Stacked_KNN  0.0261±0.0013  0.0387±0.0032  0.1096±0.0134  2.2478±0.1480   0.8834±0.0095
LAMLKNN      0.0268±0.0018  0.0401±0.0046  0.1192±0.0193  2.2654±0.1717   0.8793±0.0136
ML_RKNN      0.1109±0.0038  0.3978±0.0238  0.4854±0.0299  13.5964±0.8459  0.4683±0.0148
NRS_MLKNN    0.0266±0.0017  0.0389±0.0036  0.1150±0.0173  2.2170±0.1381   0.8816±0.0120

Table 11

Classification performance of the seven algorithms on the Yelp data set

Method       HL↓     RL↓     OE↓     CV↓     AP↑
MLRS         0.2264  0.3701  0.5401  0.8143  0.6448
LPLC         0.2314  0.3317  0.4758  0.8306  0.6609
ML-KNN       0.1798  0.2821  0.5159  0.7077  0.6721
Stacked_KNN  0.2345  0.3353  0.5076  0.8903  0.6526
LAMLKNN      0.1804  0.2668  0.5007  0.6673  0.6835
ML_RKNN      0.1748  0.9366  0.0490  0.9541  0.5956
NRS_MLKNN    0.1774  0.2765  0.4993  0.6939  0.6818

Table 12

Classification performance of the seven algorithms on the Mediamill data set

Method       HL↓     RL↓     OE↓     CV↓      AP↑
MLRS         0.0328  0.1567  0.1684  28.6587  0.6767
LPLC         0.0358  0.0913  0.1503  28.7615  0.6820
ML-KNN       0.0315  0.0550  0.1473  18.6456  0.7034
Stacked_KNN  0.0350  0.0650  0.1637  20.6667  0.6776
LAMLKNN      0.0316  0.0533  0.1480  17.9071  0.7032
ML_RKNN      0.0441  0.7008  0.0653  57.8221  0.3054
NRS_MLKNN    0.0314  0.0550  0.1467  18.6420  0.7035

Figure 2

Average ranks of the different methods on five evaluation metrics across the ten data sets
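An average-rank summary like Figure 2 is typically produced by ranking the methods on each data set (rank 1 = best) and averaging the ranks across data sets. The sketch below assumes a score matrix with one row per data set and no ties (illustrative only; the paper's exact ranking procedure is not shown here).

```python
import numpy as np

def average_ranks(scores, higher_is_better=True):
    """scores: (n_datasets, n_methods). Returns the mean rank per method (1 = best)."""
    s = scores if higher_is_better else -scores
    order = np.argsort(-s, axis=1)          # per data set: methods from best to worst
    ranks = np.empty_like(order)
    rows = np.arange(s.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, s.shape[1] + 1)  # assign ranks 1..n_methods
    return ranks.mean(axis=0)
```

For metrics marked ↓ (e.g. Hamming loss), pass `higher_is_better=False` so that smaller values receive better ranks.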
