
Journal of Shandong University (Natural Science), 2024, Vol. 59, Issue (3): 61-70. doi: 10.6040/j.issn.1671-9352.7.2023.1073


Hierarchical feature selection algorithm based on instance correlations

Chunyu SHI 1,2, Yu MAO 1,2,*, Haoyang LIU 1,2, Yaojin LIN 1,2

  1. School of Computer Science, Minnan Normal University, Zhangzhou 363000, Fujian, China
  2. Key Laboratory of Data Science and Intelligence Application (Minnan Normal University), Zhangzhou 363000, Fujian, China
  • Received: 2023-04-29  Online: 2024-03-20  Published: 2024-03-06
  • Contact: Yu MAO  E-mail: shichunyuuu@163.com; maoyu_bit@163.com
  • Author biography: SHI Chunyu (1997- ), female, master's degree candidate; her research interest is data mining. E-mail: shichunyuuu@163.com
  • Funding: National Natural Science Foundation of China (62076116); Natural Science Foundation of Fujian Province (2022J01914)



Abstract:

A hierarchical feature selection algorithm based on instance correlations (HFSIC) is proposed to further improve the performance of feature selection for hierarchical classification. After sparse regularization terms are used to remove irrelevant features, the parent-child relationships in the class hierarchy are combined with the reconstruction relationships among samples in the feature space, the correlations among samples of the categories under the same subtree are learned, and recursive regularization is applied to optimize the output feature weight matrix. When measuring sample correlations, the reconstruction coefficient matrix is integrated into the training model, and the ℓ2,1-norm is used to remove irrelevant and redundant features. The optimization problem of the proposed model is solved with the accelerated proximal gradient method, and the algorithm is evaluated under multiple evaluation metrics. Experimental results show that the proposed method outperforms the compared algorithms on five datasets, verifying its effectiveness.
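The abstract does not reproduce the objective function. As a rough sketch only (the symbols X, Y, W, E, λ, α are illustrative and not taken from the paper), a recursive-regularization model of the kind described here typically minimizes a least-squares loss plus a parent-child smoothness term and a row-sparsity term:

    \min_{W}\ \|XW - Y\|_F^2
      \;+\; \lambda \sum_{(p,\,c)\in E} \|w_p - w_c\|_2^2
      \;+\; \alpha \|W\|_{2,1},
    \qquad
    \|W\|_{2,1} \;=\; \sum_{i=1}^{d} \Bigl(\sum_{j} W_{ij}^2\Bigr)^{1/2}

Here E is the set of parent-child class pairs in the hierarchy and w_c is the weight subvector of W for class c. The ℓ2,1 term drives entire rows of W to zero, which is what removes irrelevant and redundant features; HFSIC additionally integrates a sample-reconstruction coefficient matrix into the loss, which this sketch omits.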

Key words: feature selection, hierarchical structure, instance correlation, recursive regularization
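Where the abstract mentions the accelerated proximal gradient method with an ℓ2,1 penalty, the following minimal Python sketch shows that machinery on a plain least-squares loss (FISTA-style acceleration plus the row-wise ℓ2,1 proximal operator). It is an illustration under stated assumptions, not the paper's full HFSIC model: the hierarchy and reconstruction terms are omitted, and all names and parameter values are hypothetical.

    import numpy as np

    def prox_l21(W, t):
        # Proximal operator of t * ||W||_{2,1}: row-wise soft-thresholding.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * W

    def apg_l21(X, Y, alpha, n_iter=200):
        # Minimize 0.5 * ||XW - Y||_F^2 + alpha * ||W||_{2,1} by accelerated
        # proximal gradient (FISTA).
        d, k = X.shape[1], Y.shape[1]
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
        W = np.zeros((d, k)); Z = W.copy(); t = 1.0
        for _ in range(n_iter):
            grad = X.T @ (X @ Z - Y)           # gradient of the smooth part at Z
            W_new = prox_l21(Z - grad / L, alpha / L)
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            Z = W_new + ((t - 1.0) / t_new) * (W_new - W)  # Nesterov extrapolation
            W, t = W_new, t_new
        return W

    # Toy check: only the first 3 of 20 features generate Y, so their rows of W
    # should keep large norms while the rest shrink toward zero.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 20))
    Y = X[:, :3] @ rng.standard_normal((3, 4))
    W = apg_l21(X, Y, alpha=5.0)
    print(np.linalg.norm(W, axis=1).round(2))

Row norms of the learned W then act as feature scores: rows driven to (near-)zero mark features to discard.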

CLC number: TP391

Table 1  Description of the datasets

No.  Dataset   #Training  #Test   #Features  #Nodes  #Leaf nodes  #Levels
1    DD         3 020       605     473       32      27          3
2    F194       7 105     1 420     473      202     194          3
3    VOC        7 178     5 105   1 000       30      20          5
4    CLEF       8 368       939      80       88      63          4
5    ILSVRC65  12 346    11 845   4 096       65      57          4

Table 2  TIE results (↓) of the feature selection algorithms on each dataset

Dataset    HierFSNM    HiermRMR    Hier-FS     HiRRfam-FS  HFSDK       HFSIC
F194       0.212 3(6)  0.180 0(5)  0.174 6(3)  0.173 0(2)  0.175 2(4)  0.166 0(1)
DD         0.088 6(5)  0.091 9(6)  0.085 0(3)  0.083 6(1)  0.086 3(4)  0.083 9(2)
ILSVRC65   0.035 0(5)  0.033 5(6)  0.032 8(2)  0.032 8(2)  0.032 9(4)  0.032 6(1)
VOC        0.214 4(5)  0.218 8(6)  0.214 3(3)  0.214 3(3)  0.212 6(2)  0.208 7(1)
CLEF       0.207 7(6)  0.182 5(3)  0.182 6(4)  0.182 6(4)  0.174 5(2)  0.173 5(1)
Avg. rank  5.4         5.2         3.0         2.4         3.2         1.2

Note: lower TIE is better; per-dataset ranks are in parentheses, and the last row averages them (e.g., HFSIC: (1+2+1+1+1)/5 = 1.2).

Table 3  Hierarchical F1 (F_H) results (↑) of the feature selection algorithms on each dataset

Dataset    HierFSNM    HiermRMR    Hier-FS     HiRRfam-FS  HFSDK       HFSIC
F194       0.646 2(6)  0.700 0(5)  0.708 9(3)  0.711 2(2)  0.707 5(4)  0.712 7(1)
DD         0.852 4(5)  0.846 8(6)  0.858 4(3)  0.860 6(1)  0.859 0(2)  0.858 4(3)
ILSVRC65   0.956 3(6)  0.958 1(5)  0.959 1(2)  0.958 8(4)  0.958 9(3)  0.959 2(1)
VOC        0.673 9(5)  0.666 9(6)  0.675 4(3)  0.675 8(3)  0.677 2(2)  0.682 0(1)
CLEF       0.739 6(6)  0.763 5(3)  0.763 1(4)  0.762 3(5)  0.774 2(2)  0.775 5(1)
Avg. rank  5.6         5.0         3.0         3.0         2.6         1.4

Note: higher F_H is better; per-dataset ranks are in parentheses.

Fig. 1  Performance comparison between HFSIC and the other algorithms via the Bonferroni-Dunn test
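For context on Fig. 1: under the standard two-tailed Bonferroni-Dunn procedure (Demšar, 2006), two algorithms differ significantly when their average ranks (last rows of Tables 2 and 3) differ by more than the critical difference

    \mathrm{CD} \;=\; q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}

With k = 6 algorithms and N = 5 datasets, and the usual q_{0.05} ≈ 2.576, this gives CD ≈ 3.05. These figures are an assumed illustration of how such a diagram is read, not values reported by the paper.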

Fig. 2  Ablation experiment results on the F194 and VOC datasets

Fig. 3  Parameter sensitivity analysis on the F194 dataset

Fig. 4  Parameter sensitivity analysis on the VOC dataset

Fig. 5  Convergence curves of the objective function value
