《山东大学学报(理学版)》(Journal of Shandong University, Natural Science) ›› 2024, Vol. 59 ›› Issue (12): 130-140. doi: 10.6040/j.issn.1671-9352.7.2023.5296
汪廷华,胡振威,占宏祥
WANG Tinghua, HU Zhenwei, ZHAN Hongxiang
Abstract: Most HSIC-based feature selection methods suffer from the following limitations. First, they typically apply only to labeled data, which is insufficient because most data in real-world applications are unlabeled. Second, existing HSIC-based unsupervised feature selection methods only address the overall dependence between the selected features and the output values that express the underlying cluster structure, while ignoring the redundancy among different features. To address these problems, a novel HSIC-based unsupervised feature selection method (UFSHSIC) is proposed, which uses HSIC as a dependence criterion to measure both the relevance between features and the overall sample structure and the redundancy among features. Experimental comparisons with other classical feature selection methods on several real-world datasets show that the proposed method can effectively select features from unlabeled samples, and that the selected feature subsets achieve comparable or better performance than those produced by supervised feature selection methods.
CLC number:
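The abstract's two core quantities can be illustrated with a small sketch: HSIC as a dependence measure between kernel matrices, and a greedy score that trades the relevance of a candidate feature to the overall sample structure against its redundancy with features already selected. This is not the paper's UFSHSIC implementation; the RBF kernel, the bandwidth, and the relevance-minus-mean-redundancy score are assumptions for illustration, using the standard biased empirical HSIC estimator (Gretton et al., 2005).

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2, with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def greedy_select(X, k, sigma=1.0):
    """Pick k features with high HSIC relevance to the overall sample
    structure and low mean HSIC redundancy with already-selected features.
    (Illustrative score only, not the paper's exact objective.)"""
    n, d = X.shape
    K_all = rbf_kernel(X, sigma)                       # overall sample structure
    K_f = [rbf_kernel(X[:, [j]], sigma) for j in range(d)]
    selected, remaining = [], list(range(d))
    for _ in range(k):
        def score(j):
            rel = hsic(K_f[j], K_all)                  # relevance to structure
            red = (np.mean([hsic(K_f[j], K_f[s]) for s in selected])
                   if selected else 0.0)               # redundancy with chosen set
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that no labels appear anywhere: the kernel over all features, `K_all`, plays the role of the "output" expressing the sample structure, which is what lets HSIC drive feature selection on unlabeled data.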