
Journal of Shandong University (Natural Science), 2024, Vol. 59, Issue 12: 130-140. doi: 10.6040/j.issn.1671-9352.7.2023.5296



A novel unsupervised feature selection method

WANG Tinghua, HU Zhenwei, ZHAN Hongxiang   

  1. School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, Jiangxi, China
  • Published: 2024-12-12
  • Supported by: National Natural Science Foundation of China (61966002); Jiangxi Province Degree and Postgraduate Education Teaching Reform Research Project (JXYJG-2022-172)


Abstract: Most HSIC-based feature selection methods are subject to the following limitations. First, these methods typically apply only to labeled data, which is restrictive because most data in real-world applications are unlabeled. Second, existing HSIC-based unsupervised feature selection methods address only the overall correlation between the selected features and the output values representing the underlying clustering structure, while ignoring the redundancy among different features. To address these issues, a new unsupervised feature selection method based on the Hilbert-Schmidt independence criterion (HSIC), named UFSHSIC, is proposed. It uses HSIC as a correlation criterion to measure both the correlation between features and the overall sample structure and the redundancy between features. Experimental comparisons with other classical feature selection methods on multiple real-world datasets show that the proposed method can effectively select features from unlabeled samples, and that the selected feature subsets achieve comparable or better performance than those produced by supervised feature selection methods.
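
As a concrete illustration of the idea in the abstract, the sketch below estimates HSIC empirically as tr(KHLH)/(n-1)^2, where H = I - (1/n)11^T is the centering matrix, and then greedily selects features whose kernels are highly dependent on a kernel over all features (the overall sample structure) while penalizing average dependence on already-selected features. This is a minimal sketch of one plausible reading only: the Gaussian kernel with the median-distance heuristic, the mRMR-style greedy search, and the names gaussian_kernel, hsic, and ufs_hsic are illustrative assumptions, not the paper's actual formulation.

import numpy as np

def gaussian_kernel(X, sigma=None):
    # Gaussian (RBF) kernel matrix; sigma defaults to the median pairwise
    # distance (median heuristic -- an assumption, not the paper's choice).
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * X @ X.T, 0.0)
    if sigma is None:
        nonzero = d2[d2 > 0]
        sigma = np.median(np.sqrt(nonzero)) if nonzero.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(K, L):
    # Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def ufs_hsic(X, k):
    # Hypothetical greedy selector: HSIC relevance to the full-sample structure
    # minus mean HSIC redundancy with features chosen so far (mRMR-style trade-off).
    n, d = X.shape
    K_all = gaussian_kernel(X)                                # kernel over all features
    K_feat = [gaussian_kernel(X[:, [j]]) for j in range(d)]   # one kernel per feature
    relevance = np.array([hsic(Kj, K_all) for Kj in K_feat])
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, d):
        scores = {j: relevance[j] - np.mean([hsic(K_feat[j], K_feat[s]) for s in selected])
                  for j in range(d) if j not in selected}
        selected.append(max(scores, key=scores.get))
    return selected

A call such as ufs_hsic(X, k=10) would return 10 feature indices; the equal weighting of relevance against mean redundancy here is an illustrative choice, not necessarily the weighting used in the paper.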

Key words: unsupervised feature selection, Hilbert-Schmidt independence criterion (HSIC), kernel method, machine learning, feature redundancy

CLC number: TP181