JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2024, Vol. 59, Issue (12): 130-140. doi: 10.6040/j.issn.1671-9352.7.2023.5296


A novel unsupervised feature selection method

WANG Tinghua, HU Zhenwei, ZHAN Hongxiang   

School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, Jiangxi, China
Published: 2024-12-12

Abstract: Most HSIC-based feature selection methods are subject to two limitations. First, they typically apply only to labeled data, which restricts their use in practice because most data in real-world applications is unlabeled. Second, existing HSIC-based unsupervised feature selection methods consider only the dependence between the selected features and the output values representing the underlying clustering structure, while ignoring the redundancy among the features themselves. To address these issues, a new unsupervised feature selection method based on HSIC (UFSHSIC) is proposed, which uses HSIC as a dependence criterion to measure both the correlation between features and the overall sample structure and the redundancy between features. Experimental comparisons with other classical feature selection methods on multiple real datasets show that the proposed method can effectively select features from unlabeled samples, and that the selected feature subset yields performance comparable to or better than that of supervised feature selection methods.
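The abstract describes a selection criterion that trades off HSIC-measured relevance (feature vs. overall sample structure) against HSIC-measured redundancy (feature vs. already-selected features). The paper's exact objective and optimization are not given here, so the following is only a minimal illustrative sketch of that idea, assuming a biased empirical HSIC estimator, Gaussian kernels, a greedy forward search, and a hypothetical trade-off weight `lam` — none of which are confirmed details of UFSHSIC.

```python
import numpy as np

def rbf_kernel(x, gamma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared distances
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def hsic(K, L):
    # Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 1 1^T is the centering matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def select_features(X, m, gamma=1.0, lam=0.5):
    """Greedy sketch: at each step pick the feature maximizing
    HSIC(feature, whole sample) - lam * mean HSIC(feature, selected)."""
    n, d = X.shape
    KX = rbf_kernel(X, gamma)  # kernel on full samples (cluster structure proxy)
    Kf = [rbf_kernel(X[:, [j]], gamma) for j in range(d)]  # per-feature kernels
    rel = np.array([hsic(Kf[j], KX) for j in range(d)])    # relevance terms
    selected = []
    for _ in range(m):
        best, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            red = (np.mean([hsic(Kf[j], Kf[s]) for s in selected])
                   if selected else 0.0)                   # redundancy penalty
            score = rel[j] - lam * red
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Because no labels appear anywhere in the score, the procedure is unsupervised: the full-sample kernel `KX` stands in for the clustering structure that a supervised method would read off the labels.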

Key words: unsupervised feature selection, Hilbert-Schmidt independence criterion (HSIC), kernel method, machine learning, feature redundancy

CLC Number: TP181