
Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (11): 7-12. doi: 10.6040/j.issn.1671-9352.0.2016.238


Study on feature selection method based on information loss

LI Zhao1,2,3, SUN Zhan-quan2,3,4, LI Xiao2,3, LI Cheng2,3

  1. School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China; 2. Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan 250014, Shandong, China; 3. Shandong Engineering Technology Research Center of E-government Big Data, Jinan 250014, Shandong, China; 4. Shandong Provincial Key Laboratory of Computer Networks, Jinan 250014, Shandong, China
  • Received: 2016-05-26 Online: 2016-11-20 Published: 2016-11-22
  • About the author: LI Zhao (1975- ), male, Ph.D. candidate, assistant researcher; research interests: big data analysis and processing, software engineering. E-mail: liz@sdas.org
  • Supported by:
    National Natural Science Foundation of China (61472230)

Abstract: This paper aims at fast feature selection by studying the information measure between feature variables and the class variable, together with the method for calculating the information measure between a feature subset and the class variable. An information loss measure based on extended entropy is proposed and used to measure the correlation between variables. To avoid the costly computation of joint mutual information, a method based on information loss is proposed for calculating the increase in correlation contributed by a candidate feature. It ensures that each newly added feature provides additional information while also speeding up feature selection. Finally, the method is applied to three classification datasets from the UCI repository. The selected feature subsets are validated by classification with a support vector machine, and the results are compared with those of other commonly used feature selection methods. The comparison shows that the proposed method is more effective than the feature selection methods it was compared with.

Key words: information loss, information bottleneck theory, mutual information, feature selection, extended entropy
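The general idea of adding features one at a time without ever computing joint mutual information can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the extended-entropy information loss measure, so the sketch substitutes plain empirical mutual information and an mRMR-style score (relevance to the class minus mean redundancy with the features already chosen). The function names `mutual_information` and `greedy_select` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_select(features, labels, k):
    """Greedily pick up to k feature names using only pairwise terms.

    Assumed stand-in score (not the paper's measure):
    I(feature; class) minus the mean I(feature; selected feature).
    """
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []
    while len(selected) < min(k, len(features)):

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected


# Toy data: the class is determined by two independent binary features.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * ai + bi for ai, bi in zip(a, b)]
toy_features = {
    "a": a,
    "a_copy": list(a),                      # fully redundant with "a"
    "b": b,
    "noise": [0, 0, 0, 0, 1, 1, 1, 1],      # independent of the class
}
print(greedy_select(toy_features, labels, 2))
```

On this toy data the redundant copy of the first feature is skipped in favour of the second informative feature, which is the behaviour the abstract describes as ensuring that each new feature adds information. Every score is computed from two-variable counts only, so no joint distribution over the whole selected subset is needed.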

CLC number:

  • TP311