
Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (11): 7-12. doi: 10.6040/j.issn.1671-9352.0.2016.238


Study on feature selection method based on information loss

LI Zhao1,2,3, SUN Zhan-quan2,3,4, LI Xiao2,3, LI Cheng2,3

  1. School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China; 2. Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan 250014, Shandong, China; 3. Shandong Engineering Technology Research Center of E-government Big Data, Jinan 250014, Shandong, China; 4. Shandong Provincial Key Laboratory of Computer Networks, Jinan 250014, Shandong, China
  • Received: 2016-05-26  Online: 2016-11-20  Published: 2016-11-22
  • About the author: LI Zhao (1975- ), male, Ph.D. candidate and assistant researcher; his research interests include big data analysis and processing, and software engineering. E-mail: liz@sdas.org
  • Supported by: the National Natural Science Foundation of China (61472230)



Abstract: The purpose of this work is fast feature selection, achieved by studying information measures between individual features and the class variable, and between a selected feature subset and the class variable. An information loss measure based on extended entropy is used to quantify the correlation between features. To avoid the costly computation of joint mutual information, a feature selection method is proposed that scores each candidate feature by the increase in relevance it contributes, computed from its information loss; this guarantees that every newly added feature supplies the largest amount of new information to the selected subset while keeping the selection itself fast. Finally, the method is applied to three classification datasets from the UCI repository: the selected feature subsets are validated with a support vector machine classifier, and the classification results are compared with those of other commonly used feature selection methods. The comparison shows that the proposed method is more effective than existing feature selection methods.
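
The selection loop described above is greedy: at each step, the candidate feature that adds the most new information about the class, as judged by the information-loss measure, joins the subset. The paper's extended-entropy formula is not reproduced on this page, so the sketch below (Python, assuming only numpy) substitutes plain Shannon mutual information and a relevance-minus-redundancy score for the information-loss increment; greedy_select, the scoring rule, and the synthetic data are illustrative stand-ins, not the authors' exact method.

import numpy as np

def entropy(x):
    # Shannon entropy (bits) of a discrete 1-D array.
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    # H(X,Y) from the empirical joint distribution.
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y).
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def greedy_select(X, y, k):
    # Greedy forward selection: keep the candidate whose relevance to the
    # class, discounted by its average redundancy with already-selected
    # features, is largest (a stand-in for the paper's information-loss
    # increment, which avoids joint mutual information over the subset).
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        def score(j):
            relevance = mutual_information(X[:, j], y)
            if not selected:
                return relevance
            redundancy = np.mean([mutual_information(X[:, j], X[:, s])
                                  for s in selected])
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Tiny demo: column 0 predicts y, column 1 is an exact copy of it,
# and column 2 is noise. The redundant copy is rejected.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
f1 = np.where(rng.random(500) < 0.1, 1 - y, y)    # informative feature
X = np.stack([f1, f1.copy(), rng.integers(0, 2, size=500)], axis=1)
print(greedy_select(X, y, 2))                      # -> [0, 2]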

Key words: information loss, information bottleneck theory, mutual information, feature selection, extended entropy
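
The abstract's validation step trains a support vector machine on the selected columns and compares classification accuracy against the full feature set. A minimal sketch of that protocol, assuming scikit-learn and its bundled copy of the UCI Wine dataset; the indices in selected are placeholders for a selector's output, not results from the paper.

from sklearn.datasets import load_wine            # UCI Wine, shipped with sklearn
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
selected = [0, 6, 9, 11]                          # hypothetical selector output

svm = make_pipeline(StandardScaler(), SVC())      # RBF-kernel SVM on standardized inputs
acc_all = cross_val_score(svm, X, y, cv=5).mean()
acc_sel = cross_val_score(svm, X[:, selected], y, cv=5).mean()
print(f"all 13 features: {acc_all:.3f}   4 selected: {acc_sel:.3f}")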

CLC number: TP311

References:
[1] YAO Xu, WANG Xiaodan, ZHANG Yuxi, et al. Summary of feature selection algorithms[J]. Control and Decision, 2012, 27(2): 161-166. (in Chinese)
[2] LIU Xiaoming, TANG Jinshan. Mass classification in mammograms using selected geometry and texture features, and a new SVM-based feature selection method[J]. IEEE Systems Journal, 2014, 8(3): 910-920.
[3] WANG De, NIE Feiping, HUANG Heng. Feature selection via global redundancy minimization[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(10): 2743-2755.
[4] HOU Chengping, NIE Feiping, LI Xuelong, et al. Joint embedding learning and sparse regression: a framework for unsupervised feature selection[J]. IEEE Transactions on Cybernetics, 2014, 44(6):793-804.
[5] BACCIANELLA S, ESULI A, SEBASTIANI F. Feature selection for ordinal text classification[J]. Neural Computation, 2014, 26(3): 557-591.
[6] AROQUIARAJ I L, THANGAVEL K. Mammogram image feature selection using unsupervised tolerance rough set relative reduct algorithm[C] //International Conference on Pattern Recognition, Informatics and Mobile Engineering(PRIME). New York: IEEE, 2013:479-484.
[7] SUN Zhanquan, LI Zhao. Data intensive parallel feature selection method study[C] //International Joint Conference on Neural Networks(IJCNN). New York: IEEE, 2014: 2256-2262.
[8] XU Junling, ZHOU Yuming, CHEN Lin, et al. An unsupervised feature selection approach based on mutual information[J]. Journal of Computer Research and Development, 2012, 49(2): 372-382. (in Chinese)
[9] GOLDBERGER J, GORDON S, GREENSPAN H. Unsupervised image-set clustering using an information theoretic framework[J]. IEEE Transactions on Image Processing, 2006, 15(2):449-458.
[10] CHIAPPINO S, MARCENARO L, REGAZZONI C S. Information bottleneck-based relevant knowledge representation in large-scale video surveillance systems[C] // IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). New York: IEEE, 2014: 4364-4368.
[11] CORTES C, VAPNIK V. Support-vector networks[J]. Machine Learning, 1995, 20(3): 273-297.
[12] FAN R E, CHEN P H, LIN C J. Working set selection using second order information for training support vector machines[J]. Journal of Machine Learning Research, 2005, 6(4):1889-1918.