
Journal of Shandong University (Natural Science) ›› 2014, Vol. 49 ›› Issue (11): 74-81. doi: 10.6040/j.issn.1671-9352.3.2014.328


Feature selection algorithm based on sentiment topic model

ZHENG Yan1, PANG Lin2, BI Hui2, LIU Wei2, CHENG Gong2

  1. Beijing Founder Electronics Co., Ltd., Beijing 100085, China;
  2. National Computer Network Emergency Response Technical Team Coordination Center of China, Beijing 100083, China
  • Received: 2014-08-28 Revised: 2014-10-17 Online: 2014-11-20 Published: 2014-11-25
  • Corresponding author: CHENG Gong (born 1972), male, associate research fellow; research interests include public opinion analysis and information security. E-mail: cg@isc.org.cn
  • About the author: ZHENG Yan (born 1981), female, M.S., senior engineer; research interests include public opinion analysis and text mining. E-mail: cindy.zhengyan@163.com

Abstract: Opinion mining plays an important role in areas such as enterprise business intelligence and government public opinion analysis. To fully exploit the commercial and social value contained in subjective text, a feature selection method based on a sentiment topic model is proposed. The method focuses on polarity (opinion) words and their co-occurrence, and uses a topic model to estimate the distribution of polarity words in a positive topic and a negative topic, with the aim of measuring how important each sentiment feature is to the expression of sentiment orientation. A support vector machine (SVM) classifier was used in the experiments. The results show that the feature selection method effectively improves the accuracy of cross-domain text sentiment classification and has good practical value.

Key words: text classification, feature selection, opinion mining, topic model
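
The abstract does not give the scoring formula, so the following is only a rough sketch of the general idea of ranking polarity words by how differently they are distributed over a positive and a negative topic. Everything in it is an assumption made for illustration: the function names `polarity_scores` and `select_features`, the smoothing parameter `alpha`, the absolute log-ratio score, and above all the use of smoothed word frequencies from labelled documents in place of real topic-model inference.

```python
import math
from collections import Counter


def polarity_scores(pos_docs, neg_docs, lexicon, alpha=1.0):
    """Score each lexicon word by how unevenly it spreads over two topics.

    Assumption: p(word | topic) is approximated by a smoothed relative
    frequency over documents already labelled positive or negative.
    This is a stand-in for topic-model inference, not the paper's model.
    """
    pos_counts = Counter(w for doc in pos_docs for w in doc if w in lexicon)
    neg_counts = Counter(w for doc in neg_docs for w in doc if w in lexicon)
    vocab = sorted(lexicon)
    # Additive smoothing keeps every probability strictly positive.
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    scores = {}
    for w in vocab:
        p_pos = (pos_counts[w] + alpha) / pos_total
        p_neg = (neg_counts[w] + alpha) / neg_total
        # Hypothetical importance measure: absolute log-ratio of the two
        # topic-conditional probabilities (0 means equally likely in both).
        scores[w] = abs(math.log(p_pos / p_neg))
    return scores


def select_features(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    # Toy tokenised reviews, invented purely for this example.
    pos_docs = [["good", "great", "price"], ["great", "service", "good"]]
    neg_docs = [["bad", "price", "poor"], ["poor", "service", "bad"]]
    lexicon = {"good", "great", "bad", "poor", "price", "service"}
    scores = polarity_scores(pos_docs, neg_docs, lexicon)
    print(select_features(scores, 4))
```

In a pipeline like the one the abstract describes, the selected words would then serve as the feature set for a classifier such as an SVM; that step is omitted here.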

CLC number: TP391