Journal of Shandong University (Natural Science) ›› 2014, Vol. 49 ›› Issue (09): 62-68. doi: 10.6040/j.issn.1671-9352.2.2014.389

• Article •

Comparative research on text knowledge discovery for network public opinion

JIAO Lu-lin, PENG Yan, LIN Yun

  1. College of Management, Capital Normal University, Beijing 100048, China
  • Received: 2014-06-24 Revised: 2014-08-27 Online: 2014-09-20 Published: 2014-09-30
  • Corresponding author: PENG Yan (born 1967), female, professor, Ph.D.; research interests: data mining, intelligent management and decision-making. E-mail: pengyanpy@163.com
  • About the author: JIAO Lu-lin (born 1989), male, master's student; research interests: text mining, intelligent decision-making and support. E-mail: rucky04@163.com
  • Supported by:
    Beijing Natural Science Foundation (9142002); Science and Technology Plan General Project of the Beijing Municipal Commission of Education (KM201410028020)

Abstract: For network public opinion analysis, five clustering algorithms were studied: hierarchical (system) clustering, string kernels, the K-nearest neighbor (KNN) algorithm, the support vector machine (SVM) algorithm, and topic models. Using network public opinion data as the data set and the R language environment as the experimental tool, the strengths and weaknesses of the five algorithms were compared and simulation experiments were carried out. The experimental results show that topic models are better suited to text clustering than the other algorithms. Among the topic models, the correlated topic model (CTM) method is more suitable for exploring and discovering relations between classes, while the Gibbs sampling method outperforms the CTM method in text clustering.

Key words: text knowledge discovery, text clustering, topic model, network public opinion

CLC number: TP309
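
The abstract reports that a topic model fitted by Gibbs sampling performed well for text clustering. As a rough illustration of that idea only, the sketch below fits a small latent Dirichlet allocation model with collapsed Gibbs sampling and assigns each document to its highest-probability topic. It is not the authors' code: the paper used the R environment, while this sketch is in Python, and the function names (`lda_gibbs`, `cluster_by_topic`), the hyperparameters (`alpha`, `beta`, `n_iter`), and the toy corpus are all assumptions made up for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns theta, the estimated document-topic proportions
    (one row per document, each row summing to 1).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word v in topic k
    topic_total = [0] * n_topics                           # tokens in topic k
    assignments = []                                       # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


def cluster_by_topic(theta):
    """Assign each document to its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


# Invented toy corpus, for illustration only.
docs = [
    "flood rescue river rain rescue".split(),
    "rain flood river warning".split(),
    "stock market price trading".split(),
    "market price stock investor trading".split(),
]
theta = lda_gibbs(docs, n_topics=2)
labels = cluster_by_topic(theta)
print(labels)
```

Taking the dominant topic as the cluster label is one simple way to turn a topic model into a clustering; other choices, such as clustering the rows of `theta` directly, are equally plausible, and the abstract does not say which was used.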