Application of LDA-Based Text Clustering in Network Public Opinion Analysis

doi:10.6040/j.issn.1671-9352.2.2014.327

Journal of Shandong University (Natural Science), 2014, Vol. 49, Issue (09): 129-134.


WANG Shao-peng¹, PENG Yan², WANG Jie²

1. College of Information Engineering, Capital Normal University, Beijing 100048, China;
2. School of Management, Capital Normal University, Beijing 100089, China

Received: 2014-06-24 Revised: 2014-08-28 Online: 2014-09-20 Published: 2014-09-30
Corresponding author: PENG Yan (b. 1967), female, professor, Ph.D.; research interests: data mining, intelligent management and decision-making. E-mail: pengyanpy@163.com
About the author: WANG Shao-peng (b. 1987), male, master's student; research interest: intelligent applications. E-mail: wangshaopengxc@163.com
Funding: Beijing Natural Science Foundation (9142002); General Program of the Science and Technology Plan of the Beijing Municipal Education Commission (KM201310028020)

Abstract

Traditional word-based text clustering algorithms ignore the latent information that a text may contain. To address this problem, a text clustering algorithm based on the latent Dirichlet allocation (LDA) topic model is proposed. The method computes text similarity separately with the TF-IDF algorithm and with the LDA topic model, determines the fusion coefficient of the two similarities through a cost function, and combines them linearly to obtain the similarity between texts. The clustering results are evaluated with the F-measure. When the LDA topic model is constructed, Gibbs sampling is used for parameter estimation, and the optimal number of topics is determined by the standard method of Bayesian statistics. Judged by the accuracy and stability of the clustering results in simulation experiments, the proposed method performs better than traditional text clustering algorithms.

Key words: topic model, LDA, TF-IDF, text similarity, network public opinion
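
The linear fusion of the two similarities described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`tfidf_vectors`, `cosine_sparse`, `cosine_dense`, `fused_similarity`), the use of cosine similarity for both representations, the convention that `lam` weights the TF-IDF term, and the toy data are all assumptions. The document-topic distributions in `theta` are taken as given (for example, from a Gibbs-sampled LDA model), and `lam` is fixed by hand because the abstract does not specify the cost function used to choose it.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) from tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_dense(p, q):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def fused_similarity(sim_tfidf, sim_lda, lam):
    """Linear combination of the two similarities (assumed convention:
    lam weights the TF-IDF similarity, 1 - lam the LDA similarity)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sim_tfidf + (1.0 - lam) * sim_lda


# Hypothetical toy corpus and hypothetical document-topic distributions.
docs = [["tax", "policy", "tax"], ["tax", "reform"], ["football", "match"]]
theta = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
lam = 0.5  # assumed value; the paper selects it through a cost function

vectors = tfidf_vectors(docs)
sim_01 = fused_similarity(cosine_sparse(vectors[0], vectors[1]),
                          cosine_dense(theta[0], theta[1]), lam)
sim_02 = fused_similarity(cosine_sparse(vectors[0], vectors[2]),
                          cosine_dense(theta[0], theta[2]), lam)
print(sim_01 > sim_02)
```

In this toy setting the first two documents share a term and have similar topic mixtures, so their fused similarity is higher than that of the first and third documents. The resulting pairwise similarities could then be passed to any similarity-based clustering algorithm.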

CLC Number: TP391

Cite this article: WANG Shao-peng, PENG Yan, WANG Jie. Research of the text clustering based on LDA using in network public opinion analysis[J]. Journal of Shandong University (Natural Science), 2014, 49(09): 129-134.


