JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2014, Vol. 49 ›› Issue (09): 129-134. doi: 10.6040/j.issn.1671-9352.2.2014.327

Research of the text clustering based on LDA using in network public opinion analysis

WANG Shao-peng1, PENG Yan2, WANG Jie2   

  1. College of Information Engineering, Capital Normal University, Beijing 100048, China;
    2. School of Management, Capital Normal University, Beijing 100089, China
  • Received:2014-06-24 Revised:2014-08-28 Online:2014-09-20 Published:2014-09-30

Abstract: Traditional word-based text clustering algorithms may ignore information hidden in the text. To address this problem, a text clustering algorithm based on the latent Dirichlet allocation (LDA) topic model is proposed. The algorithm computes text similarity with both the TF-IDF algorithm and the LDA topic model, determines the fusion coefficient of the two similarities through a cost function, obtains the final similarity between texts by linear combination, and evaluates the clustering result with the F-measure. In constructing the LDA model, the algorithm uses Gibbs sampling to estimate the parameters and a standard Bayesian statistical method to determine the optimal number of topics. Simulation results show that, in terms of the accuracy and stability of clustering results, the proposed algorithm outperforms the traditional text clustering algorithm.
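
The linear fusion of word-level and topic-level similarity described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity is assumed for both components, the topic distributions `theta` are hypothetical values standing in for the output of a fitted LDA model, and the fusion coefficient `lam` is fixed by hand rather than chosen through the cost function mentioned in the abstract. All function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_similarity(tfidf_u, tfidf_v, theta_u, theta_v, lam):
    """Linear combination: lam * word-level + (1 - lam) * topic-level similarity."""
    topic_u = dict(enumerate(theta_u))
    topic_v = dict(enumerate(theta_v))
    return lam * cosine(tfidf_u, tfidf_v) + (1.0 - lam) * cosine(topic_u, topic_v)


# Toy corpus: documents 0 and 1 share a theme but no words.
docs = [
    ["flood", "rescue", "city"],
    ["storm", "relief", "town"],
    ["stock", "market", "price"],
]
# Hypothetical document-topic distributions (assumed, not estimated here).
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

vecs = tfidf_vectors(docs)
sim_01 = fused_similarity(vecs[0], vecs[1], theta[0], theta[1], lam=0.5)
sim_02 = fused_similarity(vecs[0], vecs[2], theta[0], theta[2], lam=0.5)
```

In this toy corpus the word-level similarity between documents 0 and 1 is zero because they share no terms, yet the fused score still ranks them as closer than documents 0 and 2. This is the kind of hidden information that a purely word-based measure would miss.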

Key words: topic model, LDA, TF-IDF, text similarity, network public opinion

CLC Number: TP391