Journal of Shandong University (Natural Science), 2017, Vol. 52, Issue (7): 66-72. doi: 10.6040/j.issn.1671-9352.1.2016.123

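Short text clustering based on word embeddings and EMD

HUANG Dong, XU Bo, XU Kan, LIN Hong-fei*, YANG Zhi-hao

  1. Information Retrieval Laboratory, Dalian University of Technology, Dalian 116023, Liaoning, China
  • Received: 2016-11-25 Online: 2017-07-20 Published: 2017-07-07
  • Corresponding author: LIN Hong-fei (born 1962), male, Ph.D., professor; research interests: search engines, text mining, affective computing, and natural language processing. E-mail: hflin@dlut.edu.cn
  • About the author: HUANG Dong (born 1981), male, master's student, assistant researcher; research interests: natural language processing and text mining. E-mail: 59963695@qq.com
  • Supported by:
    National Natural Science Foundation of China (61572102, 61602078, 61562080); National High Technology Research and Development Program of China (863 Program) (2006AA01Z151); Natural Science Foundation of Liaoning Province (201202031, 2014020003); Scientific Research Foundation for Returned Overseas Chinese Scholars of the Ministry of Education and Specialized Research Fund for the Doctoral Program of Higher Education (20090041110002); Fundamental Research Funds for the Central Universities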

Abstract: Short text clustering plays an important role in data mining. Traditional short text clustering models suffer from high dimensionality, sparse data, and a lack of semantic information. To address the poor clustering performance caused by the sparse features and the anomalous, dynamic semantics of Internet short texts, this paper proposes a text representation based on feature-word embeddings and a short text clustering algorithm based on the moving distance of feature words. First, the Continuous Skip-gram Model is trained on a large-scale corpus to obtain word embeddings that represent the semantics of the feature words. Then the Euclidean distance is used to compute the similarity between feature words, and EMD (Earth Mover's Distance) is introduced to compute the similarity between short texts. Finally, this similarity is applied in the K-means clustering algorithm to cluster the short texts. Evaluation results on three data sets show that the method outperforms traditional clustering algorithms.

Key words: word embeddings, clustering, earth mover's distance (EMD), similarity calculation, short text

CLC Number: TP391.1
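
The text-to-text distance described in the abstract (Euclidean distance between word vectors, EMD between short texts) can be illustrated with a small sketch. This is not the authors' implementation: the 2-D vectors in `TOY_VECTORS` are made up rather than trained with Skip-gram, word weights are assumed to be normalized term frequencies, and the helper names `euclidean`, `min_cost_transport`, and `emd` are hypothetical. EMD is treated as a transportation problem and solved with successive shortest paths, using only the Python standard library.

```python
import math
from collections import Counter

# Made-up 2-D "embeddings" for illustration only (not trained with Skip-gram).
TOY_VECTORS = {
    "cat": (0.0, 0.0), "dog": (1.0, 0.0),
    "car": (0.0, 4.0), "truck": (1.0, 4.0),
}

def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def min_cost_transport(supply, demand, cost):
    """Minimum total cost of shipping integer supplies to integer demands."""
    m, n = len(supply), len(demand)
    source, sink = m + n, m + n + 1
    graph = [[] for _ in range(m + n + 2)]  # adjacency lists of edge indices
    edges = []  # [to, residual capacity, unit cost]; edge e ^ 1 is the reverse of e

    def add_edge(u, v, cap, c):
        graph[u].append(len(edges)); edges.append([v, cap, c])
        graph[v].append(len(edges)); edges.append([u, 0, -c])

    for i in range(m):
        add_edge(source, i, supply[i], 0.0)
        for j in range(n):
            add_edge(i, m + j, supply[i], cost[i][j])
    for j in range(n):
        add_edge(m + j, sink, demand[j], 0.0)

    total = 0.0
    while True:
        # Bellman-Ford: cheapest source -> sink path in the residual graph.
        dist = [math.inf] * len(graph)
        prev = [-1] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == math.inf:
                    continue
                for e in graph[u]:
                    v, cap, c = edges[e]
                    if cap > 0 and dist[u] + c < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + c, e, True
            if not changed:
                break
        if prev[sink] == -1:  # no augmenting path left: all mass is shipped
            return total
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = math.inf, sink
        while v != source:
            push = min(push, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = sink
        while v != source:
            e = prev[v]
            edges[e][1] -= push
            edges[e ^ 1][1] += push
            total += push * edges[e][2]
            v = edges[e ^ 1][0]

def emd(words_a, words_b, vectors):
    """EMD between two tokenized texts, weighting words by term frequency."""
    if not words_a or not words_b:
        raise ValueError("both texts must contain at least one word")
    count_a, count_b = Counter(words_a), Counter(words_b)
    src, dst = sorted(count_a), sorted(count_b)
    n_a, n_b = len(words_a), len(words_b)
    # Scale the frequencies c/n_a and c/n_b by n_a * n_b so masses are integers.
    supply = [count_a[w] * n_b for w in src]
    demand = [count_b[w] * n_a for w in dst]
    cost = [[euclidean(vectors[s], vectors[t]) for t in dst] for s in src]
    return min_cost_transport(supply, demand, cost) / (n_a * n_b)

print(emd(["cat", "dog"], ["car", "truck"], TOY_VECTORS))
print(emd(["cat", "dog"], ["cat"], TOY_VECTORS))
```

Pairwise distances of this kind could then feed the clustering step. The abstract does not say how K-means is adapted to an EMD-based similarity, so that step is left out of the sketch.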