JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2017, Vol. 52 ›› Issue (7): 66-72. doi: 10.6040/j.issn.1671-9352.1.2016.123

Short text clustering based on word embeddings and EMD

HUANG Dong, XU Bo, XU Kan, LIN Hong-fei*, YANG Zhi-hao   

  1. Information Retrieval Laboratory, Dalian University of Technology, Dalian 116023, Liaoning, China
  • Received:2016-11-25 Online:2017-07-20 Published:2017-07-07

Abstract: Short text clustering plays an important role in data mining. Traditional short text clustering models suffer from high dimensionality, sparse data, and a lack of semantic information. To overcome the shortcomings caused by sparse features, semantic ambiguity, and the dynamic nature of short texts, this paper presents a text representation based on word embeddings and a short text clustering algorithm based on the moving distance between feature words. First, word embeddings that capture the semantics of the feature words are obtained by training the Continuous Skip-gram model on a large-scale corpus. Second, the Euclidean distance between embeddings is used to measure the similarity of feature words. Third, the Earth Mover's Distance (EMD) is used to compute the similarity between short texts. Finally, this similarity is applied in the K-means algorithm to cluster the short texts. Evaluation results on three data sets show that the method outperforms traditional clustering algorithms.
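
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors in TOY_VECTORS are made-up stand-ins for trained Skip-gram embeddings, the function names are hypothetical, documents are assumed to be non-empty token lists, and word masses are assumed to be relative frequencies. Because the abstract does not say how cluster centres are formed under EMD, the clustering step here swaps in a k-medoids-style loop over pairwise EMD in place of K-means. EMD itself is solved exactly as a small min-cost flow using only the standard library.

```python
import math
from collections import Counter
from fractions import Fraction

# Hypothetical 2-d toy vectors; real ones would come from a trained Skip-gram model.
TOY_VECTORS = {
    "cat": (0.0, 0.0), "dog": (1.0, 0.0),
    "stock": (10.0, 10.0), "market": (10.0, 11.0),
}


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def emd(doc_a, doc_b, vectors):
    """EMD between two token lists, using relative word frequencies as masses."""
    count_a, count_b = Counter(doc_a), Counter(doc_b)
    words_a, words_b = list(count_a), list(count_b)
    n, m = len(words_a), len(words_b)
    source, sink, size = n + m, n + m + 1, n + m + 2
    edges = []                       # [target, residual capacity, cost]
    adj = [[] for _ in range(size)]  # edge e and e ^ 1 are a forward/backward pair

    def add_edge(u, v, cap, cost):
        adj[u].append(len(edges))
        edges.append([v, cap, cost])
        adj[v].append(len(edges))
        edges.append([u, Fraction(0), -cost])

    for i, w in enumerate(words_a):
        add_edge(source, i, Fraction(count_a[w], len(doc_a)), 0.0)
        for j, x in enumerate(words_b):
            add_edge(i, n + j, Fraction(1), euclidean(vectors[w], vectors[x]))
    for j, x in enumerate(words_b):
        add_edge(n + j, sink, Fraction(count_b[x], len(doc_b)), 0.0)

    total = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [math.inf] * size
        prev = [-1] * size
        dist[source] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for e in adj[u]:
                    v, cap, cost = edges[e]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, e, True
            if not changed:
                break
        if dist[sink] == math.inf:
            return total             # all mass has been moved
        # Push as much mass along the path as its tightest edge allows.
        flow, v = Fraction(1), sink
        while v != source:
            flow = min(flow, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = sink
        while v != source:
            edges[prev[v]][1] -= flow
            edges[prev[v] ^ 1][1] += flow
            v = edges[prev[v] ^ 1][0]
        total += float(flow) * dist[sink]


def cluster(docs, k, vectors, rounds=10):
    """k-medoids-style clustering on pairwise EMD (a stand-in for the K-means step)."""
    d = [[emd(a, b, vectors) for b in docs] for a in docs]
    medoids = list(range(k))         # naive initialisation: the first k documents
    labels = []
    for _ in range(rounds):
        labels = [min(range(k), key=lambda c: d[i][medoids[c]])
                  for i in range(len(docs))]
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c] or [medoids[c]]
            new_medoids.append(min(members, key=lambda i: sum(d[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels
```

With these toy vectors, emd(["cat", "dog"], ["dog"], TOY_VECTORS) moves half of the mass a distance of 1, giving 0.5. Capacities are kept as Fraction values so that each augmentation is exact; for realistic vocabulary sizes a dedicated linear-programming or optimal-transport solver would be the more practical choice.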

Key words: earth mover's distance, word embeddings, similarity calculation, short text, clustering

CLC Number: 

  • TP391.1