JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2017, Vol. 52 ›› Issue (7): 66-72. doi: 10.6040/j.issn.1671-9352.1.2016.123

Short text clustering based on word embeddings and EMD

HUANG Dong, XU Bo, XU Kan, LIN Hong-fei*, YANG Zhi-hao   

  1. Information Retrieval Laboratory, Dalian University of Technology, Dalian 116023, Liaoning, China
  • Received: 2016-11-25  Online: 2017-07-20  Published: 2017-07-07

Abstract: Short text clustering plays an important role in data mining. Traditional short text clustering models suffer from high dimensionality, sparse data, and a lack of semantic information. To overcome the shortcomings caused by sparse features, semantic ambiguity, and the dynamic nature of short texts, this paper proposes a text representation based on word embeddings and a short text clustering algorithm based on the moving distance between feature words. First, word embeddings that capture the semantics of feature words are obtained by training the Continuous Skip-gram model on a large-scale corpus. Second, the Euclidean distance between embeddings is used to measure the similarity of feature words. Third, the Earth Mover's Distance (EMD) is used to compute the similarity between short texts. Finally, this similarity is applied in the K-means clustering algorithm to cluster the short texts. Evaluation on three data sets shows that the method outperforms traditional clustering algorithms.
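
The distance computation outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors in `EMBEDDINGS` are made-up toy values rather than trained Skip-gram embeddings, the word weights are assumed to be normalized term frequencies (the abstract does not specify a weighting), and the function names `bag_of_words` and `text_emd` are hypothetical. The transport problem is solved as a linear program with SciPy's `linprog`; the K-means step is not shown.

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-D "embeddings" (assumed values for illustration, not trained vectors).
EMBEDDINGS = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "stock": np.array([0.0, 1.0]),
    "market": np.array([0.1, 0.9]),
}


def bag_of_words(tokens):
    """Return the distinct words and their normalized term frequencies."""
    words = sorted(set(tokens))
    counts = np.array([tokens.count(w) for w in words], dtype=float)
    return words, counts / counts.sum()


def text_emd(tokens_a, tokens_b, embeddings=EMBEDDINGS):
    """Earth Mover's Distance between two tokenized short texts."""
    words_a, p = bag_of_words(tokens_a)
    words_b, q = bag_of_words(tokens_b)

    # Ground distance: Euclidean distance between word vectors.
    cost = np.array(
        [[np.linalg.norm(embeddings[a] - embeddings[b]) for b in words_b]
         for a in words_a]
    )
    n, m = cost.shape

    # Flow variables f[i, j] are flattened row-major.
    # Row i of the flow must sum to p[i]; column j must sum to q[j].
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])

    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


if __name__ == "__main__":
    print(text_emd(["cat", "stock"], ["kitten", "market"]))
    print(text_emd(["cat"], ["stock"]))
```

In this toy setup, texts that share no words but use semantically close ones (for example "cat" versus "kitten") receive a small distance, which is the property the abstract relies on to address sparsity in short texts.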

Key words: earth movers distance, word embeddings, similarity calculation, short text, clustering

CLC Number: TP391.1