JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2016, Vol. 51 ›› Issue (1): 65-70. doi: 10.6040/j.issn.1671-9352.1.2015.042


A text clustering algorithm based on find of density peaks

LIU Ying-ying1, LIU Pei-yu1, WANG Zhi-hao1, LI Qing-qing1, ZHU Zhen-fang2   

  1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong, China;
    2. School of Information Science and Electric Engineering, Shandong Jiaotong University, Jinan 250357, Shandong, China
  • Received:2015-09-18 Online:2016-01-16 Published:2016-11-29

Abstract: This paper proposes a text clustering algorithm based on finding density peaks. The algorithm computes the distance between texts and the density of each text, both derived from the similarity of the text vectors. Documents are represented with the VSM (Vector Space Model), and their similarity is computed with the cosine formula. The algorithm then finds, for each document, its local density and its distance to documents of higher density, removes the noise points, and selects the cluster centers. Each remaining non-center point is assigned to the cluster whose center is nearest. Several sets of comparison experiments indicate that the proposed density-based text clustering has advantages in reliability and robustness.
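The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the term weighting, the density definition, the noise rule, or the center-selection rule. The sketch therefore assumes raw term frequencies for the VSM, a cutoff-count local density with a hypothetical cutoff `d_c`, and centers chosen as the `n_clusters` points with the largest product of density and distance. Noise removal is omitted. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed stand-in for VSM weighting)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Distance = 1 - cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def density_peaks(docs, d_c=0.5, n_clusters=2):
    """Return one cluster label per document (d_c and n_clusters are assumptions)."""
    vecs = [tf_vector(d) for d in docs]
    n = len(vecs)
    dist = [[cosine_distance(vecs[i], vecs[j]) for j in range(n)]
            for i in range(n)]

    # Local density: number of other documents closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit documents from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta: distance to the nearest document that ranks higher in density.
    # The top-ranked document gets its largest distance to any document.
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][k] for k in order[:rank])

    # Assumed center rule: largest rho * delta.
    centers = sorted(order, key=lambda i: -(rho[i] * delta[i]))[:n_clusters]

    # Assign every document to the cluster whose center is nearest.
    return [min(range(len(centers)), key=lambda c: dist[i][centers[c]])
            for i in range(n)]


docs = [
    "cat dog pet", "cat dog animal", "dog pet animal",
    "stock market price", "stock price trade", "market trade price",
]
labels = density_peaks(docs, d_c=0.5, n_clusters=2)
print(labels)
```

In this toy corpus the first three documents share vocabulary only with each other, as do the last three, so each group forms one dense region. A real corpus would need a weighting scheme such as TF-IDF and a data-driven choice of `d_c`.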

Key words: document clustering, vector distance, density, feature term

CLC Number: TP391