Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (1): 65-70. doi: 10.6040/j.issn.1671-9352.1.2015.042

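A text clustering algorithm based on find of density peaks

LIU Ying-ying1, LIU Pei-yu1, WANG Zhi-hao1, LI Qing-qing1, ZHU Zhen-fang2

  1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong, China;
  2. School of Information Science and Electric Engineering, Shandong Jiaotong University, Jinan 250357, Shandong, China
  • Received: 2015-09-18 Online: 2016-01-16 Published: 2016-11-29
  • About the author: LIU Ying-ying (born 1990), female, master's student; research interest: natural language processing. E-mail: llyyyy1990@126.com
  • Supported by:
    National Natural Science Foundation of China (61373148); National Social Science Foundation of China (12BXW040); Natural Science Foundation of Shandong Province (ZR2012FM038); Shandong Province Award Fund for Outstanding Young and Middle-aged Scientists (BS2013DX033); Humanities and Social Sciences Foundation of the Ministry of Education (14YJC860042); Shandong Province Social Science Planning Project (12BXWJ01); Shandong Province Higher Education Science and Technology Program (J12LN21)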

Abstract: This paper proposes a text clustering algorithm based on density peak discovery. The distance and density computations of the density peak method are recast as similarity computations between text vectors. Documents are represented with the vector space model (VSM), and similarity is computed with the cosine formula, from which the local density of each document and its distance from points of higher density are obtained. After noise points are removed, cluster centers are selected, and each remaining non-center point is assigned to the cluster of its nearest cluster center. Several sets of comparative experiments verify the reliability and robustness of the method.
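The steps named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the term weighting, the density estimate, the noise rule, or the center-selection rule. The sketch therefore assumes raw term-frequency vectors over whitespace tokens, distance defined as 1 minus cosine similarity, density as the number of documents within a cutoff `d_c`, noise as any document with density below `min_density`, and centers as the `n_clusters` documents with the largest product of density and distance. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector over whitespace tokens (assumed VSM weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def density_peak_clusters(texts, d_c=0.5, n_clusters=2, min_density=1):
    """Return one cluster label per text; noise points get label -1."""
    vectors = [tf_vector(t) for t in texts]
    n = len(vectors)
    # Assumption: distance is 1 - cosine similarity.
    dist = [[1.0 - cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    # Local density: number of other documents closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit documents by decreasing density; ties are broken by index, so an
    # earlier document in this order counts as "higher density".
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest document gets its largest distance to any document.
            delta[i] = max(dist[i])
        else:
            # Distance to the nearest document of higher density.
            delta[i] = min(dist[i][k] for k in order[:rank])
    # Assumed noise rule: too few neighbours within d_c.
    noise = {i for i in range(n) if rho[i] < min_density}
    candidates = [i for i in order if i not in noise]
    # Assumed center rule: largest rho * delta among non-noise documents.
    centers = sorted(candidates, key=lambda i: -(rho[i] * delta[i]))[:n_clusters]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining non-center document joins its nearest center's cluster.
    for i in range(n):
        if i in noise or i in centers:
            continue
        labels[i] = labels[min(centers, key=lambda c: dist[i][c])]
    return labels


texts = [
    "apple banana fruit",
    "apple fruit juice",
    "banana fruit apple pie",
    "python code software",
    "software code bug",
    "python software bug",
    "quantum galaxy telescope",
]
labels = density_peak_clusters(texts, d_c=0.5, n_clusters=2, min_density=1)
print(labels)
```

On this toy corpus the first three documents share one label, the next three share another, and the last document, which has no terms in common with the rest, is marked as noise. In practice the cutoff and the number of centers would need tuning on real data.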

Key words: document clustering, vector distance, density, feature term

CLC number: 

  • TP391