您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(理学版)》

《山东大学学报(理学版)》 ›› 2025, Vol. 60 ›› Issue (1): 29-44.doi: 10.6040/j.issn.1671-9352.7.2023.4130

• • 上一篇    

基于自然邻居搜索优化策略的密度峰值聚类算法

张春昊1,解滨1,2,3*,徐童童1,张喜梅1   

  1. 1.河北师范大学计算机与网络空间安全学院, 河北 石家庄 050024;2.供应链大数据分析与数据安全河北省工程研究中心(河北师范大学), 河北 石家庄 050024;3.河北省网络与信息安全重点实验室(河北师范大学), 河北 石家庄 050024
  • 发布日期:2025-01-10
  • 通讯作者: 解滨(1976— ),男,教授,博士,主要研究方向为机器学习、粒计算与智能数据分析. E-mail: xiebin_hebtu@126.com
  • 作者简介:张春昊(1996— ),男,硕士研究生,主要研究方向为机器学习与数据挖掘. E-mail: zhangchunhao_hebtu@163.com*通信作者:解滨(1976— ),男,教授,博士,主要研究方向为机器学习、粒计算与智能数据分析. E-mail: xiebin_hebtu@126.com
  • 基金资助:
    国家自然科学基金资助项目(62076088);中央科研院所基本科研业务费资助项目(SK202324);河北师范大学技术创新基金资助项目(L2020K09)

Density peak clustering algorithm optimized by natural neighbor search

ZHANG Chunhao1, XIE Bin1,2,3*, XU Tongtong1, ZHANG Ximei1   

  1. 1. College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang 050024, Hebei, China;
    2. Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics &
    Data Security, Hebei Normal University, Shijiazhuang 050024, Hebei, China;
    3. Hebei Provincial Key Laboratory of Network &
    Information Security, Hebei Normal University, Shijiazhuang 050024, Hebei, China
  • Published:2025-01-10

摘要: 结合自然邻居搜索算法改进了密度峰值聚类(clustering by fast search and find of density peaks, CFSFDP)算法存在的一系列问题,提出基于自然邻居搜索优化策略的密度峰值聚类(density peak clustering algorithm optimized by natural neighbor search, NaN-CFSFDP)算法。基于自然邻居搜索算法提出了一种离群样本的检测方法,针对CFSFDP算法中截断距离dc人工准确取值较难的问题,结合自然邻居搜索算法改进了dc的计算方式,实现了dc的自动取值。重新设计并统一了CFSFDP算法的样本密度度量规则,使其更关注每个样本的局部信息。由于数据集中因类簇间的密度差异大,密度峰值点集中于稠密簇使得簇丢失,因此提出样本共享自然邻居和类簇共享自然邻居的概念,构造新的类簇融合算法。合成数据集和真实数据集上的实验结果表明,在大多数情况下,NaN-CFSFDP算法在聚类性能上优于或至少与比较方法相当,且与CFSFDP算法及其改进算法相比参数更少。

关键词: 密度峰值, 自然邻居, 聚类, 类簇融合, 离群样本, 截断距离

Abstract: We combine the natural neighbor search algorithm to improve a series of problems of the density peaks clustering(CFSFDP)algorithm, and propose the NaN-CFSFDP algorithm. First, an outlier samples detection method is proposed based on the natural neighbor search algorithm. Then, for the problem that the truncation distance dc is difficult to be taken accurately manually in the CFSFDP algorithm, the calculation of dc is improved in combination with the natural neighbor search algorithm, and the automatic taking of dc is realized. The metric rule of the sample density of the CFSFDP algorithm is redesigned and unified to make it pay more attention to the local information of each sample. Finally, to address the problem that the density peak points in the dataset may be concentrated in dense clusters due to the large density difference between clusters, which leads to cluster loss, the concepts of shared natural neighbors for samples and shared natural neighbors for clusters are proposed to construct a new cluster fusion algorithm. Experimental results on synthetic and real datasets show that the algorithm outperforms or is at least comparable to the comparative method in terms of clustering performance in most cases and has fewer parameters compared to CFSFDP algorithm and its improvements.

Key words: density peaks, natural neighbor, clustering, cluster fusion, outlier samples, truncation distance

中图分类号: 

  • TP391
[1] OKTAR Y, TURKAN M. A review of sparsity-based clustering methods[J]. Signal Processing, 2018, 148:20-30.
[2] SAXENA A, PRASAD M, GUPTA A, et al. A review of clustering techniques and developments[J]. Neurocomputing, 2017, 267:664-681.
[3] MACQUEEN J B. Some methods for classification and analysis of multivariate observations[C] //Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967:281-297.
[4] BEZDEK J C, EHRLICH R, FULL W. FCM: the fuzzy c-means clustering algorithm[J]. Computers and Geosciences, 1984, 10(2/3):191-203.
[5] WANG Wei, YANG Jiong, MUNTZ R. STING: a statistical information grid approach to spatial data mining[C] //Proceedings of the 23rd International Conference on Very Large Data Bases. Athens: ACM, 1997:186-195.
[6] GUHA S, RASTOGI R, SHIM K. Cure: an efficient clustering algorithm for large databases[J]. Information Systems, 2001, 26(1):35-58.
[7] ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C] //Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Portland Oregon, AAAI, 1996:226-231.
[8] RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191):1492-1496.
[9] 谢娟英,高红超,谢维信. K近邻优化的密度峰值快速搜索聚类算法[J]. 中国科学(信息科学), 2016, 46(2):258-280. XIE Juanying, GAO Hongchao, XIE Weixin. K-nearest neighbors optimized clustering algorithm by fast search and finding the density peaks of a dataset[J]. Scientia Sinica(Informationis), 2016, 46(2):258-280.
[10] XIE Juanying, GAO Hongchao, XIE Weixin, et al. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J]. Information Sciences, 2016, 354:19-40.
[11] JIANG Jianhua, CHEN Yujun, MENG Xianqiu, et al. A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process[J]. Physica A, 2019, 523:702-713.
[12] ZHANG Rui, DU Tao, QU Shouning, et al. Adaptive density-based clustering algorithm with shared KNN conflict game[J]. Information Sciences, 2021, 565:344-369.
[13] LIU Yaohui, MA Zhengming, YU Fang. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy[J]. Knowledge-based Systems, 2017, 133:208-220.
[14] BAI Liang, CHENG Xueqi, LIANG Jiye, et al. Fast density clustering strategies based on the k-means algorithm[J]. Pattern Recognition, 2017, 71:375-386.
[15] LIU Rui, WANG Hong, YU Xiaomei. Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Information Sciences, 2018, 450:200-226.
[16] 张新元,贠卫国. 共享K近邻和多分配策略的密度峰值聚类算法[J]. 小型微型计算机系统, 2023, 44(1):75-82. ZHANG Xinyuan, YUN Weiguo. Sharing K-nearest neighbors and multiple assignment policies density peaks clustering algorithm[J]. Journal of Chinese Computer Systems, 2023, 44(1):75-82.
[17] ZHANG Chunhao, XIE Bin, ZHANG Yiran. Reverse-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Chinese Journal of Electronics, 2023, 32(6):1341-1354.
[18] 徐童童,解滨,张喜梅,等. 自适应聚类中心策略优化的密度峰值聚类算法[J]. 计算机工程与应用,2023,59(21):91-101. XU Tongtong, XIE Bin, ZHANG Ximei, et al. Density peak clustering algorithm optimized by adaptive clustering centers strategy[J]. Computer Engineering and Applications, 2023, 59(21):91-101.
[19] 张春昊,解滨,张喜梅,等. 一种结合自适应近邻与密度峰值的加权模糊聚类算法[J]. 小型微型计算机系统,2023, 44(9):1974-1982. ZHANG Chunhao, XIE Bin, ZHANG Ximei, et al. Weighted fuzzy clustering algorithm combining adaptive nearest neighbors and density peaks[J]. Journal of Chinese Computer Systems, 2023, 44(9):1974-1982.
[20] CHEN Yewang, TANG Shengyu, PEI Songwen. DHeat: a density heat-based algorithm for clustering with effective radius[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, 48(4):649-660.
[21] BRYANT A, CIOS K. RNN-DBSCAN: a density-based clustering algorithm using reverse nearest neighbor density estimates[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(6):1109-1121.
[22] ZHU Qingsheng, FENG Ji, HUANG Jinlong. Natural neighbor: a self-adaptive neighborhood method without parameter K[J]. Pattern Recognition Letters, 2016, 80:30-36.
[23] HUANG Jinlong, ZHU Qingsheng, YANG Lijun, et al. A non-parameter outlier detection algorithm based on natural neighbor[J]. Knowledge-based Systems, 2016, 92:71-77.
[24] YANG Lijun, ZHU Qingsheng, HUANG Jinlong, et al. Adaptive edited natural neighbor algorithm[J]. Neurocomputing, 2017, 230:427-433.
[25] CHENG Dongdong, ZHU Qingsheng, HUANG Jinlong, et al. Natural neighbor-based clustering algorithm with local representatives[J]. Knowledge-based Systems, 2017, 123:238-253.
[26] 陈叶旺,申莲莲,钟才明,等. 密度峰值聚类算法综述[J]. 计算机研究与发展, 2020, 57(2):378-394. CHEN Yewang, SHEN Lianlian, ZHONG Caiming, et al. Survey on density peak clustering algorithm[J]. Journal of Computer Research and Development, 2020, 57(2):378-394.
[27] FU Limin, MEDICO E. Flame: a novel fuzzy clustering method for the analysis of DNA microarray data[J]. BMC Bioinformatics, 2007, 8(3):1-15.
[28] GIONIS A, MANNILA H, TSAPARAS P. Clustering aggregation[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1):1-30.
[29] JAIN A K, LAW M H C. Data clustering: a users dilemma[C] //International Conference on Pattern Recognition & Machine Intelligence. Kolkata: Springer, 2005:1-10.
[30] DING Shifei, DU Mingjing, SUN Tongfeng, et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood[J]. Knowledge-based Systems, 2017, 133:294-313.
[31] HUBERT L, ARABIE P. Comparing partitions[J]. Journal of Classification, 1985, 2:193-218.
[32] CHANG H, YEUNG D Y. Robust path-based spectral clustering[J]. Pattern Recognition, 2008, 41(1):191-203.
[33] VEENMAN C J, REINDERS M J T, BACKER E. A maximum variance cluster algorithm[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9):1273-1280.
[34] ZAHN C T. Graph-theoretical methods for detecting and describing gestalt clusters[J]. IEEE Transactions on Computers, 1971, C-20(1):68-86.
[35] REZAE M, FRANTI P. Set matching measures for external cluster validity[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(8):2173-2186.
[1] 罗奇,苟刚. 基于聚类和群组归一化的多模态对话情绪识别[J]. 《山东大学学报(理学版)》, 2024, 59(7): 105-112.
[2] 朱金,付玉,管文瑞,王平心. 基于自然最近邻的样本扰动三支聚类[J]. 《山东大学学报(理学版)》, 2024, 59(5): 45-51.
[3] 郑晨颖,陈颖悦,侯贤宇,江连吉,廖亮. 一种邻域粒的模糊C均值聚类算法[J]. 《山东大学学报(理学版)》, 2024, 59(5): 35-44.
[4] 孙嘉睿,杜明晶. 模糊边界剥离聚类[J]. 《山东大学学报(理学版)》, 2024, 59(3): 27-36, 50.
[5] 罗兴隆,贺兴时,周洁,杨新社. 基于非洲秃鹫优化算法改进的密度峰值聚类[J]. 《山东大学学报(理学版)》, 2024, 59(1): 46-55,71.
[6] 金鑫,于非凡,戴雨桐,李兹谦,邹永魁. 基于聚类分析和鉴别信息的教学效果评价模型分析[J]. 《山东大学学报(理学版)》, 2023, 58(7): 115-120.
[7] 徐华畅,许倩,赵钰琳,梁峰宁,徐凯,朱红. 基于改进EfficientNetV2的脑胶质瘤IDH1突变状态预测方法[J]. 《山东大学学报(理学版)》, 2023, 58(7): 60-66.
[8] 马慧,魏立力. 基于犹豫三角模糊相关系数的聚类分析[J]. 《山东大学学报(理学版)》, 2023, 58(12): 118-126.
[9] 凡嘉琛,王平心,杨习贝. 基于三支决策的密度敏感谱聚类[J]. 《山东大学学报(理学版)》, 2023, 58(1): 59-66.
[10] 李心雨,范辉,刘惊雷. 基于自适应图调节和低秩矩阵分解的鲁棒聚类[J]. 《山东大学学报(理学版)》, 2022, 57(8): 21-38.
[11] 柳利芳,马园园. 基于多视角对称非负矩阵分解的跨模态信息检索方法[J]. 《山东大学学报(理学版)》, 2022, 57(7): 65-72.
[12] 孙林,梁娜,徐久成. 基于自适应邻域互信息与谱聚类的特征选择[J]. 《山东大学学报(理学版)》, 2022, 57(12): 13-24.
[13] 武祺然,周力凯,孙金金,王念鸽,余群芳. 浙江省空气质量变化特征研究——基于函数型数据分析[J]. 《山东大学学报(理学版)》, 2021, 56(7): 53-64.
[14] 杨婷,朱恒东,马盈仓,汪义瑞,杨小飞. 基于L2,1范数和流形正则项的半监督谱聚类算法[J]. 《山东大学学报(理学版)》, 2021, 56(3): 67-76.
[15] 张一鸣,王国胤,胡军,傅顺. 基于密度峰值和网络嵌入的重叠社区发现[J]. 《山东大学学报(理学版)》, 2021, 56(1): 91-102.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!