基于自然邻居搜索优化策略的密度峰值聚类算法

doi:10.6040/j.issn.1671-9352.7.2023.4130

摘要/Abstract

摘要： 结合自然邻居搜索算法改进了密度峰值聚类(clustering by fast search and find of density peaks, CFSFDP)算法存在的一系列问题,提出基于自然邻居搜索优化策略的密度峰值聚类(density peak clustering algorithm optimized by natural neighbor search, NaN-CFSFDP)算法。基于自然邻居搜索算法提出了一种离群样本的检测方法,针对CFSFDP算法中截断距离dc人工准确取值较难的问题,结合自然邻居搜索算法改进了d_c的计算方式,实现了d_c的自动取值。重新设计并统一了CFSFDP算法的样本密度度量规则,使其更关注每个样本的局部信息。由于数据集中因类簇间的密度差异大,密度峰值点集中于稠密簇使得簇丢失,因此提出样本共享自然邻居和类簇共享自然邻居的概念,构造新的类簇融合算法。合成数据集和真实数据集上的实验结果表明,在大多数情况下,NaN-CFSFDP算法在聚类性能上优于或至少与比较方法相当,且与CFSFDP算法及其改进算法相比参数更少。

关键词: 密度峰值, 自然邻居, 聚类, 类簇融合, 离群样本, 截断距离

Abstract: We combine the natural neighbor search algorithm to improve a series of problems of the density peaks clustering(CFSFDP)algorithm, and propose the NaN-CFSFDP algorithm. First, an outlier samples detection method is proposed based on the natural neighbor search algorithm. Then, for the problem that the truncation distance dc is difficult to be taken accurately manually in the CFSFDP algorithm, the calculation of d_c is improved in combination with the natural neighbor search algorithm, and the automatic taking of dc is realized. The metric rule of the sample density of the CFSFDP algorithm is redesigned and unified to make it pay more attention to the local information of each sample. Finally, to address the problem that the density peak points in the dataset may be concentrated in dense clusters due to the large density difference between clusters, which leads to cluster loss, the concepts of shared natural neighbors for samples and shared natural neighbors for clusters are proposed to construct a new cluster fusion algorithm. Experimental results on synthetic and real datasets show that the algorithm outperforms or is at least comparable to the comparative method in terms of clustering performance in most cases and has fewer parameters compared to CFSFDP algorithm and its improvements.

Key words: density peaks, natural neighbor, clustering, cluster fusion, outlier samples, truncation distance

中图分类号:

TP391

张春昊,解滨,徐童童,张喜梅. 基于自然邻居搜索优化策略的密度峰值聚类算法[J]. 《山东大学学报(理学版)》, 2025, 60(1): 29-44.

ZHANG Chunhao, XIE Bin, XU Tongtong, ZHANG Ximei. Density peak clustering algorithm optimized by natural neighbor search[J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2025, 60(1): 29-44.

参考文献

[1] OKTAR Y, TURKAN M. A review of sparsity-based clustering methods[J]. Signal Processing, 2018, 148:20-30.
[2] SAXENA A, PRASAD M, GUPTA A, et al. A review of clustering techniques and developments[J]. Neurocomputing, 2017, 267:664-681.
[3] MACQUEEN J B. Some methods for classification and analysis of multivariate observations[C] //Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967:281-297.
[4] BEZDEK J C, EHRLICH R, FULL W. FCM: the fuzzy c-means clustering algorithm[J]. Computers and Geosciences, 1984, 10(2/3):191-203.
[5] WANG Wei, YANG Jiong, MUNTZ R. STING: a statistical information grid approach to spatial data mining[C] //Proceedings of the 23rd International Conference on Very Large Data Bases. Athens: ACM, 1997:186-195.
[6] GUHA S, RASTOGI R, SHIM K. Cure: an efficient clustering algorithm for large databases[J]. Information Systems, 2001, 26(1):35-58.
[7] ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C] //Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Portland Oregon, AAAI, 1996:226-231.
[8] RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191):1492-1496.
[9] 谢娟英,高红超,谢维信. K近邻优化的密度峰值快速搜索聚类算法[J]. 中国科学(信息科学), 2016, 46(2):258-280. XIE Juanying, GAO Hongchao, XIE Weixin. K-nearest neighbors optimized clustering algorithm by fast search and finding the density peaks of a dataset[J]. Scientia Sinica(Informationis), 2016, 46(2):258-280.
[10] XIE Juanying, GAO Hongchao, XIE Weixin, et al. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J]. Information Sciences, 2016, 354:19-40.
[11] JIANG Jianhua, CHEN Yujun, MENG Xianqiu, et al. A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process[J]. Physica A, 2019, 523:702-713.
[12] ZHANG Rui, DU Tao, QU Shouning, et al. Adaptive density-based clustering algorithm with shared KNN conflict game[J]. Information Sciences, 2021, 565:344-369.
[13] LIU Yaohui, MA Zhengming, YU Fang. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy[J]. Knowledge-based Systems, 2017, 133:208-220.
[14] BAI Liang, CHENG Xueqi, LIANG Jiye, et al. Fast density clustering strategies based on the k-means algorithm[J]. Pattern Recognition, 2017, 71:375-386.
[15] LIU Rui, WANG Hong, YU Xiaomei. Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Information Sciences, 2018, 450:200-226.
[16] 张新元,贠卫国. 共享K近邻和多分配策略的密度峰值聚类算法[J]. 小型微型计算机系统, 2023, 44(1):75-82. ZHANG Xinyuan, YUN Weiguo. Sharing K-nearest neighbors and multiple assignment policies density peaks clustering algorithm[J]. Journal of Chinese Computer Systems, 2023, 44(1):75-82.
[17] ZHANG Chunhao, XIE Bin, ZHANG Yiran. Reverse-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Chinese Journal of Electronics, 2023, 32(6):1341-1354.
[18] 徐童童,解滨,张喜梅,等. 自适应聚类中心策略优化的密度峰值聚类算法[J]. 计算机工程与应用,2023,59(21):91-101. XU Tongtong, XIE Bin, ZHANG Ximei, et al. Density peak clustering algorithm optimized by adaptive clustering centers strategy[J]. Computer Engineering and Applications, 2023, 59(21):91-101.
[19] 张春昊,解滨,张喜梅,等. 一种结合自适应近邻与密度峰值的加权模糊聚类算法[J]. 小型微型计算机系统,2023, 44(9):1974-1982. ZHANG Chunhao, XIE Bin, ZHANG Ximei, et al. Weighted fuzzy clustering algorithm combining adaptive nearest neighbors and density peaks[J]. Journal of Chinese Computer Systems, 2023, 44(9):1974-1982.
[20] CHEN Yewang, TANG Shengyu, PEI Songwen. DHeat: a density heat-based algorithm for clustering with effective radius[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, 48(4):649-660.
[21] BRYANT A, CIOS K. RNN-DBSCAN: a density-based clustering algorithm using reverse nearest neighbor density estimates[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(6):1109-1121.
[22] ZHU Qingsheng, FENG Ji, HUANG Jinlong. Natural neighbor: a self-adaptive neighborhood method without parameter K[J]. Pattern Recognition Letters, 2016, 80:30-36.
[23] HUANG Jinlong, ZHU Qingsheng, YANG Lijun, et al. A non-parameter outlier detection algorithm based on natural neighbor[J]. Knowledge-based Systems, 2016, 92:71-77.
[24] YANG Lijun, ZHU Qingsheng, HUANG Jinlong, et al. Adaptive edited natural neighbor algorithm[J]. Neurocomputing, 2017, 230:427-433.
[25] CHENG Dongdong, ZHU Qingsheng, HUANG Jinlong, et al. Natural neighbor-based clustering algorithm with local representatives[J]. Knowledge-based Systems, 2017, 123:238-253.
[26] 陈叶旺,申莲莲,钟才明,等. 密度峰值聚类算法综述[J]. 计算机研究与发展, 2020, 57(2):378-394. CHEN Yewang, SHEN Lianlian, ZHONG Caiming, et al. Survey on density peak clustering algorithm[J]. Journal of Computer Research and Development, 2020, 57(2):378-394.
[27] FU Limin, MEDICO E. Flame: a novel fuzzy clustering method for the analysis of DNA microarray data[J]. BMC Bioinformatics, 2007, 8(3):1-15.
[28] GIONIS A, MANNILA H, TSAPARAS P. Clustering aggregation[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1):1-30.
[29] JAIN A K, LAW M H C. Data clustering: a users dilemma[C] //International Conference on Pattern Recognition & Machine Intelligence. Kolkata: Springer, 2005:1-10.
[30] DING Shifei, DU Mingjing, SUN Tongfeng, et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood[J]. Knowledge-based Systems, 2017, 133:294-313.
[31] HUBERT L, ARABIE P. Comparing partitions[J]. Journal of Classification, 1985, 2:193-218.
[32] CHANG H, YEUNG D Y. Robust path-based spectral clustering[J]. Pattern Recognition, 2008, 41(1):191-203.
[33] VEENMAN C J, REINDERS M J T, BACKER E. A maximum variance cluster algorithm[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9):1273-1280.
[34] ZAHN C T. Graph-theoretical methods for detecting and describing gestalt clusters[J]. IEEE Transactions on Computers, 1971, C-20(1):68-86.
[35] REZAE M, FRANTI P. Set matching measures for external cluster validity[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(8):2173-2186.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed