您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(理学版)》

《山东大学学报(理学版)》 ›› 2024, Vol. 59 ›› Issue (3): 27-36, 50.doi: 10.6040/j.issn.1671-9352.4.2023.040

•   • 上一篇    下一篇

模糊边界剥离聚类

孙嘉睿(),杜明晶*()   

  1. 江苏师范大学计算机科学与技术学院,江苏 徐州 221100
  • 收稿日期:2023-06-02 出版日期:2024-03-20 发布日期:2024-03-06
  • 通讯作者: 杜明晶 E-mail:sunjr@jsnu.edu.cn;dumj@jsnu.edu.cn
  • 作者简介:孙嘉睿(1997—), 男, 硕士研究生, 研究方向为机器学习和数据挖掘. E-mail: sunjr@jsnu.edu.cn
  • 基金资助:
    国家自然科学基金资助项目(62006104);江苏师范大学研究生科研创新项目(2022XKT1528)

Fuzzy border-peeling clustering

Jiarui SUN(),Mingjing DU*()   

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221100, Jiangsu, China
  • Received:2023-06-02 Online:2024-03-20 Published:2024-03-06
  • Contact: Mingjing DU E-mail:sunjr@jsnu.edu.cn;dumj@jsnu.edu.cn

摘要:

提出了一种模糊边界剥离聚类(fuzzy border-peeling clustering, FBP)算法。首先, 采用了一种基于Cauchy核的动态密度估计方式来计算数据点密度; 然后, 使用逐层剥离策略区分边界数据和核心数据; 接着, 利用核心数据间的可达性实现核心区域聚类; 最后, 采用模糊分配策略实现边界数据的软划分。在人工数据集和真实数据集上与10种算法(包含6种密度聚类算法和4种模糊聚类算法)作了对比。实验结果表明, 在所有数据集上, FBP的调整兰德系数ARI指标平均提高了21%~60%, FBP的标准化互信息NMI指标平均提升了12%~47%, 基于Cauchy核和模糊分配策略优化后的边界剥离聚类算法显著提高了聚类的准确性。

关键词: 密度聚类, 边界剥离聚类, 模糊聚类, 软化分, 柯西核函数

Abstract:

A fuzzy border-peeling clustering (FBP) algorithm is proposed. First, a density estimation method based on Cauchy kernel is used to calculate the densities of data points. Secondly, the boundary data are separated from the core data using the layer-by-layer peeling strategy. Thirdly, the reachability between the core data is used to achieve the core region clustering. Finally, a fuzzy assignment strategy is used to achieve the soft partitioning of the boundary data. A comparison is made between the fuzzy border-peeling clustering and 10 benchmark algorithms, including 6 density-based clustering algorithms and 4 fuzzy clustering algorithms, on artificial and real-world datasets. The experimental results show that on all datasets, FBP has the ARI (adjusted rand index) increased by 21% to 60% on average, and FBP has the NMI (normalized mutual information) increased by 12% to 47% on average. The border-peeling clustering algorithm optimized based on Cauchy kernel and fuzzy assignment strategy significantly improves the accuracy of clustering.

Key words: density-based clustering, border-peeling clustering, fuzzy clustering, soft clustering, Cauchy kernel function

中图分类号: 

  • TP18

表1

符号摘要"

符号 概念
n 数据点数目
k 近邻数目
t t次剥离
xi 任意数据点
X 数据集合
X(t) t次剥离, 剩余数据集合
NNk(t) (xi) t次剥离, 数据点xi的第k近邻
Nk(t) (xi) t次剥离, 数据点xik近邻集合
RNk(t) (xi) t次剥离, 数据点xi的反k近邻集合

图1

FBP聚类算法流程"

表2

实验数据集"

数据集名称 数据量 维数 类数
Jain 373 2 2
DD 1 300 2 3
Pathbased 300 2 3
Handl 715 2 3
SF 4 096 2 4
T4 7 236 2 6
Glass 214 10 6
Control 600 60 6
Dig 1 797 64 10
PageB 5 473 10 5
Optdigits 5 620 64 10
Crowd 10 845 28 6

图2

人工数据集上的聚类结果"

表3

人工数据集上的性能评估"

AlgorithmJainDD
ARI NMI params ARI NMI params
FBP 1.000 1.000 k=19 0.992 0.980 k=24
BP 0.478 0.677 k=25 0.154 0.530 k=25
DBSCAN 1.000 1.000 ε=3.1, MinPts=28 0.962 0.856 ε=0.1, MinPts=5
HDBSCAN 0.209 0.372 k=5, mc=5 0.527 0.488 k=28, mc=20
DPC 0.334 0.617 dc=6%, α=2.5 0.256 0.619 dc=4%, α=2
DPC-CE 0.304 0.609 dc=2%, α=2.5 0.214 0.589 dc=4%, α=2
VDPC 1.000 1.000 dc=5%, δ=5 0.302 0.509 dc=3%, δ=1
FCM 0.318 0.367 M=1.5 0.213 0.277 M=4
kFCM 0.239 0.323 M=3.5, δ=2 0.204 0.299 M=2, δ=2
MkFC 0.421 0.448 M=1.5 0.221 0.245 M=3.5
PCM 0.272 0.362 M=3.5 0.211 0.287 M=2
AlgorithmPathbasedHandl
ARI NMI paraMs ARI NMI params
FBP 1.000 1.000 k=9 0.988 0.974 k=30
BP 0.980 0.969 k=9 0.659 0.785 k=30
DBSCAN 0.901 0.871 ε=3.35, MinPts=30 0.720 0.668 ε=0.1, MinPts=5
HDBSCAN 0.016 0.126 k=5, mc=20 0.250 0.358 k=9, mc=15
DPC 0.342 0.556 dc=2%, α=0.5 0.844 0.828 dc=5%, α=1.5
DPC-CE 0.520 0.611 dc=4%, α=1 0.762 0.818 dc=8%, α=2.5
VDPC 0.661 0.707 dc=2%, δ=2 0.762 0.818 dc=5%, δ=1
FCM 0.465 0.550 M=1.5 0.625 0.654 M=1.5
KFCM 0.454 0.545 M=1.5, δ=1 0.560 0.672 M=3, δ=0.5
MKFC 0.426 0.497 M=4 0.568 0.623 M=2
PCM 0.409 0.513 M=4 0.330 0.437 M=4
AlgorithmSFT4
ARI NMI params ARI NMI params
FBP 0.978 0.953 k=23 1.000 1.000 k=26
BP 0.201 0.599 k=30 0.361 0.737 k=50
DBSCAN 0.135 0.490 ε=0.1, MinPts=5 0.999 0.997 ε=0.1, MinPts=16
HDBSCAN 0.127 0.304 k=21, mc=20 0.271 0.518 k=46, mc=20
DPC 0.553 0.754 dc=4%, α=2.5 0.828 0.883 dc=5%, α=2.5
DPC-CE 0.346 0.662 dc=6%, α=2.5 0.541 0.798 dc=5%, α=2.5
VDPC 0.894 0.898 dc=2%, δ=2 0.724 0.826 dc=2%, δ=1
FCM 0.452 0.495 M=1.5 0.634 0.724 M=4
KFCM 0.542 0.560 M=2.5, δ=1.5 0.644 0.734 M=3, δ=1.5
MKFC 0.423 0.474 M=4 0.432 0.533 M=4
PCM 0.329 0.416 M=4 0.372 0.601 M=2.5

表4

真实数据集上的性能评估"

AlgorithmGlassControl
ARI NMI params ARI NMI params
FBP 0.693 0.769 k=6 0.680 0.856 k=19
BP 0.393 0.670 k=5 0.626 0.805 k=10
DBSCAN 0.575 0.678 ε=4.35, MinPts=9 0.136 0.480 ε=1.35, MinPts=30
HDBSCAN 0.255 0.370 k=5, mc=20 0.236 0.476 k=13, mc=20
DPC 0.672 0.753 dc=10%, α=2.5 0.540 0.741 dc=10%, α=2.5
DPC-CE 0.642 0.765 dc=8%, α=2 0.543 0.744 dc=9%, α=2.5
VDPC 0.363 0.713 dc=5%, δ=5 0.554 0.681 dc=10%, δ=1
FCM 0.570 0.762 m=2 0.619 0.732 m=1.5
KFCM 0.201 0.467 m=1.5, δ=2 0.660 0.754 m=1.5, δ=2
MKFC 0.242 0.344 m=1.5 0.646 0.763 m=1.5
PCM 0.282 0.464 m=2 0.536 0.686 m=1.5
AlgorithmDigPageB
ARI NMI params ARI NMI params
FBP 0.795 0.858 k=9 0.413 0.305 k=20
BP 0.397 0.698 k=10 0.095 0.210 k=23
DBSCAN 0.557 0.741 ε=1.35, MinPts=4 0.434 0.248 ε=0.1, MinPts=50
HDBSCAN 0.355 0.675 k=2, mc=5 0.318 0.130 k=29, mc=20
DPC 0.781 0.849 dc=2%, α=2.5 0.335 0.213 dc=8%, α=1
DPC-CE 0.714 0.826 dc=4%, α=2.5 0.384 0.255 dc=8%, α=1
VDPC 0.180 0.597 dc=2%, δ=2 0.012 0.050 dc=10%, δ=1
FCM 0.251 0.464 M=1.5 0.091 0.142 M=1.5
KFCM 0.452 0.559 M=2.5, δ=1 0.076 0.147 M=1.5, δ=1.5
MKFC 0.000 0.000 M=1.5 0.049 0.090 M=1.5
PCM 0.254 0.432 M=2.5 0.036 0.052 M=1.5
AlgorithmOptdigitsCrowd
ARI NMI params ARI NMI params
FBP 0.832 0.877 k=12 0.425 0.403 k=46
BP 0.095 0.210 k=23 0.354 0.349 k=46
DBSCAN 0.532 0.732 ε=1.35, MinPts=23 0.263 0.339 ε=0.35, MinPts=35
HDBSCAN 0.409 0.653 k=2, mc=5 0.285 0.347 k=41, mc=5
DPC 0.736 0.808 dc=2%, α=2.5 0.006 0.365 dc=3%, α=1
DPC-CE 0.723 0.812 dc=4%, α=2.5 0.031 0.371 dc=2%, α=1
VDPC 0.141 0.576 dc=2%, δ=2 0.000 0.000 dc=2%, δ=2
FCM 0.249 0.408 M=3.5 0.106 0.242 M=1.5
KFCM 0.356 0.528 M=2.5, δ=1 0.121 0.246 M=1.5, δ=2
MKFC 0.000 0.000 M=1.5 0.174 0.258 M=1.5
PCM 0.246 0.414 M=2 0.171 0.114 M=1.5
1 徐晓, 丁世飞, 丁玲. 密度峰值聚类算法研究进展[J]. 软件学报, 2022, 33 (5): 1800- 1816.
XU Xiao , DING Shifei , DING Ling . Survey on density peaks clustering algorithm[J]. Journal of Software, 2022, 33 (5): 1800- 1816.
2 PENG D , GUI Z , WANG D , et al. Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity[J]. Nature Communications, 2022, 13 (1): 5455.
doi: 10.1038/s41467-022-33136-9
3 ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Menlo Park: AAAI Press, 1996: 226-231.
4 CAMPELLO R J, MOULAVI D, SANDER J. Density-based clustering based on hierarchical density estimates[C]//Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Berlin: Springer, 2013: 160-172.
5 RODRIGUEZ A , LAIO A . Clustering by fast search and find of density peaks[J]. Science, 2014, 344 (6191): 1492- 1496.
doi: 10.1126/science.1242072
6 CHEN H , LIANG M , LIU W , et al. An approach to boundary detection for 3D point clouds based on dbscan clustering[J]. Pattern Recognition, 2022, 124, 108431.
doi: 10.1016/j.patcog.2021.108431
7 OUYANG T , PEDRYCZ W , PIZZI N J . Rule-based modeling with dbscan-based information granules[J]. IEEE Transactions on Cybernetics, 2021, 51 (7): 3653- 3663.
doi: 10.1109/TCYB.2019.2902603
8 LU J , ZHAO Y , TAN K L , et al. Distributed density peaks clustering revisited[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34 (8): 3714- 3726.
doi: 10.1109/TKDE.2020.3034611
9 RASOOL Z , ZHOU R , CHEN L , et al. Index-based solutions for efficient density peak clustering[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34 (5): 2212- 2226.
doi: 10.1109/TKDE.2020.3004221
10 盛锦超, 杜明晶, 李宇蕊, 等. 结合柯西核的分类型数据密度峰值聚类算法[J]. 计算机工程与应用, 2022, 58 (18): 162- 171.
SHENG Jingchao , DU Mingjing , LI Yurui , et al. Cauchy kernel-based density peaks clustering algorithm for categorical data[J]. Computer Engineering and Applications, 2022, 58 (18): 162- 171.
11 AVERBUCH-ELOR H , BAR N , COHEN-OR D . Border-peeling clustering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42 (7): 1791- 1797.
doi: 10.1109/TPAMI.2019.2924953
12 ANKERST M, BREUNIG M M, KRIEGEL H P, et al. Optics: ordering points to identify the clustering structure[C]//Proceedings of the 1999 International Conference on Management of Data. New York: ACM, 1999: 49-60.
13 GUO W , WANG W , ZHAO S , et al. Density peak clustering with connectivity estimation[J]. Knowledge-Based Systems, 2022, 243, 108501.
doi: 10.1016/j.knosys.2022.108501
14 WANG Y , WANG D , ZHOU Y , et al. VDPC: variational density peak clustering algorithm[J]. Information Sciences, 2023, 621, 627- 651.
doi: 10.1016/j.ins.2022.11.091
15 DU M , WANG R , JI R , et al. Robp a robust border-peeling clustering using Cauchy kernel[J]. Information Sciences, 2021, 571, 375- 400.
doi: 10.1016/j.ins.2021.04.089
16 陈延伟, 赵兴旺. 基于边界点检测的变密度聚类算法[J]. 计算机应用, 2022, 42 (8): 2450- 2460.
CHEN Yanwei , ZHAO Xingwang . Varied density clustering algorithm based on border point detection[J]. Journal of Computer Applications, 2022, 42 (8): 2450- 2460.
17 张柏恺, 杨德刚, 冯骥. 一种去除聚类数量k和邻域参数c设置的自适应聚类算法[J]. 计算机工程与科学, 2021, 43 (10): 1838- 1847.
doi: 10.3969/j.issn.1007-130X.2021.10.018
ZHANG Bokai , YANG Degang , FENG Ji . A self-adaptive clustering algorithm without neighborhood parameter k and cluster number c[J]. Computer Engineering & Science, 2021, 43 (10): 1838- 1847.
doi: 10.3969/j.issn.1007-130X.2021.10.018
18 DUNN J C . A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters[J]. Journal of Cybernetics, 1973, 3, 32- 57.
19 KARAYIANNIS N B. Meca: maximum entropy clustering algorithm[C]//Proceedings of 3rd IEEE International Conference on Fuzzy Systems. Piscataway: IEEE, 1994: 630-635.
20 GAN G, WU J, YANG Z. A fuzzy subspace algorithm for clustering high dimensional data[C]//Proceedings of 2nd International Conference on Advanced Data Mining and Applications. Berlin: Springer, 2006: 271-278.
21 FAZENDEIRO P , DE OLIVEIRA J V . Observer-biased fuzzy clustering[J]. IEEE Transactions on Fuzzy Systems, 2014, 23 (1): 85- 97.
22 KRISHNAPURAM R , KELLER J M . The possibilistic c-means algorithm: insights and recommendations[J]. IEEE Transactions on Fuzzy Systems, 1996, 4 (3): 385- 393.
23 ZHANG D Q , CHEN S C . Clustering incomplete data using kernel-based fuzzy c-means algorithm[J]. Neural Processing Letters, 2003, 18 (3): 155- 162.
24 HUANG H C , CHUANG Y Y , CHEN C S . Multiple kernel fuzzy clustering[J]. IEEE Transactions on Fuzzy Systems, 2011, 20 (1): 120- 134.
25 DING S , DU M , SUN T , et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood[J]. Knowledge-Based Systems, 2017, 133, 294- 313.
[1] 万润泽1,雷建军1,袁操2. 基于模糊聚类理论的无线传感器节点休眠优化策略[J]. J4, 2013, 48(09): 17-21.
[2] 付雪峰,刘邱云,王明文 . 基于互信息的粗糙集信息检索模型[J]. J4, 2006, 41(3): 116-119 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 梁霄, 王林山 . 一类S分布时滞递归神经网络的全局吸引子[J]. J4, 2009, 44(4): 57 -60 .
[2] 董新梅 . 关于Suryanarayana的若干问题[J]. J4, 2007, 42(2): 83 -86 .
[3] 宋玉丹,王士同*. 基于特征缺省的最小类内方差支持向量机[J]. J4, 2010, 45(7): 102 -107 .
[4] 史艳华1,石东洋2*. 伪双曲方程类Wilson非协调元逼近[J]. J4, 2013, 48(4): 77 -84 .
[5] 刘汝军,曹玉霞,周 平 . 利用小反馈实现离散非线性混沌系统的反控制[J]. J4, 2007, 42(7): 30 -32 .
[6] 张雪凤1,刘鹏1,2. 基于类间差异最大化的加权距离改进K-means算法[J]. J4, 2010, 45(7): 28 -33 .
[7] 吴大华,何振峰*. 对基于聚类和遗传算法的时间序列分割算法的改进[J]. J4, 2010, 45(7): 45 -49 .
[8] 王贵杰 王文德 姚德利. CAD技术在道路平面交叉口竖向设计中的应用[J]. J4, 2009, 44(11): 79 -82 .
[9] 许春华,高宝玉*,卢磊,徐世平,曹百川,岳钦艳,张建 . 城市纳污河道废水化学强化一级处理的研究[J]. J4, 2006, 41(2): 116 -120 .
[10] 陈宏宇1, 张丽2. 不含弦5-圈和弦6-圈的平面图的线性2荫度[J]. 山东大学学报(理学版), 2014, 49(06): 26 -30 .