Journal of Shandong University (Natural Science), 2024, Vol. 59, Issue (8): 118-126. doi: 10.6040/j.issn.1671-9352.0.2023.250

An efficient outlier detection method based on multi-factor fusion

Zhiqiang YANG, Shan FENG*, Yi YIN, Huijia WU

  1. College of Mathematics and Science, Sichuan Normal University, Chengdu 610068, Sichuan, China
  • Received: 2023-06-05  Online: 2024-08-20  Published: 2024-07-31
  • Contact: Shan FENG  E-mail: 1601298260@qq.com; fengshanrq@sohu.com
  • About the author: Zhiqiang YANG (b. 1998), male, master's degree; research interests: rough sets and data mining. E-mail: 1601298260@qq.com
  • Supported by:
    National Natural Science Foundation of China (61673285); National Natural Science Foundation of China (62306196); Natural Science Foundation of Sichuan Province (24NSFSC1566); Natural Science Foundation of Sichuan Province (24NSFSC1487)

Abstract:

Building on granulation features of neighborhood rough sets, namely the relative ratio of an object's neighborhood and the importance of an object, this paper proposes an improved multi-factor fusion outlier detection algorithm based on neighborhood rough entropy (neighborhood rough entropy-based outlier, NREOD). Comparison experiments on standard data sets from the University of California, Irvine (UCI) repository show that NREOD achieves a lower false positive rate on data sets of different types, and that it is more adaptable and effective than the compared algorithms. The algorithm offers a new and effective approach to research on, and applications of, outlier detection in data sets with mixed attributes.

Key words: data mining, outlier detection, neighborhood rough set, neighborhood rough entropy, multi-factor fusion

CLC Number:

  • TP181

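Fig. 1

Structure of the outlier factor measure based on neighborhood rough entropy fused with the neighborhood relative ratio

Fig. 2

Flowchart of the NREOD algorithm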

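Table 1

Comparative experimental results on the German data set U_G, where |U_G| = 714 and |R_UG| = 14

| K/% | k | NREOD T1/% | NREOD t1 | KNN T2/% | KNN t2 | DIS T3/% | DIS t3 | FindCBLOF T4/% | FindCBLOF t4 | SEQ T5/% | SEQ t5 | IE T6/% | IE t6 | NGOD T7/% | NGOD t7 | OD_NGE T8/% | OD_NGE t8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.98 | 7 | 14.29 | 2 | 28.57 | 4 | 14.29 | 2 | 21.43 | 3 | 14.29 | 2 | 21.43 | 3 | 21.43 | 3 | 21.43 | 3 |
| 1.96 | 14 | 35.71 | 5 | 28.57 | 4 | 21.43 | 3 | 35.71 | 5 | 35.71 | 5 | 28.57 | 4 | 42.86 | 6 | 35.71 | 5 |
| 5.04 | 36 | 85.71 | 12 | 57.14 | 8 | 35.71 | 5 | 50.00 | 7 | 64.29 | 9 | 57.14 | 8 | 92.86 | 13 | 71.43 | 10 |
| 6.02 | 43 | 100.00 | 14 | 57.14 | 8 | 42.86 | 6 | 71.43 | 10 | 64.29 | 9 | 71.43 | 10 | 92.86 | 13 | 78.57 | 11 |
| 7.56 | 54 | 100.00 | 14 | 64.29 | 9 | 64.29 | 9 | 100.00 | 14 | 71.43 | 10 | 92.86 | 13 | 100.00 | 14 | 85.71 | 12 |
| 11.20 | 80 | 100.00 | 14 | 100.00 | 14 | 78.57 | 11 | 100.00 | 14 | 71.43 | 10 | 100.00 | 14 | 100.00 | 14 | 92.86 | 13 |
| 16.95 | 121 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 85.71 | 12 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 |
| 24.37 | 174 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 | 100.00 | 14 |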
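
The percentage columns in the table above appear to follow directly from the count columns: K/% looks like k divided by |U_G|, and each T/% looks like the matching t divided by |R_UG|. Reading k as the number of top-ranked objects examined and t as the number of true outliers found among them is an assumption based on the column layout, not something this page states. Under that assumption, a minimal Python sketch (the helper name `percent` is hypothetical) reproduces the first row:

```python
def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage, rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


# Sizes taken from the table caption.
U_G = 714   # |U_G|: number of objects in the German data set
R_UG = 14   # |R_UG|: assumed here to be the number of labeled outliers

# First row of the table: k = 7 objects examined, NREOD finds t1 = 2.
top_ratio = percent(7, U_G)    # K/% column
coverage = percent(2, R_UG)    # T1/% column

print(top_ratio, coverage)     # 0.98 14.29
```

The same helper can be applied to the WBC and Lymphography tables by substituting the sizes given in their captions.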

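Table 2

Comparative experimental results on the WBC data set U_W, where |U_W| = 483 and |R_UW| = 39

| K/% | k | NREOD T1/% | NREOD t1 | KNN T2/% | KNN t2 | DIS T3/% | DIS t3 | FindCBLOF T4/% | FindCBLOF t4 | SEQ T5/% | SEQ t5 | IE T6/% | IE t6 | OD_NGE T7/% | OD_NGE t7 | NGOD T8/% | NGOD t8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.83 | 4 | 10.26 | 4 | 10.26 | 4 | 10.26 | 4 | 10.26 | 4 | 7.69 | 3 | 10.26 | 4 | 10.26 | 4 | 10.26 | 4 |
| 1.66 | 8 | 20.51 | 8 | 20.51 | 8 | 12.82 | 5 | 17.95 | 7 | 17.95 | 7 | 17.95 | 7 | 20.51 | 8 | 20.51 | 8 |
| 3.31 | 16 | 41.03 | 16 | 41.03 | 16 | 28.21 | 11 | 35.90 | 14 | 35.90 | 14 | 38.46 | 15 | 38.46 | 15 | 41.03 | 16 |
| 4.97 | 24 | 61.54 | 24 | 51.28 | 20 | 46.15 | 18 | 53.85 | 21 | 53.85 | 21 | 53.85 | 21 | 58.97 | 23 | 58.97 | 23 |
| 6.63 | 32 | 74.36 | 29 | 69.23 | 27 | 61.54 | 24 | 69.23 | 27 | 71.79 | 28 | 71.79 | 28 | 76.92 | 30 | 74.36 | 29 |
| 8.28 | 40 | 87.17 | 34 | 82.05 | 32 | 74.36 | 29 | 82.05 | 32 | 82.05 | 32 | 84.61 | 33 | 87.17 | 34 | 87.17 | 34 |
| 10.14 | 49 | 100.00 | 39 | 94.87 | 37 | 92.31 | 36 | 89.74 | 35 | 89.74 | 35 | 92.31 | 36 | 97.44 | 38 | 97.44 | 38 |
| 11.59 | 56 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 97.44 | 38 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 |
| 13.25 | 64 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 | 100.00 | 39 |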

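Table 3

Comparative experimental results on the Lymphography data set U_L, where |U_L| = 148 and |R_UL| = 6

| K/% | k | NREOD T1/% | NREOD t1 | KNN T2/% | KNN t2 | DIS T3/% | DIS t3 | FindCBLOF T4/% | FindCBLOF t4 | SEQ T5/% | SEQ t5 | IE T6/% | IE t6 | NGOD T7/% | NGOD t7 | OD_NGE T8/% | OD_NGE t8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.05 | 6 | 83.33 | 5 | 66.67 | 4 | 66.67 | 4 | 66.67 | 4 | 33.33 | 2 | 83.33 | 5 | 83.33 | 5 | 83.33 | 5 |
| 4.73 | 7 | 100.00 | 6 | 66.67 | 4 | 83.33 | 5 | 66.67 | 4 | 83.33 | 5 | 83.33 | 5 | 100.00 | 6 | 83.33 | 5 |
| 5.41 | 8 | 100.00 | 6 | 83.33 | 5 | 83.33 | 5 | 66.67 | 4 | 83.33 | 5 | 100.00 | 6 | 100.00 | 6 | 83.33 | 5 |
| 6.08 | 9 | 100.00 | 6 | 100.00 | 6 | 83.33 | 5 | 66.67 | 4 | 83.33 | 5 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 |
| 8.11 | 12 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 66.67 | 4 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 |
| 13.51 | 20 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 66.67 | 4 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 |
| 20.27 | 30 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 | 100.00 | 6 |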