Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (7): 11-17. doi: 10.6040/j.issn.1671-9352.1.2015.060
LIU Chi (刘驰), YAN Hong-fei (闫宏飞)
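Abstract: Traditional deduplication methods compute the similarity of web page text. Cloud storage resources, by contrast, consist mainly of multimedia files, so only quite limited metadata is available for deduplicating their search results. To address this problem, we build on a search engine system developed for cloud storage resource data and analyze the characteristics of the resources' metadata. The analysis shows that, besides the name, a resource's file extension, its size, and the user it belongs to are effective features for identifying duplicate records. On this basis, we give normalization methods for these features and then deduplicate with an unsupervised method. Experimental results show that the method effectively deduplicates search results for cloud storage resources.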
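The abstract names the features but not the concrete normalization rules or the unsupervised procedure, so the following is only a minimal sketch of the general idea and not the authors' method. The `Resource` type, the feature weights, the 0.85 threshold, and the greedy grouping are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record for one search result."""
    name: str   # file name as shown in the result list
    size: int   # size in bytes
    owner: str  # id of the user who shares the file


def normalize(res):
    """Split the name into a cleaned stem and a lowercase extension."""
    stem, dot, ext = res.name.rpartition(".")
    if not dot:  # the name has no extension
        stem, ext = res.name, ""
    # Lowercase the stem and drop punctuation, whitespace and underscores.
    stem = re.sub(r"[\W_]+", "", stem.lower())
    return stem, ext.lower(), res.size


def similarity(a, b):
    """Weighted similarity over the four features (weights are assumed)."""
    stem_a, ext_a, size_a = normalize(a)
    stem_b, ext_b, size_b = normalize(b)
    name_sim = SequenceMatcher(None, stem_a, stem_b).ratio()
    ext_sim = 1.0 if ext_a == ext_b else 0.0
    largest = max(size_a, size_b)
    size_sim = min(size_a, size_b) / largest if largest else 1.0
    owner_sim = 1.0 if a.owner == b.owner else 0.0
    return 0.4 * name_sim + 0.2 * ext_sim + 0.3 * size_sim + 0.1 * owner_sim


def deduplicate(results, threshold=0.85):
    """Keep a result only if it is not close to any result already kept."""
    kept = []
    for res in results:
        if all(similarity(res, rep) < threshold for rep in kept):
            kept.append(res)
    return kept


results = [
    Resource("Big.Buck.Bunny.1080p.MKV", 1000, "u1"),
    Resource("big buck bunny 1080p.mkv", 1000, "u2"),
    Resource("lecture01.pdf", 2_000_000, "u1"),
]
unique = deduplicate(results)
print([r.name for r in unique])
```

In this sketch the first two records collapse into one because their normalized names, extensions, and sizes agree, while the PDF is kept as a separate result. No labeled training data is needed, which is the sense in which the grouping is unsupervised.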