Journal of Shandong University (Natural Science), 2016, Vol. 51, Issue (7): 11-17. doi: 10.6040/j.issn.1671-9352.1.2015.060

Deduplicating search results of cloud disk resources using meta-information

LIU Chi, YAN Hong-fei   

  1. Institute of Network Computing and Information Systems, Peking University, Beijing 100871, China
  • Received: 2015-11-14 Online: 2016-07-20 Published: 2016-07-27
  • About the author: LIU Chi (1991- ), male, master's student; research interests: search engines and data mining. E-mail: liuchi09@gmail.com
  • Funding:
    National Key Basic Research Program of China (973 Program) (2014CB340400); National Natural Science Foundation of China (61272340, 61472013)

Abstract: Unlike web pages, for which classical duplicate detection methods compute text similarity, cloud disk resources consist mainly of multimedia files and offer only very limited meta-information for deduplicating search results. To address this problem, this work builds on a search engine system constructed for cloud disk resource data. An analysis of the characteristics of cloud disk resource meta-information shows that, besides the resource name, the file extension, the file size, and the owning user are effective features for identifying duplicate records. On this basis, the paper presents a normalization method for these features and then applies an unsupervised method to perform deduplication. Experimental results show that the method can effectively deduplicate search results for cloud disk resources.

Key words: search engine, deduplication, meta-information, cloud disk resources
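
The abstract names four features (resource name, file extension, file size, owning user), a normalization step, and an unsupervised grouping step, but this page does not give their details. The following Python sketch is only one possible reading, not the authors' method. The `Record` fields, the normalization rules, the weights, the threshold, the treatment of a shared owner as weak positive evidence, and the single-link grouping are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """One search result; the field names are hypothetical."""
    name: str
    ext: str
    size: int   # bytes
    owner: str


def normalize_name(name: str) -> str:
    # Lowercase and drop punctuation, whitespace, and underscores.
    return re.sub(r"[\W_]+", "", name.lower())


def normalize_ext(ext: str) -> str:
    # Lowercase and drop a leading dot, so ".MKV" and "mkv" compare equal.
    return ext.lower().lstrip(".")


def size_similarity(a: int, b: int) -> float:
    # Ratio of the smaller to the larger size, in [0, 1].
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def similarity(x: Record, y: Record) -> float:
    # Assumption: records with different file types are never duplicates.
    if normalize_ext(x.ext) != normalize_ext(y.ext):
        return 0.0
    name_sim = SequenceMatcher(
        None, normalize_name(x.name), normalize_name(y.name)
    ).ratio()
    size_sim = size_similarity(x.size, y.size)
    owner_sim = 1.0 if x.owner == y.owner else 0.0
    # Illustrative weights, not taken from the paper.
    return 0.5 * name_sim + 0.4 * size_sim + 0.1 * owner_sim


def deduplicate(records, threshold=0.85):
    """Group records linked by similarity >= threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


if __name__ == "__main__":
    results = [
        Record("Game.of.Thrones.S01E01", ".MKV", 1_000_000_000, "alice"),
        Record("game of thrones s01e01", "mkv", 1_000_000_000, "bob"),
        Record("Holiday photos", "zip", 52_000_000, "carol"),
    ]
    for group in deduplicate(results):
        print([r.name for r in group])
```

Because no labels are needed, the threshold is the only value to tune in this sketch; a search front end could then show one representative record per group.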

CLC number: 

  • TP393