JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2016, Vol. 51, Issue (7): 11-17. doi: 10.6040/j.issn.1671-9352.1.2015.060

Deduplicating search results of cloud disk resources using meta-information

LIU Chi, YAN Hong-fei   

  1. Institute of Network Computing and Information Systems, Peking University, Beijing 100871, China
  • Received: 2015-11-14 Online: 2016-07-20 Published: 2016-07-27

Abstract: Unlike classical duplicate detection methods, which compute the text similarity of web pages, deduplicating search results for multimedia cloud disk resources must rely on limited meta-information. This research is based on a newly established search engine for cloud disk resources. The paper analyzes the characteristics of cloud disk resource meta-information and finds that, besides the resource name, the filename extension, size, and ownership are significant features for detecting duplicate records. Accordingly, the paper proposes a feature normalization method and trains an unsupervised method for the task. Experiments show that this method deduplicates cloud disk resource search results effectively.
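
The abstract names the features (resource name, filename extension, size, ownership) but gives neither the normalization rules nor the unsupervised model. The sketch below is therefore only an illustration of the general idea, not the paper's method: the `Resource` record, the `normalize` rules, the hand-picked weights in `similarity`, and the `threshold` in `deduplicate` are all hypothetical, and a simple greedy filter stands in for the trained model.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Resource:
    """Hypothetical search-result record holding only meta-information."""
    name: str    # display name of the shared file
    size: int    # size in bytes
    owner: str   # account that shared the file


def normalize(res):
    """Split the name into (stem, extension); drop case and punctuation."""
    stem, dot, ext = res.name.rpartition(".")
    if not dot:  # the name has no extension
        stem, ext = res.name, ""
    stem = re.sub(r"[\W_]+", "", stem.lower())
    return stem, ext.lower()


def similarity(a, b):
    """Illustrative score in [0, 1]; the weights are assumptions."""
    stem_a, ext_a = normalize(a)
    stem_b, ext_b = normalize(b)
    name_sim = SequenceMatcher(None, stem_a, stem_b).ratio()
    ext_sim = 1.0 if ext_a == ext_b else 0.0
    largest = max(a.size, b.size)
    size_sim = 1.0 - abs(a.size - b.size) / largest if largest else 1.0
    owner_sim = 1.0 if a.owner == b.owner else 0.0
    return 0.5 * name_sim + 0.2 * ext_sim + 0.2 * size_sim + 0.1 * owner_sim


def deduplicate(results, threshold=0.85):
    """Keep the first result of each near-duplicate group, in rank order."""
    kept = []
    for res in results:
        if all(similarity(res, k) < threshold for k in kept):
            kept.append(res)
    return kept
```

Under these assumed weights, two results named `Big.Buck.Bunny.mkv` and `big buck bunny.MKV` with the same size collapse into one entry even when they are shared by different owners, while an unrelated `lecture notes.pdf` is kept.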

Key words: search engine, deduplicate, meta-information, cloud disk resources

CLC Number: TP393