Entry page search algorithm based on URL-type prior probabilities

HU Jungang, DONG Shou-bin, CHEN Xiao-zhi, ZHANG Yuan-feng

Guangdong Key Laboratory of Computer Network, South China University of Technology, Guangzhou 510641, Guangdong, China

Received: 2006-03-29 Revised: 1900-01-01 Online: 2006-10-24 Published: 2006-10-24
Corresponding author: HU Jungang

Abstract: Entry page (home page) search aims to retrieve the single correct page, and queries are usually short page names. Because the task demands very high precision, it is generally considered difficult. Using a unigram language model analysis, the authors identify the fields of Web page content that are useful for finding Chinese entry pages and use them as the content fields for baseline retrieval. They then build a retrieval model that combines these content fields with non-content features of Web pages, such as the URL-type prior (URLtype prior), which proved to have the strongest predictive power. Statistics on the URL-type prior probabilities reveal a strong relationship between an entry page and its relevant subpages. Based on this relationship, the authors propose PERS (page extracted from relevant subpage), an algorithm that extracts the entry page from relevant subpages and reranks the results. Comparative experiments show that PERS considerably improves entry page retrieval performance.

Key words: entry page retrieval, URL-type prior, information retrieval
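
The abstract does not spell out how URL types are defined or how PERS works internally, so the following is only a minimal sketch of the two ideas it names: adding a URL-type prior to a content score, and letting relevant subpages promote the pages above them. The four-way split into root, subroot, path and file, the prior values in `ASSUMED_PRIOR`, the `example.com` URLs, and every function name are assumptions made for illustration; none of them are taken from the paper.

```python
import math
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical values of P(entry page | URL type); illustrative only,
# not figures reported by the authors.
ASSUMED_PRIOR = {"root": 0.6, "subroot": 0.2, "path": 0.15, "file": 0.05}
INDEX_NAMES = {"index.html", "index.htm", "default.htm"}


def url_type(url):
    """Classify a URL as root, subroot, path, or file by the shape of its path."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # A trailing index file does not change the type of its directory.
    if segments and segments[-1].lower() in INDEX_NAMES:
        segments.pop()
    if not segments:
        return "root"
    if "." in segments[-1]:  # assumption: a dot in the last segment means a file
        return "file"
    return "subroot" if len(segments) == 1 else "path"


def combined_score(content_log_prob, url):
    """Content evidence plus prior: log P(Q|D) + log P(entry | URL type)."""
    return content_log_prob + math.log(ASSUMED_PRIOR[url_type(url)])


def parent_candidates(url):
    """Yield the directory URLs above a page, nearest first, up to the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    for depth in range(len(segments) - 1, -1, -1):
        path = "/".join(segments[:depth])
        yield f"{parts.scheme}://{parts.netloc}/{path + '/' if path else ''}"


def promote_entry_pages(ranked):
    """Rerank so that each retrieved page also votes for the pages above it.

    `ranked` is a list of (url, content_log_prob) pairs from a baseline run.
    This is a guess at the general shape of subpage-based extraction,
    not the PERS algorithm itself.
    """
    votes = defaultdict(float)
    for url, log_prob in ranked:
        weight = math.exp(log_prob)
        votes[url] += weight
        for parent in parent_candidates(url):
            votes[parent] += weight
    scored = {
        u: math.log(v) + math.log(ASSUMED_PRIOR[url_type(u)])
        for u, v in votes.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


baseline = [
    ("http://example.com/cs/news/a.html", -2.0),
    ("http://example.com/cs/news/b.html", -2.5),
    ("http://example.com/cs/people.html", -3.0),
]
reranked = promote_entry_pages(baseline)
```

In this sketch, directory pages that the baseline run never returned can still enter the ranking because their subpages matched the query. A real system would also need URL normalization (for example, treating `/cs/` and `/cs/index.html` as the same page) and priors estimated from training data.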

HU Jungang, DONG Shou-bin, CHEN Xiao-zhi, ZHANG Yuan-feng. Entry page search algorithm based on URL-type prior probabilities[J]. J4, 2006, 41(3): 76-80.

