Study on collection statistics for parameter selection in pseudo relevance feedback

Journal of Shandong University (Natural Science), 2016, Vol. 51, Issue (7): 18-22. doi: 10.6040/j.issn.1671-9352.1.2015.031

MENG Ye, ZHANG Peng, SONG Da-wei

Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin 300072, China

Received: 2015-11-14 Online: 2016-07-20 Published: 2016-07-27
About the author: MENG Ye (born 1991), female, master's student; research interest: information retrieval. E-mail: ye.meng04@gmail.com
Supported by:
National Basic Research Program of China (973 Program) (2013CB329304, 2014CB744604); National Natural Science Foundation of China (61402324, 61272265); Tianjin Research Program of Application Foundation and Advanced Technology (15JCQNJC41700)

Abstract

Abstract: Pseudo-relevance feedback (PRF) is an effective query expansion technique for improving ad hoc retrieval performance. A central but difficult problem in PRF is how to set the balance parameter that weights the original query model against the feedback model. Existing feedback methods usually fix this parameter at a single empirical value across all collections. Because collections differ, however, the parameter should be tuned per collection. This paper looks for clues to guide that tuning by analyzing the statistical features of collections. We investigate the dependency between the optimal parameter and several collection statistics, including the standard deviation of document length (Dev(dl)) and the proportion of low-frequency terms in the collection (LFT-C) and in the expansion terms. Experiments on six standard TREC collections show that the higher LFT-C and Dev(dl) are, the more weight the original query model should be given.

Key words: information retrieval, pseudo-relevance feedback, collection characteristics
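
As a rough illustration of the quantities the abstract refers to, the following Python sketch shows a linear interpolation between an original query model and a feedback model, together with two simple collection statistics. Everything here is an assumption made for illustration: the function names, the convention that `alpha` is the weight of the feedback model, the use of the population standard deviation of token counts for document length, and the threshold `max_count` that decides what counts as a low-frequency term. The paper's actual definitions may differ.

```python
from collections import Counter
from statistics import pstdev


def interpolate(query_model, feedback_model, alpha):
    """Mix two term distributions: (1 - alpha) * query + alpha * feedback.

    Assumed convention: alpha is the weight of the feedback model, so a
    smaller alpha gives more weight to the original query.
    """
    terms = set(query_model) | set(feedback_model)
    return {
        t: (1 - alpha) * query_model.get(t, 0.0)
        + alpha * feedback_model.get(t, 0.0)
        for t in terms
    }


def doc_length_deviation(docs):
    """Population standard deviation of document length, in tokens."""
    return pstdev([len(d) for d in docs])


def low_freq_term_ratio(docs, max_count=1):
    """Share of distinct terms whose collection frequency is <= max_count.

    The threshold max_count is a hypothetical choice for this sketch.
    """
    counts = Counter(t for d in docs for t in d)
    if not counts:
        return 0.0
    low = sum(1 for c in counts.values() if c <= max_count)
    return low / len(counts)


if __name__ == "__main__":
    # Toy collection: each document is a list of tokens.
    docs = [["a", "b"], ["a", "c", "d", "a"]]
    print(doc_length_deviation(docs))
    print(low_freq_term_ratio(docs))
    print(interpolate({"x": 1.0}, {"x": 0.5, "y": 0.5}, alpha=0.5))
```

Under these assumptions, the abstract's conclusion would correspond to choosing a smaller `alpha` on collections where both statistics are high.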

CLC number: TP393

MENG Ye, ZHANG Peng, SONG Da-wei. Study on collection statistics for parameter selection in pseudo relevance feedback[J]. Journal of Shandong University (Natural Science), 2016, 51(7): 18-22.

