Journal of Shandong University (Natural Science), 2026, Vol. 61, Issue (3): 66-74. doi: 10.6040/j.issn.1671-9352.1.2024.061
ZHU Mingyang 1,2, HUANG Yuxin 1,2, YU Zhengtao 1,2*
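Abstract: Query reduction aims to improve the recall and precision of retrieval results by simplifying and refining verbose query input while preserving its key information. However, traditional methods typically extract keywords from the verbose query as retrieval input, using either statistical techniques or pre-trained models. They struggle with the complexity of queries (such as synonymy and polysemy) and tend to lose key information when trying to preserve the query's core content. To address these problems, a verbose query reduction method that fuses key concepts and latent concepts is proposed. Key concepts, which represent the core content of the query, are combined with latent concepts, which are important for understanding the query but are not explicitly expressed, to generate a more complete and effective query. Specifically, a pre-trained model is first used to generate a short, effective query that serves as the key concepts. A pseudo-relevance feedback method is then used to mine latent concepts from the set of documents relevant to the original query. Finally, the two are aggregated into the final reduced query, which is used for verbose query retrieval. Experimental results show that, when evaluated with a dense retrieval model on the Robust2004 dataset, the proposed method improves R@1000 and NDCG@10 by 2.1% and 3.6%, respectively, over the baseline model.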
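The three steps described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the key concepts are supplied by hand rather than generated by a pre-trained model, a simple term-overlap ranker stands in for the retriever, and TF-IDF scoring stands in for whatever pseudo-relevance feedback weighting the paper actually uses. All function names, the stopword list, and the example corpus are hypothetical.

```python
from collections import Counter
import math
import re

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def retrieve(query, corpus, k=2):
    """Stand-in retriever: rank documents by number of shared query terms."""
    q = set(tokenize(query))
    ranked = sorted(corpus, key=lambda d: -len(q & set(tokenize(d))))
    return ranked[:k]


def mine_latent_concepts(feedback_docs, corpus, key_concepts, top_n=2):
    """Pseudo-relevance feedback: score terms of the top-ranked documents
    by TF-IDF and keep the best ones that are not already key concepts."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    tf = Counter()
    for doc in feedback_docs:
        tf.update(tokenize(doc))
    seen = set(key_concepts)
    scores = {
        term: count * math.log((len(corpus) + 1) / (df[term] + 1))
        for term, count in tf.items()
        if term not in seen
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


def reduce_query(key_concepts, latent_concepts):
    """Aggregate key and latent concepts into one query, without duplicates."""
    terms = []
    for term in list(key_concepts) + list(latent_concepts):
        if term not in terms:
            terms.append(term)
    return " ".join(terms)


corpus = [
    "solar panels convert sunlight into electricity using photovoltaic cells",
    "photovoltaic cells in solar panels lose efficiency at high temperature",
    "wind turbines generate electricity from moving air",
    "the stock market fell sharply on monday",
]
verbose_query = (
    "I would like to know how solar panels turn sunlight into electricity "
    "and why they get worse when it is hot"
)
# Hand-written stand-in for the short query a pre-trained model would generate.
key_concepts = ["solar", "panels", "electricity"]

feedback_docs = retrieve(verbose_query, corpus, k=2)
latent_concepts = mine_latent_concepts(feedback_docs, corpus, key_concepts)
reduced_query = reduce_query(key_concepts, latent_concepts)
print(reduced_query)
```

In this sketch the feedback step surfaces terms that never appear in the verbose query but recur in its top-ranked documents, which is the role the abstract assigns to latent concepts.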