
Journal of Shandong University (Natural Science), 2026, Vol. 61, Issue 3: 66-74. doi: 10.6040/j.issn.1671-9352.1.2024.061

Method for verbose query reduction by integrating key and latent concepts

ZHU Mingyang1,2, HUANG Yuxin1,2, YU Zhengtao1,2*

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China;
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
  • Published: 2026-03-18
  • Corresponding author: YU Zhengtao (born 1970), male, professor, doctoral supervisor, Ph.D.; research interests: natural language processing, information retrieval, and machine translation. E-mail: ztyu@hotmail.com
  • First author: ZHU Mingyang (born 2001), male, master's student; research interests: natural language processing and information retrieval. E-mail: 969988932@qq.com
  • Funding: National Natural Science Foundation of China (62266027, U21B2027, U23A20388); Yunnan Provincial Major Science and Technology Project (202302AD080003, 202303AP140008); Yunnan Provincial Major Basic Research Project (202401BC070021); Kunming University of Science and Technology "Double First-Class" Construction Joint Special Project (202201BE070001-021)


Abstract: Query reduction aims to enhance retrieval recall and precision by simplifying and condensing lengthy queries while retaining their key information. Traditional methods typically rely on statistical approaches or pre-trained models to extract keywords from a verbose query as the retrieval input. However, these methods struggle with query complexity (e.g., synonyms and polysemes) and easily lose crucial information while trying to retain the query's core content. To address these issues, a verbose query reduction method integrating key and latent concepts is proposed. It combines key concepts, which represent the core content of the query, with latent concepts, which are important for understanding the query but not explicitly expressed, to generate a more complete and effective query. Specifically, a pre-trained model first generates a concise, effective query as the key concepts; a pseudo-relevance feedback method then mines latent concepts from the set of documents relevant to the original query; finally, the two are aggregated as the final reduced query for retrieval. Experimental results on the Robust2004 dataset with a dense retrieval model show that, compared with baseline models, the proposed method improves R@1000 and NDCG@10 by 2.1% and 3.6%, respectively.
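The pipeline in the abstract (key concepts from a reduction model, latent concepts from pseudo-relevance feedback, then aggregation) can be sketched with toy stand-ins. This is a minimal illustration, not the paper's implementation: `key_concepts` uses a crude term-scoring heuristic in place of the pre-trained reduction model, `latent_concepts` mines frequent non-query terms from the top-ranked documents, and all names and the tiny corpus are invented for the example.

```python
from collections import Counter

STOP = {"the", "a", "an", "of", "in", "for", "to", "and", "on",
        "about", "with", "find"}

def key_concepts(query, k=4):
    """Stand-in for the pre-trained reduction model: keep the k
    longest non-stopword terms as a crude proxy for informativeness."""
    terms = [t.lower() for t in query.split() if t.lower() not in STOP]
    scored = sorted(set(terms), key=lambda t: (-len(t), t))
    return scored[:k]

def latent_concepts(query_terms, corpus, top_docs=2, k=3):
    """Pseudo-relevance feedback: rank documents by term overlap with
    the reduced query, then mine frequent terms absent from it."""
    def score(doc):
        words = doc.lower().split()
        return sum(words.count(t) for t in query_terms)
    ranked = sorted(corpus, key=score, reverse=True)[:top_docs]
    counts = Counter(w for d in ranked for w in d.lower().split()
                     if w not in query_terms and w not in STOP)
    return [t for t, _ in counts.most_common(k)]

corpus = [
    "neural retrieval models rank documents with dense embeddings",
    "pseudo relevance feedback expands queries with feedback terms",
    "cooking recipes for pasta",
]
verbose = "find papers about pseudo relevance feedback for query expansion in retrieval"
keys = key_concepts(verbose)
latents = latent_concepts(set(keys), corpus)
reduced = keys + latents  # aggregate key + latent concepts as the final query
print(reduced)
# ['expansion', 'relevance', 'retrieval', 'feedback', 'pseudo', 'expands', 'queries']
```

A real system would replace `key_concepts` with the generative model's output and run the feedback step against a dense retriever rather than raw term counts.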

Key words: information retrieval, verbose query, query reduction, key concept, latent concept
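The two reported metrics, R@1000 and NDCG@10, have standard definitions that are easy to compute directly. A minimal sketch follows; the document IDs and relevance judgments are invented for the example, and a real TREC-style evaluation would run trec_eval over the Robust2004 qrels.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """R@k: fraction of all relevant documents found in the top-k ranking."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k: discounted cumulative gain at k, normalized by the
    ideal ranking's DCG (log2 position discount)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

rels = {"d1": 1, "d3": 1}          # binary relevance judgments
run = ["d1", "d2", "d3", "d4"]     # system ranking
print(recall_at_k(run, set(rels), k=3))  # 1.0: both relevant docs in top 3
print(round(ndcg_at_k(run, rels, k=3), 3))
```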

CLC number: 

  • TP391
[1] KIM H, CHOI M, LEE S, et al. ConQueR: contextualized query reduction using search logs[C] //Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. Taipei: ACM, 2023:1899-1903.
[2] CAMPOS R, MANGARAVITE V, PASQUALI A, et al. YAKE! Keyword extraction from single documents using multiple local features[J]. Information Sciences, 2020, 509:257-289.
[3] DEVLIN J, CHANG M-W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C] //Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: ACL, 2019:4171-4186.
[4] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of Machine Learning Research, 2020, 21(140):1-67.
[5] HUSTON S, CROFT W B. Evaluating verbose query processing techniques[C] //Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Shanghai: ACM, 2010:291-298.
[6] CHAA M, NOUALI O, BELLOT P. New technique to deal with verbose queries in social book search[C] //Proceedings of the International Conference on Web Intelligence. Jinan: IEEE, 2017:799-806.
[7] KUMARAN G, CARVALHO V R. Reducing long queries using query quality predictors[C] //Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Fuji: ACM, 2009:564-571.
[8] ROUSSEAU F, VAZIRGIANNIS M. Main core retention on graph-of-words for single-document keyword extraction[C] //Advances in Information Retrieval: 37th European Conference on IR Research. Vienna: Springer, 2015:382-393.
[9] BOUGOUIN A, BOUDIN F, DAILLE B. TopicRank: graph-based topic ranking for keyphrase extraction[C] //International Joint Conference on Natural Language Processing(IJCNLP). Nagoya: ACL, 2013:543-551.
[10] PODDER D, PAIK J H, MITRA P. Neural language model based attentive term dependence model for verbose query(student abstract)[C] //Proceedings of the AAAI Conference on Artificial Intelligence. Washington: AAAI, 2023:16300-16301.
[11] PRIYANSHU A, VIJAY S. AdaptKeyBERT: an attention-based approach towards few-shot & zero-shot domain adaptation of KeyBERT[EB/OL].(2022-11-16)[2024-09-15]. https://arxiv.org/abs/2211.07499.
[12] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C] //Advances in Neural Information Processing Systems. Long Beach: NIPS, 2017:5998-6008.
[13] KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C] //Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle: ACM, 2020:39-48.
[14] VOORHEES E M. Overview of the TREC 2004 robust track[C] // Text Retrieval Conference. Washington: NIST, 2004:1-12.
[15] ZHAI C, LAFFERTY J. Model-based feedback in the language modeling approach to information retrieval[C] //Proceedings of the Tenth International Conference on Information and Knowledge Management. Atlanta: ACM, 2001:403-410.
[16] YU H C, XIONG C, CALLAN J. Improving query representations for dense retrieval with pseudo relevance feedback[C] //Proceedings of the 30th ACM International Conference on Information & Knowledge Management. Seattle: ACM, 2021: 3592-3596.
[17] KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL].(2015-07-23)[2024-09-15]. https://arxiv.org/abs/1412.6980.
[18] JIN Xiaobo, GENG Guanggang, XIE Guosen, et al. Approximately optimizing NDCG using pair-wise loss[J]. Information Sciences, 2018, 453:50-65.
[19] GROS D, HABERMANN T, KIRSTEIN G, et al. Anaphora resolution: analysing the impact on mean average precision and detecting limitations of automated approaches[J]. International Journal of Information Retrieval Research, 2018, 8(3):33-45.