Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (7): 23-29. doi: 10.6040/j.issn.1671-9352.1.2015.069

An ontology-based readability model for vertical search

ZHANG Wen-ya, SONG Da-wei*, ZHANG Peng

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
  • Received: 2015-11-14 Online: 2016-07-20 Published: 2016-07-27
  • Corresponding author: SONG Da-wei (born 1972), male, Ph.D., professor; research interest: information retrieval. E-mail: dawei.song2010@gmail.com
  • About the author: ZHANG Wen-ya (born 1989), female, master's student; research interest: information retrieval. E-mail: wenyazhang@tju.edu.cn
  • Supported by:
    National Basic Research Program of China ("973 Program") (2013CB329304, 2014CB744604); National Natural Science Foundation of China (61402324, 61272265); Doctoral Program Foundation of the Ministry of Education of China (20130032120044)

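Abstract: As an emerging evaluation criterion in information retrieval, readability plays an important role in assessing document relevance, utility, and quality. How to provide users with documents that are both relevant and readable has become a pressing problem in vertical search. To address it, a readability computation model based on ontology structure is proposed. Taking the user's process of abstraction during reading as its background, the model computes text readability at the surface discourse level and at the conceptual level, and introduces three readability indicators: Concept Topography, Concept Scope, and Document Coherence. Specifically, the readability score computed from a single indicator or a combination of indicators is incorporated into a conventional vertical retrieval model to re-rank the initial retrieval results. In the medical domain, user experiments show that readability indicators based on ontology concept sequence information predict the true readability level of documents more effectively than traditional non-sequential indicators. System experiments further show that the readability-based re-ranking model balances document relevance and readability and improves retrieval performance in vertical domains.

Keywords: domain-specific information retrieval, document re-ranking, readability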

Abstract: As an emerging evaluation criterion in information retrieval (IR), readability plays an important role in assessing document relevance, utility, and quality. How to provide different users with relevant and readable documents has become an urgent problem in vertical search. To solve this problem, we propose a new ontology-based readability model. Based on users’ reading process, we measure document readability at the surface and conceptual levels. The model introduces three readability indicators: Concept Topography, Concept Scope, and Document Coherence. Specifically, the readability of a document, computed from individual or combined indicators, is used to re-rank the initial list of documents returned by a conventional search engine. In the medical domain, user-oriented evaluations show that our model correlates well with human judgments in readability prediction. Our model is also competitive with one of the state-of-the-art readability models in system-oriented evaluation.

Key words: readability, documents re-ranking, vertical search
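
The abstract states that a readability score is folded into a conventional retrieval model to re-rank the initial result list, but it does not give the combination rule. The sketch below illustrates one common way this could be done, under stated assumptions: min-max normalization of both scores followed by linear interpolation with a weight. The names `ScoredDoc`, `min_max`, `rerank`, and `weight` are hypothetical and are not taken from the paper, and the readability values are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    relevance: float    # score from the initial retrieval run
    readability: float  # readability score; higher is assumed to mean easier to read


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank(docs, weight=0.5):
    """Re-rank docs by a linear mix of normalized relevance and readability.

    weight=0.0 keeps the pure relevance order; weight=1.0 ranks by
    readability alone. This interpolation is an illustrative assumption,
    not the combination rule used in the paper.
    """
    if not docs:
        return []
    rel = min_max([d.relevance for d in docs])
    read = min_max([d.readability for d in docs])
    combined = [(1.0 - weight) * r + weight * s for r, s in zip(rel, read)]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [docs[i] for i in order]


if __name__ == "__main__":
    initial = [
        ScoredDoc("A", relevance=3.0, readability=0.1),
        ScoredDoc("B", relevance=2.5, readability=0.9),
        ScoredDoc("C", relevance=1.0, readability=0.5),
    ]
    print([d.doc_id for d in rerank(initial, weight=0.5)])
```

With equal weighting, a document that is slightly less relevant but much more readable can move above the top initial hit; tuning `weight` trades relevance against readability.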

CLC number: TP393