
Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (03): 20-27. doi: 10.6040/j.issn.1671-9352.3.2014.101

• Articles •

Entity set expansion based on LDA and label propagation

MA Yu-feng, RUAN Tong

  1. Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2014-08-28  Revised: 2014-11-25  Online: 2015-03-20  Published: 2015-03-13
  • Corresponding author: RUAN Tong (1973- ), female, Ph.D., associate professor; research interests: natural language processing and data mining. E-mail: ruantong@ecust.edu.cn
  • About the authors: MA Yu-feng (1990- ), male, master's student; research interests: natural language processing and data mining. E-mail: mafing@qq.com
  • Supported by:
    National Key Technology R&D Program of China (2013BAH11F03)


Abstract: Set expansion refers to expanding a partial set of "seed" objects into a more complete set. A widely employed approach to set expansion is iterative bootstrapping, which can be applied with only small amounts of supervision and scales well to very large corpora. A well-known problem with iterative bootstrapping is the phenomenon of semantic drift: as bootstrapping proceeds, unreliable patterns are likely to produce false extractions. To address this issue, a hybrid method for entity set expansion based on LDA and label propagation was proposed. The entities in an entity list were considered as a whole to avoid the ambiguity of individual words, and the LDA model was used to mine the semantic information in the contexts of entity lists to resolve the semantic drift phenomenon. Experiments were conducted on real datasets, and the evaluation demonstrates the effectiveness, efficiency, and scalability of the proposed solution.

Key words: topic model, seed, LDA, label propagation, entity set expansion
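The expansion step summarized in the abstract can be sketched as label propagation over an entity-similarity graph, in the spirit of reference [11] (Zhu & Ghahramani). This is a minimal illustrative toy, not the paper's implementation: the entities, the co-occurrence counts, the iteration count, and the clamping schedule are all invented, and the LDA topic features that the proposed method adds on top of this step are omitted.

```python
import numpy as np

# Toy entity-context co-occurrence matrix (rows: entities, columns: context
# patterns). All entities and counts are hypothetical illustrations.
entities = ["Beijing", "Shanghai", "Guangzhou", "apple", "banana"]
X = np.array([
    [5, 3, 0, 0],   # Beijing
    [4, 4, 1, 0],   # Shanghai
    [3, 5, 0, 1],   # Guangzhou
    [0, 1, 6, 4],   # apple
    [0, 0, 5, 5],   # banana
], dtype=float)

# Cosine-similarity graph over entities (self-loops removed).
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = unit @ unit.T
np.fill_diagonal(S, 0.0)

# Row-normalized transition matrix for the propagation.
T = S / S.sum(axis=1, keepdims=True)

# Seed labels: "Beijing" and "Shanghai" belong to the target class (city).
seeds = [0, 1]
f = np.zeros(len(entities))
f[seeds] = 1.0

for _ in range(50):
    f = T @ f          # propagate class scores along graph edges
    f[seeds] = 1.0     # clamp the labeled seed entities at 1

# Rank unlabeled entities by propagated score; cities should surface first.
ranking = sorted(zip(entities, f), key=lambda p: -p[1])
print(ranking)
```

Entities whose contexts resemble the seeds' contexts accumulate high scores and are admitted into the expanded set; in the actual method, topic distributions inferred by LDA would additionally constrain which contexts count as similar, mitigating semantic drift.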

CLC number: TP391
[1] WANG R C, COHEN W W. Language-independent set expansion of named entities using the web[C]// Proceedings of the 7th IEEE International Conference on Data Mining (ICDM'07). Piscataway: IEEE, 2007: 342-350.
[2] WANG R C, COHEN W W. Iterative set expansion of named entities using the web[C]// Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08). Piscataway: IEEE, 2008: 1091-1096.
[3] WANG R C, COHEN W W. Character-level analysis of semi-structured documents for set expansion[C]// Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2009: 1503-1512.
[4] HE Yeye, DONG Xin. SEISA: set expansion by iterative similarity aggregation[C]// Proceedings of the 20th International Conference on World Wide Web. New York: ACM, 2011: 427-436.
[5] LI Xiaoli, ZHANG Lei, LIU Bing, et al. Distributional similarity vs. PU learning for entity set expansion[C]// Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2010: 359-364.
[6] QI Zhenyu, LIU Kang, ZHAO Jun. A novel entity set expansion method leveraging entity semantic knowledge[J]. Journal of Chinese Information Processing, 2013, 27(2): 1-9.
[7] SADAMITSU K, SAITO K, IMAMURA K, et al. Entity set expansion using topic information[C]// Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2011: 726-731.
[8] SADAMITSU K, SAITO K, IMAMURA K, et al. Entity set expansion using interactive topic information[C]// Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation. Somerset: ACL, 2012: 108-116.
[9] JINDAL P, ROTH D. Learning from negative examples in set-expansion[C]// Proceedings of the IEEE 11th International Conference on Data Mining. Washington: IEEE Computer Society, 2011: 1110-1115.
[10] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[11] ZHU Xiaojin, GHAHRAMANI Zoubin. Learning from labeled and unlabeled data with label propagation[R]. Pittsburgh: Carnegie Mellon University, 2002.
[12] ZHANG Huaping, LIU Qun, CHENG Xueqi, et al. Chinese lexical analysis using hierarchical hidden Markov model[C]// Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Stroudsburg: Association for Computational Linguistics, 2003: 63-70.
[13] WENG Jianshu, LIM E P, JIANG Jing, et al. TwitterRank: finding topic-sensitive influential twitterers[C]// Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. New York: ACM, 2010: 261-270.