Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (03): 20-27. DOI: 10.6040/j.issn.1671-9352.3.2014.101
马宇峰, 阮彤
MA Yu-feng, RUAN Tong
Abstract: Entity set expansion takes a small number of example entities (seeds) from a target category and expands them into a larger set of entities belonging to that category. Traditional entity set expansion methods rely mainly on co-occurrence relations between entities and expand the set iteratively according to pairwise similarity, which causes semantic drift and degrades precision. To address this, a method is proposed that first captures the semantics of the seed set with the LDA (latent Dirichlet allocation) topic model and then expands the entity set by label propagation. By modeling the semantics of the entity list as a whole, the method avoids the ambiguity that individual words may introduce; by mining the contextual topics of the entity list with LDA, it enriches the semantic information available during expansion and mitigates semantic drift. Experiments on real-world data sets show good results and demonstrate the effectiveness of the proposed method.
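The pipeline described in the abstract (LDA topic representations of the seed and candidate entity contexts, followed by label propagation from the seeds) might be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy entity contexts, the gensim LDA settings, the RBF topic-similarity graph, and the clamped propagation loop are all assumptions standing in for the paper's corpus processing and graph construction.

# Hedged sketch of the abstract's pipeline: LDA topics over entity contexts,
# then label propagation from the seed entities. Toy data and parameters only.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.metrics.pairwise import rbf_kernel

# Toy context documents: one token list per entity (seeds and candidates).
entity_contexts = {
    "Beijing":  ["capital", "city", "china", "municipality"],
    "Shanghai": ["city", "china", "port", "municipality"],
    "Tianjin":  ["city", "china", "port"],
    "apple":    ["fruit", "tree", "orchard"],
    "pear":     ["fruit", "tree", "orchard"],
}
seeds = {"Beijing", "Shanghai"}  # known members of the target category

entities = list(entity_contexts)
texts = [entity_contexts[e] for e in entities]

# Step 1: train LDA on the contexts and represent each entity
# by its topic distribution (the semantic information of the entity list).
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)

def topic_vector(bow):
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

X = np.vstack([topic_vector(bow) for bow in corpus])

# Step 2: label propagation (in the spirit of Zhu & Ghahramani):
# build a topic-similarity graph, row-normalize it, and iteratively
# spread the seed label while clamping the seeds to 1.
W = rbf_kernel(X, gamma=5.0)             # affinity between topic vectors
np.fill_diagonal(W, 0.0)
T = W / W.sum(axis=1, keepdims=True)     # row-normalized transition matrix

f = np.array([1.0 if e in seeds else 0.0 for e in entities])
seed_mask = f == 1.0
for _ in range(50):
    f = T @ f
    f[seed_mask] = 1.0                   # clamp seed labels each iteration

# Candidates ranked by propagated membership score.
for e, score in sorted(zip(entities, f), key=lambda p: -p[1]):
    if e not in seeds:
        print(f"{e}\t{score:.3f}")

Candidates whose contexts share topics with the seed list (here, the city contexts) receive higher propagated scores than topically unrelated candidates, which is the effect the abstract attributes to combining list-level topic information with label propagation.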