
Journal of Shandong University (Natural Science) ›› 2014, Vol. 49 ›› Issue (12): 23-29. doi: 10.6040/j.issn.1671-9352.3.2014.123

• Article •

Feature extraction method based on Bootstrapping in English product comment

WANG Hui, CHEN Guang

  1. Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2014-08-28  Revised: 2014-10-21  Online: 2014-12-20  Published: 2014-12-20
  • About the author: WANG Hui (1989- ), male, master's student; research interests include information extraction, natural language processing, machine learning, and data mining. E-mail: zhongguoxin.123@gmail.com
  • Supported by:
    Programme of Introducing Talents of Discipline to Universities (111 Project) (B08004); Science and Technology Major Project (2011ZX03002-005-01); National Natural Science Foundation of China (61273217); Doctoral Program Fund (20130005110004)

Abstract: A Bootstrapping-based method is proposed for extracting feature words (attribute words that describe product aspects) from English product reviews. Starting from a small set of seed extraction patterns, the method applies an incremental iterative procedure to find new features. In each iteration, the system combines statistical techniques with opinion words from a sentiment lexicon and scores each candidate feature by the affinity between the candidate and the patterns; candidates are then ranked and filtered by this score, which helps prevent topic drift. After extraction, WordNet is used to calculate the similarity between features, and the features are clustered by similarity score into groups that correspond to different aspects of the product. Low-scoring clusters are filtered out to further remove noise. In addition, seed patterns are used in place of seed features to improve the portability of the system. Experimental results show that the method achieves a precision of 0.799, a recall of 0.779, and an F-measure of 0.789.
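The iterative procedure described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy corpus `REVIEWS`, the toy lexicon `OPINION_WORDS`, the representation of a pattern as the (previous token, next token) pair around a one-word slot, and the two simple ratio scores that stand in for the paper's affinity score (which the abstract does not define). A fixed threshold replaces the ranking step.

```python
from collections import defaultdict

# Toy sentiment lexicon and review corpus (illustrative assumptions).
OPINION_WORDS = {"great", "good", "poor", "sharp", "weak"}
REVIEWS = [
    "the screen is great",
    "the battery is poor",
    "the lens is sharp",
    "its battery seems weak",
    "its lens seems good",
    "its flash seems weak",
    "the weather is cold",
]


def matches(sentences):
    """Yield (pattern, word) for each inner token of an opinionated sentence.

    A pattern is the (previous token, next token) pair around a one-word slot.
    """
    for sentence in sentences:
        tokens = sentence.split()
        if not OPINION_WORDS.intersection(tokens):
            continue  # assumed filter: sentence must contain a lexicon word
        for i in range(1, len(tokens) - 1):
            yield (tokens[i - 1], tokens[i + 1]), tokens[i]


def bootstrap(sentences, seed_patterns, rounds=5, min_score=0.5):
    """Grow feature words and patterns from seed patterns (hypothetical scoring)."""
    by_pattern = defaultdict(set)  # pattern -> words it extracts
    by_word = defaultdict(set)     # word -> patterns that extract it
    for pattern, word in matches(sentences):
        by_pattern[pattern].add(word)
        by_word[word].add(pattern)

    patterns, features = set(seed_patterns), set()
    for _ in range(rounds):
        # Candidate score: share of the word's contexts that are trusted patterns.
        candidates = set().union(*(by_pattern.get(p, set()) for p in patterns))
        new_features = {
            w for w in candidates
            if len(by_word[w] & patterns) / len(by_word[w]) >= min_score
        }
        # Pattern score: share of the pattern's extractions that were accepted.
        new_patterns = {
            p for p, words in by_pattern.items()
            if len(words & new_features) / len(words) >= min_score
        }
        if new_features <= features and new_patterns <= patterns:
            break  # nothing new was learned in this round
        features |= new_features
        patterns |= new_patterns
    return features, patterns


features, patterns = bootstrap(REVIEWS, {("the", "is")})
print(sorted(features))
print(sorted(patterns))
```

In this toy run the single seed pattern `("the", "is")` first yields `screen`, `battery`, and `lens`; those features then promote the pattern `("its", "seems")`, which in turn yields `flash`. The sentence about the weather contains no lexicon word and is skipped, which is one simple way a sentiment lexicon can limit topic drift.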

Key words: WordNet, bootstrapping, information extraction, feature extraction
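As a consistency check on the reported results, and assuming the F-measure is the balanced harmonic mean of precision $P$ and recall $R$ (the abstract does not state a weighting), the three reported values agree with one another:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.799 \times 0.779}{0.799 + 0.779}
  = \frac{1.244842}{1.578}
  \approx 0.789
```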

CLC Number:

  • TP391