Journal of Shandong University (Natural Science) ›› 2014, Vol. 49 ›› Issue (12): 23-29. doi: 10.6040/j.issn.1671-9352.3.2014.123
王辉, 陈光
WANG Hui, CHEN Guang
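Abstract: A Bootstrapping-based method is proposed for extracting product aspect (attribute) words from English text. Starting from a small number of seed templates, the method discovers new attribute words through an incremental, iterative process. In each iteration it applies statistical techniques, combined with analysis of sentiment words drawn from a sentiment lexicon, and uses the affinity between attribute words and templates to compute a probability score for each extracted attribute word; the candidate attribute words are then ranked and filtered. For the resulting attribute word set, WordNet is used to compute the similarity between attribute words, and the words are clustered by these scores into clusters of attribute words for different product aspects; low-scoring clusters are discarded to remove further noise. In addition, seed templates are used in place of seed attribute words to improve the portability of the system. Experimental results show that the method extracts product aspect attribute words with a precision of 0.799, a recall of 0.779, and a harmonic mean (F-measure) of 0.789, indicating good extraction performance.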
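The iterative loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the template form (a left/right token pair), the toy sentiment lexicon, the scoring weights (1.0 for a match in a sentence containing a sentiment word, 0.5 otherwise), the threshold, and all function names are assumptions made for illustration, and the WordNet clustering step is omitted. The `f_measure` helper only checks the harmonic mean reported in the abstract.

```python
from collections import Counter

# Toy sentiment lexicon (assumption; the paper uses a real sentiment dictionary).
SENTIMENT = {"great", "poor", "sharp", "weak"}


def contexts(sentences):
    """Return (word, (left, right), has_sentiment) for every interior token."""
    out = []
    for sent in sentences:
        toks = sent.lower().split()
        has_sent = any(t in SENTIMENT for t in toks)
        for i in range(1, len(toks) - 1):
            out.append((toks[i], (toks[i - 1], toks[i + 1]), has_sent))
    return out


def bootstrap(sentences, seed_patterns, rounds=3, min_score=1.0, top_patterns=2):
    """Grow attribute words and templates from seed templates (hypothetical scoring)."""
    patterns = set(seed_patterns)
    aspects = set()
    occurrences = contexts(sentences)
    for _ in range(rounds):
        # Score each candidate by its matches against the known templates,
        # weighting matches in sentences that contain a sentiment word.
        scores = Counter()
        for word, ctx, has_sent in occurrences:
            if ctx in patterns and word not in SENTIMENT:
                scores[word] += 1.0 if has_sent else 0.5
        new_aspects = {w for w, s in scores.items() if s >= min_score} - aspects
        if not new_aspects:
            break  # nothing new was found, so stop iterating
        aspects |= new_aspects
        # Induce new templates from other contexts of the accepted words.
        ctx_counts = Counter(
            ctx for word, ctx, _ in occurrences
            if word in aspects and ctx not in patterns
        )
        patterns |= {ctx for ctx, _ in ctx_counts.most_common(top_patterns)}
    return aspects, patterns


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reviews = [
    "the screen is great",
    "the battery is poor",
    "its battery seems weak",
    "its lens seems sharp",
    "the weather is cloudy",  # no sentiment word, so it scores below the threshold
]
aspects, patterns = bootstrap(reviews, {("the", "is")})
print(sorted(aspects))
print(round(f_measure(0.799, 0.779), 3))
```

In this toy run the seed template ("the", "is") first yields "screen" and "battery"; the second context of "battery" then becomes a new template ("its", "seems"), which in turn extracts "lens". Starting from templates rather than seed words means the same seeds can be reused on reviews of a different product, which is the portability argument the abstract makes.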