JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2015, Vol. 50 ›› Issue (03): 6-10. doi: 10.6040/j.issn.1671-9352.3.2014.284

Weibo new word recognition combining frequency characteristic and accessor variety

ZHOU Chao, YAN Xin, YU Zheng-tao, HONG Xu-dong, XIAN Yan-tuan   

  1. School of Information Engineering and Automation of Computer Science, Kunming University of Science and Technology; Key Lab of Computer Technologies Application of Yunnan Province and Kunming, Kunming 650500, Yunnan, China
  • Received:2014-09-19 Revised:2015-01-16 Online:2015-03-20 Published:2015-03-13

Abstract: With the rapid development of Weibo, many new words have appeared. These words spread quickly and combine flexibly with other words, so they are easily split into several separate strings during word segmentation. A new word recognition method that combines word frequency characteristics with accessor variety is therefore proposed. First, a large-scale set of Weibo sentences is segmented into words, and each pair of adjacent strings lying between stop words is merged. New word candidate strings are then selected according to the frequency of the merged strings. The candidate strings are filtered with word formation rules to obtain the candidate new words. Finally, the accessor variety of each candidate is used to remove garbage strings, leaving the new words. Experiments on new word recognition in COAE 2014 Task 3 show that the accuracy reaches 36.5% and that the method performs well.
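
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the thresholds `min_freq` and `min_av`, and the toy sentences are all assumptions made for illustration. The word formation rule filter is omitted because the abstract does not specify the rules, and sentence boundaries are counted as a single neighbor marker, which simplifies the usual accessor variety definition.

```python
from collections import Counter

def candidate_strings(segmented, stop_words, min_freq=2):
    """Merge adjacent tokens not separated by a stop word; keep frequent merges."""
    counts = Counter()
    for tokens in segmented:
        for left, right in zip(tokens, tokens[1:]):
            if left in stop_words or right in stop_words:
                continue  # a stop word breaks the span, so do not merge across it
            counts[left + right] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}

def accessor_variety(candidate, corpus):
    """AV = min(distinct left neighbors, distinct right neighbors) of a string."""
    left, right = set(), set()
    for text in corpus:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            # Assumption: sentence start/end each count as one marker.
            left.add(text[start - 1] if start > 0 else "<S>")
            right.add(text[end] if end < len(text) else "<E>")
            start = text.find(candidate, start + 1)
    return min(len(left), len(right))

def recognize_new_words(segmented, corpus, stop_words, min_freq=2, min_av=2):
    """Frequency filter followed by an accessor variety filter (rules omitted)."""
    candidates = candidate_strings(segmented, stop_words, min_freq)
    return [c for c in candidates if accessor_variety(c, corpus) >= min_av]

# Toy data (made up): a segmenter has split the word "给力" into "给" + "力".
corpus = ["今天给力啊", "真给力的比赛", "太给力了"]
segmented = [["今天", "给", "力", "啊"],
             ["真", "给", "力", "的", "比赛"],
             ["太", "给", "力", "了"]]
stop_words = {"的", "了", "啊"}

new_words = recognize_new_words(segmented, corpus, stop_words)
```

In this toy corpus only the merged string "给力" passes the frequency filter, and because it appears with three different characters on each side it also passes the accessor variety filter. A string such as "今天给" has a single right neighbor, so a low accessor variety would flag it as a garbage string.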

Key words: Weibo new words, string frequency statistics, accessor variety, word formation rules

CLC Number: TP391