Automatic text classification model based on random forest

ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin

School of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China

Received: 2006-03-29  Online: 2006-10-24  Published: 2006-10-24
Corresponding author: WANG Ming-wen

Abstract

With the rapid growth of the World Wide Web, text classification has become a key technology for organizing and processing large volumes of document data. A random forest is an ensemble of decision trees in which the construction of each tree is governed by a random vector sampled independently and from the same distribution for all trees in the forest. As the number of trees in the forest grows, the generalization error of the forest converges to a limit. The random forest model is applied to text classification, and experiments on the Reuters-21578 data set show that it classifies well and performs stably. It is also compared with four typical text classifiers, C4.5, KNN, SMO and SVM; the results show that it outperforms C4.5 and is comparable to KNN, SMO and SVM. It is a promising technique for text categorization.

Key words: text classification, random forest, generalization error, decision tree
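
The abstract describes a random forest as an ensemble of decision trees whose construction is driven by a random vector. A minimal sketch of that idea for text, using only the Python standard library, is given below. It is not the authors' implementation: the function names (`tokenize`, `gini`, `build_tree`, `predict_tree`, `train_forest`, `predict`), the term-presence features, the Gini split criterion, the square-root feature subset size, and the eight-sentence toy corpus are all assumptions made for illustration, and the toy corpus is not Reuters-21578.

```python
import random
from collections import Counter

def tokenize(text):
    """Represent a document by the set of lowercase terms it contains."""
    return set(text.lower().split())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def build_tree(docs, labels, vocab, n_try, rng):
    """Grow an unpruned tree; each node tests one term picked from a random subset."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    best = None
    for term in rng.sample(vocab, min(n_try, len(vocab))):
        has = [y for d, y in zip(docs, labels) if term in d]
        lacks = [y for d, y in zip(docs, labels) if term not in d]
        if not has or not lacks:
            continue  # this term does not separate anything here
        score = (len(has) * gini(has) + len(lacks) * gini(lacks)) / len(labels)
        if best is None or score < best[0]:
            best = (score, term)
    if best is None:
        return Counter(labels).most_common(1)[0][0]  # no useful split: majority leaf
    term = best[1]
    yes = [(d, y) for d, y in zip(docs, labels) if term in d]
    no = [(d, y) for d, y in zip(docs, labels) if term not in d]
    return (term,
            build_tree([d for d, _ in yes], [y for _, y in yes], vocab, n_try, rng),
            build_tree([d for d, _ in no], [y for _, y in no], vocab, n_try, rng))

def predict_tree(node, doc):
    """Walk from the root to a leaf; internal nodes are (term, yes, no) tuples."""
    while isinstance(node, tuple):
        term, yes, no = node
        node = yes if term in doc else no
    return node

def train_forest(texts, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample with random term subsets per node."""
    rng = random.Random(seed)
    docs = [tokenize(t) for t in texts]
    vocab = sorted(set().union(*docs))
    n_try = max(1, int(len(vocab) ** 0.5))  # assumed subset size: sqrt(|vocab|)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]  # sample with replacement
        forest.append(build_tree([docs[i] for i in idx],
                                 [labels[i] for i in idx], vocab, n_try, rng))
    return forest

def predict(forest, text):
    """Classify a document by majority vote over all trees."""
    doc = tokenize(text)
    return Counter(predict_tree(t, doc) for t in forest).most_common(1)[0][0]

# Hypothetical toy corpus with two made-up topic labels.
TRAIN = [
    ("wheat harvest rises as farmers export more grain", "grain"),
    ("corn and wheat prices fall after record crop", "grain"),
    ("grain shipments delayed by port strike", "grain"),
    ("farmers plant more corn this season", "grain"),
    ("company reports higher quarterly profit and dividend", "earn"),
    ("net profit falls as quarterly revenue drops", "earn"),
    ("board declares dividend after strong earnings", "earn"),
    ("earnings per share rise on higher revenue", "earn"),
]
texts, labels = zip(*TRAIN)
forest = train_forest(list(texts), list(labels))
print(predict(forest, "wheat and corn crop forecast"))
```

The two sources of randomness in this sketch are the bootstrap sample drawn for each tree and the random term subset examined at each node; together they play the role of the random vector mentioned in the abstract. Raising `n_trees` makes the majority vote more stable, which is the practical side of the statement that the generalization error converges as the forest grows.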

CLC number: TP18

ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin. Automatic text classification model based on random forest[J]. J4, 2006, 41(3): 139-143.
