
Journal of Shandong University (Natural Science) ›› 2019, Vol. 54 ›› Issue (3): 85-92, 101. doi: 10.6040/j.issn.1671-9352.0.2018.261


Identification of large intergenic non-coding RNAs using random forest

Wei-na XU*, Guang-le ZHANG, Shi-hong LI, Yuan-yuan CHEN, Qiang LI, Tao YANG, Ming-min XU, Ning QIAO, Liang-yun ZHANG

  1. College of Science, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
  • Received: 2018-05-15 Online: 2019-03-01 Published: 2019-03-19
  • Contact: Wei-na XU E-mail: 2015111001@njau.edu.cn; zlyun@njau.edu.cn
  • About the author: Liang-yun ZHANG (born 1965), male, Ph.D., professor; research interest: computational bioinformatics. E-mail: zlyun@njau.edu.cn
  • Supported by:
    National Natural Science Foundation of China (11571173); National Natural Science Foundation of China (11401311); National Natural Science Foundation of China (11601231)
Abstract:

To support further study of the regulatory mechanisms of lincRNAs, we build an efficient lincRNA identification model that can provide a data source for subsequent research. Starting from features such as minimum free energy (MFE) and signal-noise ratio (SNR), we remove redundant features according to their feature contribution and construct a random forest (RF) classification model to identify lincRNAs. Tested on the same experimental dataset, the model reaches a sensitivity, specificity and accuracy of 94.1%, 93.2% and 93.7%, respectively, each higher than the corresponding value for the existing PhyloCSF, LncRNA-ID and CPC methods. The model is robust during identification and identifies lincRNAs accurately.

Key words: long intergenic non-coding RNA, random forest algorithm, minimum free energy, signal-noise ratio

CLC number:

  • Q61

Fig. 1

Frequency spectra of lincRNA (A) and mRNA (B)

Fig. 2

Flowchart of lincRNA identification

Fig. 3

Distribution of lincRNA and mRNA sample sequences with MFE = -500 as the threshold (A: lincRNA sequences; B: mRNA sequences)

Fig. 4

Distribution of lincRNA and mRNA sample sequences with SNR = 3 as the threshold (A: lincRNA sequences; B: mRNA sequences)
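
The SNR feature behind Figs. 1 and 4 is not defined on this page. A common definition in the 3-base periodicity literature takes the Fourier power at frequency N/3 divided by the mean power of the spectrum. The sketch below follows that definition as an assumption, not as the authors' exact procedure; the function names `power_spectrum` and `snr_at_one_third`, and the choice to exclude the k = 0 term from the mean, are illustrative.

```python
import cmath

BASES = "ACGU"

def power_spectrum(seq):
    """Return PS[k], k = 0..N-1: DFT power summed over the four base-indicator sequences."""
    n = len(seq)
    spectrum = []
    for k in range(n):
        total = 0.0
        for base in BASES:
            # DFT coefficient of the 0/1 indicator sequence of `base` at frequency k.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, nt in enumerate(seq)
                if nt == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum

def snr_at_one_third(seq):
    """Assumed SNR: power at k = N/3 divided by the mean power over k = 1..N-1."""
    seq = seq.upper().replace("T", "U")
    seq = seq[: len(seq) - len(seq) % 3]  # truncate so that N/3 is an exact DFT bin
    n = len(seq)
    if n < 3:
        raise ValueError("sequence must contain at least 3 nucleotides")
    spectrum = power_spectrum(seq)
    mean_power = sum(spectrum[1:]) / (n - 1)  # k = 0 (DC term) is excluded
    if mean_power < 1e-12:
        return 0.0  # flat spectrum, e.g. a homopolymer
    return spectrum[n // 3] / mean_power

# Toy sequences, not data from the paper.
periodic = "AUG" * 20      # perfect 3-base periodicity
non_periodic = "ACGU" * 15  # period 4, no power at N/3
print(round(snr_at_one_third(periodic), 2))      # approximately 29.5
print(round(snr_at_one_third(non_periodic), 2))  # approximately 0.0
```

The direct DFT here costs O(N²) per sequence, which is fine for illustration; for full-length transcripts an FFT would be the usual choice. Characters other than A, C, G, U/T are ignored by the indicator sequences.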

Fig. 5

Effect of the number of features on the performance of the RF classification model

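Table 1

Classification results of ELM, SVM and RF with and without feature set F1

| Algorithms | Sn/% | Sp/% | ACC/% | MCC |
|---|---|---|---|---|
| ELM | 67.3 | 91.7 | 79.5 | 0.608 |
| SVM | 89.2 | 87.1 | 88.2 | 0.764 |
| RF | 94.1 | 93.2 | 93.7 | 0.873 |
| ELM-F1 | 65.8 | 86.1 | 75.7 | 0.526 |
| SVM-F1 | 85.7 | 84.3 | 84.8 | 0.696 |
| RF-F1 | 86.3 | 88.0 | 87.6 | 0.685 |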

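Table 2

Classification results under different sampling methods

| Sampling | Sn/% | Sp/% | ACC/% | MCC |
|---|---|---|---|---|
| Under-sampling | 94.10 | 93.20 | 93.70 | 0.873 |
| Unsettled | 86.18 | 96.49 | 93.46 | 0.832 |
| Over-sampling | 94.43 | 91.32 | 92.88 | 0.858 |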
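
The page does not describe how the under- and over-sampling in Table 2 were carried out. As a rough illustration only, the sketch below shows plain random under-sampling (shrink the larger class) and random over-sampling with replacement (grow the smaller class); the function names, the fixed seed and the toy identifiers are all assumptions.

```python
import random

def under_sample(majority, minority, seed=0):
    """Randomly drop majority-class items until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def over_sample(majority, minority, seed=0):
    """Randomly duplicate minority-class items (with replacement) until sizes match."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

# Toy identifiers, not data from the paper; assumes len(majority) >= len(minority).
majority = [f"seq_major_{i}" for i in range(10)]
minority = [f"seq_minor_{i}" for i in range(4)]

maj_u, min_u = under_sample(majority, minority)
maj_o, min_o = over_sample(majority, minority)
print(len(maj_u), len(min_u))  # 4 4
print(len(maj_o), len(min_o))  # 10 10
```

Under-sampling discards data from the larger class, while over-sampling repeats items from the smaller class; either way the classifier is trained on equal class sizes.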

Table 3

Classification results on different datasets

| Dataset | Sn/% | Sp/% | ACC/% | MCC |
|---|---|---|---|---|
| Balanced training dataset | 94.1 | 93.2 | 93.7 | 0.873 |
| Unbalanced training dataset | 93.2 | 91.3 | 91.7 | 0.787 |

Fig. 6

ROC curves of the three classifiers

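Table 4

Comparison of lincRNA identification performance across classification models

| Algorithms | Sn/% | Sp/% | ACC/% | MCC |
|---|---|---|---|---|
| PhyloCSF | 84.7 | 73.2 | 78.7 | 0.578 |
| LncRNA-ID | 85.8 | 83.9 | 84.8 | 0.698 |
| CPC | 86.4 | 88.1 | 87.7 | 0.687 |
| RF | 94.1 | 93.2 | 93.7 | 0.873 |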