JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2019, Vol. 54, Issue (3): 85-92, 101. doi: 10.6040/j.issn.1671-9352.0.2018.261


Identification of large intergenic non-coding RNAs using random forest

Wei-na XU*, Guang-le ZHANG, Shi-hong LI, Yuan-yuan CHEN, Qiang LI, Tao YANG, Ming-min XU, Ning QIAO, Liang-yun ZHANG

  1. College of Science, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
  • Received: 2018-05-15 Online: 2019-03-01 Published: 2019-03-19
  • Contact: Wei-na XU E-mail: 2015111001@njau.edu.cn; zlyun@njau.edu.cn
  • Supported by:
    National Natural Science Foundation of China (11571173); National Natural Science Foundation of China (11401311); National Natural Science Foundation of China (11601231)

Abstract:

Accurate identification of lincRNAs provides a data source for understanding their regulatory mechanisms. Using features that include minimum free energy (MFE) and signal-to-noise ratio (SNR), we remove redundant features according to their feature contribution and build a random forest model to identify lincRNAs. Evaluated on the same experimental dataset, the new method reaches a sensitivity, specificity and accuracy of 94.1%, 93.2% and 93.7%, respectively, each higher than the corresponding value for PhyloCSF, LncRNA-ID and CPC. The proposed method is therefore more robust and classifies more effectively than these tools.

Key words: long intergenic non-coding RNA, random forest algorithm, minimum free energy, signal-to-noise ratio

CLC Number: Q61

Fig.1 Power spectrum curves of lincRNA (A) and mRNA (B)

Fig.2 Flow diagram of lincRNA identification

Fig.3 Distribution of lincRNA (A) and mRNA (B) sequences relative to an MFE threshold of -500

Fig.4 Distribution of lincRNA (A) and mRNA (B) sequences relative to an SNR threshold of 3
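
Fig.1 and Fig.4 refer to power spectra and an SNR threshold, but this page gives no formula. A common definition in the 3-base periodicity literature takes the total Fourier power of the four base-indicator sequences at frequency N/3 and divides it by the mean power over the nonzero frequencies. The Python sketch below assumes that definition, which may differ in detail from the one used in the paper; the function names are hypothetical.

```python
import cmath

def power_spectrum(seq):
    """Total power P[k], summed over the A, C, G, T indicator sequences."""
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    spectrum = []
    for k in range(n):
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of the 0/1 indicator sequence of this base
            coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, ch in enumerate(seq) if ch == base)
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum

def snr_at_one_third(seq):
    """Assumed SNR: power at frequency N/3 over mean power for k = 1..N-1."""
    n = len(seq) - len(seq) % 3      # trim so that N/3 is an integer index
    p = power_spectrum(seq[:n])
    mean_power = sum(p[1:]) / (n - 1)
    if mean_power < 1e-12:           # flat spectrum, e.g. a homopolymer
        return 0.0
    return p[n // 3] / mean_power

# Toy input: a perfectly 3-periodic sequence gives a strong peak at N/3.
periodic_snr = snr_at_one_third("ATG" * 20)
print(periodic_snr)
```

The direct DFT here is quadratic in sequence length and is meant only to make the definition explicit; an FFT routine would be the practical choice for real transcripts.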

Fig.5 Effect of the number of features on the classification accuracy of the RF model

Table 1 Classification performance of ELM, SVM and RF with and without the feature set F1

Algorithms Sn/% Sp/% ACC/% MCC
ELM 67.3 91.7 79.5 0.608
SVM 89.2 87.1 88.2 0.764
RF 94.1 93.2 93.7 0.873
ELM-F1 65.8 86.1 75.7 0.526
SVM-F1 85.7 84.3 84.8 0.696
RF-F1 86.3 88.0 87.6 0.685
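
The columns Sn, Sp, ACC and MCC are presumably the usual confusion-matrix measures (sensitivity, specificity, accuracy, Matthews correlation coefficient); this page does not define them. The Python sketch below works under that assumption, treating lincRNAs as the positive class. The counts of 1000 sequences per class are hypothetical, chosen only so that the rates match the RF row of Table 1.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity (recall on positives)
    sp = tn / (tn + fp)                       # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)     # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}

# Hypothetical balanced test set: 1000 lincRNAs and 1000 mRNAs.
metrics = classification_metrics(tp=941, fn=59, tn=932, fp=68)
print(metrics)
```

With these hypothetical counts the MCC comes out at about 0.873, which agrees with the RF row and suggests that the table uses these standard definitions on a balanced dataset.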

Table 2 Classification performance with different sampling strategies

Sampling Sn/% Sp/% ACC/% MCC
Under-sampling 94.10 93.20 93.70 0.873
Unprocessed 86.18 96.49 93.46 0.832
Over-sampling 94.43 91.32 92.88 0.858
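
Table 2 compares under-sampling and over-sampling, but this page does not say how either was carried out. The Python sketch below assumes the simplest variants: random removal from the majority class and random duplication within the minority class. The function names and toy data are hypothetical.

```python
import random

def under_sample(majority, minority, seed=0):
    """Randomly keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def over_sample(majority, minority, seed=0):
    """Randomly duplicate minority samples until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

# Toy data: 10 majority-class items and 3 minority-class items.
majority = list(range(10))
minority = ["a", "b", "c"]
u_major, u_minor = under_sample(majority, minority)
o_major, o_minor = over_sample(majority, minority)
print(len(u_major), len(u_minor), len(o_major), len(o_minor))
```

Both functions assume the majority class is at least as large as the minority class. Under-sampling discards data, while over-sampling repeats minority examples.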

Table 3 Classification performance with different training datasets

Dataset Sn/% Sp/% ACC/% MCC
Balanced training dataset 94.1 93.2 93.7 0.873
Unbalanced training dataset 93.2 91.3 91.7 0.787

Fig.6 Receiver operating characteristic (ROC) curves of three different tools on the same dataset
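
ROC curves such as those in Fig.6 are often summarized by the area under the curve (AUC). This page reports no AUC values, so the Python sketch below only illustrates one standard way to compute it: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The scores and labels are made up.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up classifier scores; label 1 = lincRNA, label 0 = mRNA.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

The pairwise form is quadratic in the number of examples; for large datasets a rank-based computation gives the same value more efficiently.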

Table 4 Performance comparison of our prediction model with other methods for identifying known lincRNAs

Algorithms Sn/% Sp/% ACC/% MCC
PhyloCSF 84.7 73.2 78.7 0.578
LncRNA-ID 85.8 83.9 84.8 0.698
CPC 86.4 88.1 87.7 0.687
RF 94.1 93.2 93.7 0.873