A basic evaluation of the Coding-Potential Assessment Tool in Blast2GO
To evaluate the overall performance and different analysis options of the Coding Potential Assessment Tool available in Blast2GO we choose as test dataset the transcripts of Chr.1 of Equus caballus (horse) from the SRA database (NCBI). This dataset contains 455 transcripts assembled with Cufflinks from the transcriptome of equine peripheral blood obtained with Illumina HiSeq 2000 paired-ends.
Fig.1 -Two graph ROC curves for automatic detection of
To assign a to coding potential value to each transcript the tool uses an empirical model built from coding sequences (mRNA and CDS) and non-coding RNA (ncRNA, lncRNA, snoRNA, etc). In the case, the number of available ncRNA sequences is less than necessary it is possible to use sequences from a higher, parent-taxa organism. Blast2GO offers a number of pre-built models for the most commonly used organisms as well as offers the option to create a species-specific model from the NCBI database. A third option allows to directly use local files if coding and non-coding sequences in fasta format to train a model.
To calculate the coding potentials of the "horse"-transcripts it is possible to use a mammal, pre-built model. However, we preferred to build a species-specific model with the sequences from the NCBI keeping the accuracy automatically calculated. Once the model has been created we observed an automatic accuracy of 0.939. Since the automatically assigned assurance seemed low for this model we decided to rerun the tool increasing the accuracy to 0.96. In this case, our sequences will be grouped in 3 categories: coding, unknown and non-coding.
Assigning the accuracy manually we obtain two different cutoffs, a coding cutoff and a non-coding cutoff. With this two cutoff, we obtain the transcripts classified in three different categories:
- : if the coding potential is greater than the coding cutoff.
- : if the coding potential is less than the non-coding cutoff.
- : if the coding potential is between the two cutoffs, the transcript is tagged as a transcript with unknown coding potential.
In order to evaluate the performance of this tool, we create a dataset of well-known protein coding sequences with the 'Load sequences from BioMart' tool, selecting as dataset the first 1000 coding genes of the human (GRCh38.p7) from the 'Ensembl Genes 87. Once we have downloaded the sequences, we run the CPAT tool with the human prebuilt model, obtaining 99.1% as coding sequences. The sequences tagged as 'non-coding' have been blasted and revealed on his hits some a very unspecific Blast hit having a very small HSP with very high similarity.
Once the quality and precision of the model are adjusted, we can evaluate the prediction results.
On one hand, we obtain a "Coding Potential Table" which classifies each transcript with a tag and offers the values of the RNA length, the ORF length, the scores from where derive the coding potential score (Fickett score and Hexamer score) and the coding potential score itself. On the other hand, we can see a pie chart statistics of the coding potential distribution.
To further evaluate the classification transcripts have been blasted against a protein database. As expected, most of the non-coding results have also no Blast results while the coding ones do. However, we also observe transcripts classified as non-coding with Blast results - these might reveal gene expression regulators like ncRNA or miRNAs present in the protein database (also very unspecific Blast hits with a small HSP and high similarity could be aligned). These transcripts should be reviewed.
We also observe several transcripts classified as coding but without any Blast results. This could indicate that these potentially coding transcripts have not yet been added yet to any protein database and therefore might represent novel mRNAs.
1 Wang, L., Park, H. J., Dasari, S., Wang, S., Kocher, J. P., & Li, W. (2013). CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic acids research, 41(6), e74-e74.
2 Wucher, V., Legeai, F., Hedan, B., Rizk, G., Lagoutte, L., Leeb, T., ... & Cirera, S. (2017). FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Research, gkw1306.