Eukaryotic gene finding with Blast2GO
A basic evaluation of Augustus
To evaluate the performance and accuracy of the gene finding features now available in Blast2GO, we chose chromosome 1 (Chr. 1) of Sus scrofa (pig) from the RefSeq database (NCBI) as the test dataset. This dataset contains 692 contigs with a total size of 315,321 kb. Additionally, a mapped RNA-seq dataset of Illumina HiSeq 2500 paired-end reads from the NCBI SRA database (SRR3159988) was used to improve the accuracy via intron hints.
Two benchmarks were run: one to evaluate the accuracy and one to evaluate the performance.
Gene finding was performed on chromosome 1 of various species with both the ab initio method and the RNA-seq supported method. For each species, we compared the number of detected genes with the number of genes present in the RefSeq database. As shown in the figure below, both methods overestimated the number of genes. However, using RNA-seq intron/exon information significantly reduced the number of false positives.
Additionally, a Blast search against Sus scrofa (3573 genes, Chr. 1) was performed in Blast2GO for all 3669 genes predicted by Augustus with RNA-seq support. A total of 95.2% of the predictions were confirmed (95.8% against NR), which leaves 4.8% as false positives.
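The percentages above can be turned back into approximate gene counts. The following is a minimal sketch in Python that uses only the figures quoted in the text. The absolute counts it derives are rounded estimates, not values reported by the benchmark, and the variable names are chosen here for illustration.

```python
# Figures quoted in the text.
predicted_genes = 3669      # genes predicted by Augustus with RNA-seq support (Chr. 1)
refseq_genes = 3573         # RefSeq genes on Sus scrofa Chr. 1
confirmed_fraction = 0.952  # share of predictions confirmed by the Blast search

# Derived values (estimates: the text only reports rounded percentages).
confirmed = round(predicted_genes * confirmed_fraction)
unconfirmed = predicted_genes - confirmed
unconfirmed_pct = 100 * unconfirmed / predicted_genes
excess_pct = 100 * (predicted_genes - refseq_genes) / refseq_genes

print(f"confirmed predictions:   about {confirmed}")
print(f"unconfirmed predictions: about {unconfirmed} ({unconfirmed_pct:.1f}%)")
print(f"predicted vs. RefSeq:    {excess_pct:.1f}% more genes predicted")
```

Under these figures, the RNA-seq supported run predicts only a few percent more genes than RefSeq lists for the chromosome, which is consistent with the small share of unconfirmed predictions.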
We compared the gene finding process within Blast2GO (CloudSystem) against the local command-line version of Augustus (4 cores at 2.6 GHz, 16 GB RAM). As shown in the figure below, parallel execution of Augustus with Blast2GO reduced the execution time drastically. However, the reduction depends on the number of scaffolds used for the predictions: more scaffolds allow more predictions to run in parallel, which shortens the overall run time.
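Why the scaffold count matters can be illustrated with a toy scheduling model. The sketch below is not how the Blast2GO CloudSystem distributes work, which the text does not describe. It assumes that prediction time is proportional to scaffold length and that each scaffold is handed to the least-loaded worker, longest scaffold first. The function name, the `seconds_per_kb` parameter, and the scaffold sizes are all hypothetical.

```python
import heapq


def estimated_wall_time(scaffold_sizes_kb, workers, seconds_per_kb=1.0):
    """Estimate wall time when scaffolds are predicted in parallel.

    Assumption: run time is proportional to scaffold length. Scaffolds are
    assigned longest first, each to the worker with the smallest current load.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0.0] * workers  # accumulated time per worker (a valid min-heap)
    for size in sorted(scaffold_sizes_kb, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size * seconds_per_kb)
    return max(loads)  # the busiest worker determines the wall time


# Same total sequence (1200 kb), split differently, on 4 workers.
one_scaffold = estimated_wall_time([1200], workers=4)
twelve_scaffolds = estimated_wall_time([100] * 12, workers=4)
print(one_scaffold, twelve_scaffolds)
```

In this model a single large scaffold keeps one worker busy while the others idle, whereas the same sequence split into many scaffolds spreads evenly across the workers.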
The implementation of Augustus within Blast2GO allows accurate, state-of-the-art gene finding. By integrating RNA-seq data, it makes it easy to reduce the false positives typically observed with ab initio gene finding methods. Furthermore, Blast2GO runs the predictions in parallel on a cloud system, which saves time and frees local resources. The output is a Blast2GO project containing the predicted genes, together with a GFF object that includes the genomic annotations. These datasets can be used directly for functional annotation in Blast2GO or exported in different formats.
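Once the annotations are exported as GFF, they can be inspected with a few lines of code. The following Python sketch counts gene features per sequence, assuming the standard nine-column, tab-separated GFF layout with the feature type in the third column. The sample records are made up for illustration and are not taken from the benchmark, and the exact contents of a Blast2GO export may differ.

```python
from collections import Counter

# Made-up sample records in nine-column GFF layout (illustration only).
SAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1",
    "scaffold_1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tParent=g1.t1",
    "scaffold_1\tAUGUSTUS\tgene\t1500\t2600\t0.88\t-\t.\tID=g2",
    "scaffold_2\tAUGUSTUS\tgene\t50\t700\t0.80\t-\t.\tID=g3",
])


def genes_per_sequence(gff_text):
    """Return a dict mapping sequence ID to its number of 'gene' features."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # ignore lines that are not well-formed GFF records
        seqid, feature = fields[0], fields[2]
        if feature == "gene":
            counts[seqid] += 1
    return dict(counts)


print(genes_per_sequence(SAMPLE_GFF))
```

Summing the per-sequence counts gives the total number of predicted genes, which is the quantity compared against RefSeq in the accuracy benchmark.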
Mario Stanke and Burkhard Morgenstern (2005), "AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints", Nucleic Acids Research, 33, W465-W467