Blast2GO Blog

b

De novo genome assembly and annotation of sheath rot fungus Sarocladium oryzae with Blast2GO

This project reproduces a study carried out by Hittalmani S. et al. in 2016 about the S. oryzae fungus which causes sheath rot of rice. We used Blast2GO for the main analysis steps like the structural and functional annotation and analysis.
(Original research paper: https://doi.org/10.1186/s12864-016-2599-0)

 

1.  Introduction 

Sheath rot disease caused by S. oryzae is an emerging threat to rice cultivation. This fungus invades rice through stomata and wounds caused by insects. Upon infection, the fungus engenders necrotic lesions at the mesophyll and vascular tissues, which impede the adequate translocation of nutrients from the foliage to the panicle. This leads to the production of partially filled chaffy grains, affecting the production yield by up to 85%. Nonetheless, despite the threat this fungus entails, little is known about its life cycle and infection biology.

Thereby, as an attempt to attain information about putative genes involved in pathogenicity, Hittalmani S. et al. carried out a de novo genome assembly and annotation of S. oryzae. The goal was to create genomic resources, which could be translated into designing better disease management strategies and widen the understanding of rice-Sarocladium pathosystem.

Sarocladium oryzae



2.  Preprocessing of raw reads 

First, we downloaded the raw reads from the NCBI SRA database (accession number: SRX1639538). Since it was a paired-end library, reads were available in two different fastq files (SRR3233859_1 for one end, and SRR3233859_2 for the other). Read length was 151bp and fragment size ranged from 400 to 600 bp.

FastQ Quality Check of the reads was done with Blast2GO (in Tools > FastQ Tools > FastQ Quality Check) in order to assess read quality. Results showed adapter sequence contamination and reads with low values of Phred score. Therefore, processing was performed to remove low-quality reads and contamination (in Tools > FastQ Tools > FastQ Preprocessing).

Using custom Illumina adapters, we were able to discard adapter contaminants. Moreover, we carried out filtering of reads whose Phred score was less than 23. We also considered sequences with less than 36bp as uninformative, hence they were discarded from our dataset. Figures below illustrate the resulting read qualities and the adapter content of SRR3233859_1.fastq reads. Results from SRR3233859_2.fastq reads are very similar and therefore not shown.

 

Adapter sequence contaminants
Before processing After processing

Phred score quality
Before Preprocessing After Preprocessing

 

3.  Genome Assembly and Functional Annotation 

Once accurate reads had been obtained, genome assembly using SPAdes assembler was performed (v. 3.12.0), being the coverage threshold auto-set. The total genome size was about 32.9 Mbp, with 5,542 scaffolds and an N50 value of 20,743. Average genome coverage obtained was 81.73.

With the attained scaffolds, structural annotation of the genome was carried out, as an attempt to predict genes present in S. oryzae genome. Augustus ab initio eukaryotic gene finding was performed with Blast2GO (in genefind > Eukaryotic GeneFinding). In order to predict genes more accurately, Fusarium graminearum was stated as a closely related species, considering that codon usage bias might be very similar between them. Complete genes were searched in both strands, and the generated output was a protein fasta. Finally, 10,331 proteins were retrieved.

Functional analysis of the retrieved genes was then carried out in Blast2GO. The following steps were:

  • CloudBlast: blastp of our gene sequences against SwissProt database
  • Mapping and annotation, with default parameters, of GO terms associated with each sequence.      
  • InterProScan of sequences against the default databases in order to further obtain new GO terms.
  • Merging of IPS results with annotated ones.

 

Results from the annotation are shown in the right-hand figure. 68% of genes were annotated.

In order to further increase the annotation of our genes, we searched
for COG ortholog groups in Blast2GO (in Analysis > Obtain COG
Ortholog Groups) with the default parameters, taking advantage of
EggNOG database. Found GOs were then merged with the annotation project.
The bar chart illustrates this merging step. As indicated by the red bar,
the amount of annotated GOs after merging has increased by 30.8%,
hence we have acquired more information about the functions of our
set of genes. Simultaneously, 18,978 previously annotated GOs have
been found to be too general, thus discarded (orange bar).

 

4.  Search of pathogenicity genes 

Once the annotation of genes was finished, we proceeded with the identification of genes potentially involved in the pathogenicity of S. oryzae.

To identify putative genes involved in pathogenicity, S. oryzae proteomes were analysed for pathogen-host interaction (PHI) genes, secretory proteins, carbohydrate-active enzymes (CAZymes) and secondary metabolites biosynthetic genes that are required to colonize the host tissue.

 

         4.1  Putative Pathogen-Host Interaction (PHI) Genes

First, the Pathogen-Host interaction database (PHI-base) was downloaded from the internet and blastp against it was carried out. The mission of PHI-base is to provide curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions. Nonetheless, this database contains experimentally verified virulence-associated genes from bacteria, fungi and oomycetes. Hence blastp results were filtered by fungi species with a minimum of 40% similarity and a minimum of 70% query coverage. A total of 1,282 genes showed significant blast results. These sequences were then selected in the annotation project and an enrichment analysis was carried out with Fisher's exact test (in Analysis > Enrichment Analysis (Fisher's Exact Test)) in Blast2GO (FDR cutoff 0.05). The following generated WordCloud shows that the pathogenesis term was significantly represented in our sequences.

 

 

We assume that these genes are putative candidate genes to induce pathogenicity in S. oryzae, since their role in pathogenesis has been proven previously in other species.

 

          4.2.  Secretome analysis

The cell secretome is a collection of proteins consisting of transmembrane proteins (TM) and proteins secreted by cells into the extracellular space.

We detected transmembrane domains in our genes taking advantage of EMBL-EBI InterProScan against TMHMM database in Blast2GO (in interpro > InterProScan (online) > EMBLE-EBI InterPro). No GO terms were retrieved from this analysis, but InterPro IDs (transmembrane helix). Notwithstanding, we filtered these TMHMM-positive genes, which were already annotated in previously discussed steps and carried out another enrichment analysis with Fisher's Exact Test (FDR cutoff 0.05). As illustrated in the image below, many genes containe transmembrane components.

 

 

In addition, enzyme codes were associated with these sequences (in charts > Make Statistics > Enzyme > Main Enzyme Classes): oxydoreductases and, mainly, hydrolases (image below). Hydrolases Code Distribution (in charts > Make Statistics > Enzyme > Hydrolases) showcased that all hydrolases were peptidases. These results suggest that S. oryzae shows an important capacity to release secretory components in charge of infecting and damaging plant tissues.

 

 

          4.3.  Analysis of Carbohydrate-Active Enzymes (CAZymes)

CAZymes are acknowledged to play a vital role in the metabolism of structural components of the cell wall and the storage of glucans in plant pathogens. They are responsible for breaking down cell wall components of the host, hence leading to a successful infection. Basically, pathogens secrete these enzymes to degrade plant components and use them as a nutrient source [1]. Therefore, in an attempt to elucidate the infectious mechanism of S. oryzae, we have looked for the presence of these enzymes.

By making use of the Gene Ontology webpage, we searched for GO terms related to glycosylases, glycosyltransferases, carbohydrate esterases, carbohydrate-binding molecules and polysaccharide lyases. All of them were found in our annotated results, meaning that S. oryzae may use these enzymes to invade plant hosts.

 

          4.4.  Secondary metabolite production: analysis of helvolic acid and cerulenin biosynthetic pathways

It has been demonstrated that secondary metabolites are essential for fungal growth and development. Secondary metabolites provide protection against several environmental stresses, hence implying an advantage to bear possible hard conditions engendered by the plant being infected [1].

Furthermore, it has been suggested that a concomitant production of helvolic acid and cerulenin might have a synergistic effect to invade the host by changing the cell permeability, leading to leakage of electrolytes in the host tissue.

Taking advantage of the information available in the Gene Ontology Consortium webpage, we compiled the GO terms associated with genes involved in the biosynthesis of these secondary metabolites: nonribosomal peptide synthases, polyketide synthases, prenyltransferases and terpene cyclases. Then, these GO terms were searched in Blast2GO. All of them were found in the annotated project. Moreover, we also used the KEGG pathway database to search possible pathways in our organism implicated in secondary metabolite biosynthesis, also with Blast2GO (in Analysis > Enzyme COG and KEGG > Load Pathway-Maps from KEGG (online)). Interesting pathways were found: polyketide sugar unit biosynthesis (shown below), biosynthesis of terpenoids and steroids and terpenoid backbone biosynthesis.

 

 

5.  Prediction of repeats and SSRs

As a fungal species, repetitive DNA is an integral part of S. oryzae genome. It is important to study the types of repetitive DNA sequences present in its genome since they generate genetic diversity and lead to genome expansion.

S. oryzae scaffold sequences were assessed for the prediction of repeats and SSRs with Repeat Masker (version 4.0.7) using the RepBase database and the default Dfam database. Furthermore, the Microsatellite Identification Tool (MISA) helped in determining SSRs present in our genome.

We run RepeatMasker with Sarocladium zeae as a closely related species. Repetitive DNA was limited to small RNA (0.04%), simple repeats (1.64%) and low complexity reads (0.28%).

MISA detected 16,241 SSRs, from which 61.4% were of just one unit size, 21.4% of two units size, and 13.8% of three units size.

 

6.  Conclusions 

  1. Processing of raw reads was carried out with Blast2GO and an important step to assures data quality for the downstream analysis.
  2. Blast2GO proved to be useful to find genes in the S. oryzae's genome and to functionally annotate them.
  3. The Fisher's Exact Test performed in Blast2GO enabled the identification of enriched functions in gene-sets like pathogenicity-related genes, genes that form part of the cell secretome or genes that codify for hydrolytic enzymes.
  4. GO terms had been used to identify carbohydrate-active enzymes (CAZymes) in the Blast2GO annotated genome. In concordance with the conclusions extracted in the original article, the presence of CAZymes in S. oryzae confirms that this species is capable of degrading cell wall components and infect plant tissues.
  5. Results retrieved from the KEGG pathway tool in Blast2GO reinforce the conclusions indicated in the original research regarding the capacity of S. oryzae to synthesize secondary metabolites involved in the acclimatisation to environmental conditions - a mechanism which represents a survival advantage for S. oryzae.
  6. Results obtain via MISA and RepeatMasker analysis confirmed the observations of Hittalmani S. et al. and its reproducibility demonstrates the robustness of the used analysis pipeline and tools.

 

7.  References 

  1. Hittalmani S, Mahesh HB, Mahadevaiah C, Prasannakumar MK (2016) De novo genome assembly and annotation of rice sheath rot fungus Sarocladium oryzae reveals genes involved in Helvolic acid and cerulenin biosynthesis pathways. BMC Genomics 17:271.
  2. Götz S, et al.. (2008) High throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Research, vol 36, pp. 3420-3435.
  3. Andrews S (2010) FastQC: a quality control tool for high throughput sequence database
  4. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic. Aflexible trimmmer for Illumina Sequence data. Bioinformatics, btu170.
  5. Smit AFA, Hubley R & Green P(2013-2015) RepeatMasker Open-4.0.
  6. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28(1):27-30.
  7. Ashburneret al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25-9.
  8. GO Consortium (2017) Nucleic Acids Research.
  9. Bankevich A, et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational Biology.
  10. Mario Stanke and Stephan Waack (2003), Gene Prediction with a Hidden-Markov Model and a new Intron Submodel. Bioinformatics, vol 19 suppl 2, pages 215-225.
  11. Huerta-Cepas J (2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Research 44 (D1):D286-D293.
  12. Winnenburg R, et al. (2006), PHI-base: a new database for pathogen host interactions. Nucleic Acids Research 1:34:D459-D464.
  13. Moller S, Croning MDR, Apweiler R (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17(7):646-653.
  14. Robert Hubley et al. (2016). The Dfam database of repetitive DNA families. Nucleic Acids Research 44:D81-89.
  15. Bao W, et al (2015). Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6:11.
  16. Beier S, et al. (2017). MISA-web: a web server for microsatellite prediction. Bioinformatics 33(16):2583–2585.

 

 

FORUM

Join our Blast2GO Google Group