How to perform a Functional Enrichment Analysis
The main steps of the bioinformatic process to perform a de novo transcriptomic analysis have been described in previous blog posts. A common RNA-seq analysis pipeline includes the quality control and preprocessing of raw RNA-seq reads, the structural and functional characterization of the transcriptome, the expression quantification and the differential expression analysis. Although RNA-seq analysis has become a routine procedure in biological research, extracting biological insight from its results remains a major challenge.
Transcriptomics experiments often identify thousands of significant genes, and researchers usually want to retrieve a functional profile of these genes in order to gain a better understanding of the underlying biological processes. Functional enrichment analysis is a procedure to identify functions that are over-represented in a set of genes and may have an association with an experimental condition (e.g. phenotype, treatment…). These methods use statistical approaches to identify significantly enriched or depleted groups of genes.
Methods to perform a Functional Enrichment Analysis
By combining differential expression results with functional annotations, enrichment analysis can be carried out using two different methods: Fisher’s Exact Test and GSEA.
Fisher’s exact test, developed by R. A. Fisher in 1935, is a statistical significance test used in the analysis of contingency tables. For enrichment analysis it is combined with a False Discovery Rate (FDR) correction, because one test is performed per function and the p-values must therefore be corrected for multiple comparisons. FDR control limits the expected proportion of incorrectly rejected null hypotheses (“false positives” or “false discoveries”) among the findings reported as statistically significant. The FDR method of Benjamini and Hochberg (1995) is used.
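To make the correction concrete, here is a minimal sketch of the Benjamini and Hochberg adjustment in plain Python. The function name `benjamini_hochberg` is our own and is used only for illustration; it is not part of Blast2GO or FatiGO, and those tools may implement the correction differently.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per tested function.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

A function would then be reported as significant when its adjusted p-value falls below the chosen FDR threshold (0.05 is a common choice).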
This test is integrated into the FatiGO software package. FatiGO was designed to extract relevant GO terms for a group of genes with respect to a reference set of genes (typically the rest of the genome). The group of interesting genes is usually referred to as the “test-set”. When enrichment is used to explore differential expression results, this group is usually composed of genes that have been classified as up- or down-regulated when comparing two experimental conditions (e.g. treated and untreated samples). The remaining genes (e.g. the rest of the genome) are used as the reference set. The result is a list of statistically significant Gene Ontology terms ranked by their adjusted p-values and classified as over-represented or under-represented.
Blast2GO integrates the FatiGO package and therefore offers the possibility of applying Fisher’s Exact Test. The enrichment analysis only requires the functional annotations and the list of gene identifiers in which over-represented functions are to be detected. Results can be viewed in several ways: as a table, directly on the Gene Ontology Graph or as a bar chart, always coloring statistically significant terms in red (over-represented) and green (under-represented).
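For a single GO term, the comparison between test-set and reference set boils down to a 2x2 contingency table (annotated or not, in the test-set or in the reference). The sketch below computes a one-sided Fisher's exact p-value for over-representation from that table, using only the Python standard library. The function name and the gene counts are hypothetical, and the one-sided tail is a simplification: it is not necessarily how FatiGO computes its p-values.

```python
from math import comb


def fisher_over_representation(test_hits, test_size, ref_hits, ref_size):
    """One-sided Fisher's exact p-value for over-representation.

    test_hits: genes in the test-set annotated with the term
    test_size: total genes in the test-set
    ref_hits:  genes in the reference set annotated with the term
    ref_size:  total genes in the reference set
    """
    annotated = test_hits + ref_hits   # annotated genes overall
    total = test_size + ref_size       # all genes considered
    not_annotated = total - annotated
    # Probability of seeing test_hits or more annotated genes in the
    # test-set if genes were drawn at random (hypergeometric tail).
    upper = min(annotated, test_size)
    tail = sum(
        comb(annotated, k) * comb(not_annotated, test_size - k)
        for k in range(test_hits, upper + 1)
    )
    return tail / comb(total, test_size)


# Hypothetical counts: 12 of 100 test-set genes carry the term,
# versus 40 of 1900 reference genes.
p_value = fisher_over_representation(12, 100, 40, 1900)
```

In a real analysis this p-value would be computed once per GO term and the resulting list corrected for multiple testing before ranking the terms.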
On the other hand, Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
In a typical experiment, mRNA expression profiles are generated from a collection of samples belonging to two classes, for example, normal cells versus cancerous cells. The genes can be ordered in a ranked list according to their differential expression between the classes (e.g. according to their fold change). The GSEA approach helps to interpret this list.
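As an illustration of how such a ranked list can be built, the following sketch orders genes by log2 fold change between two classes. The function name, the input dictionaries of mean expression per gene and the pseudocount of 1 are all assumptions made for this example; in practice the ranking statistic usually comes straight from the differential expression tool.

```python
from math import log2


def rank_by_log2_fold_change(mean_expr_a, mean_expr_b, pseudocount=1.0):
    """Rank genes by log2(B / A), highest (most up-regulated in B) first.

    mean_expr_a, mean_expr_b: dicts mapping gene ID -> mean expression.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    ranked = []
    # Only genes measured in both classes can be compared.
    for gene in mean_expr_a.keys() & mean_expr_b.keys():
        ratio = (mean_expr_b[gene] + pseudocount) / (mean_expr_a[gene] + pseudocount)
        ranked.append((gene, log2(ratio)))
    # Sort by decreasing fold change; break ties by gene ID for stable output.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical mean expression values for normal (a) and tumour (b) samples.
normal = {"g1": 1.0, "g2": 7.0, "g3": 3.0}
tumour = {"g1": 7.0, "g2": 1.0, "g3": 3.0}
ranked_list = rank_by_log2_fold_change(normal, tumour)
```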
Given a priori gene sets that have been grouped together (e.g. by their involvement in the same biological pathway, by proximal location on a chromosome or because they share the same GO category), GSEA determines whether members of a gene set tend to occur toward the top or bottom of the list, in which case the gene set is correlated with the phenotypic class distinction. There are three key elements of the GSEA method:
- The calculation of an Enrichment Score that reflects the degree to which a set is overrepresented at the extremes of the ranked list.
- The estimation of the statistical significance of the enrichment score.
- The adjustment for multiple hypothesis testing.
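The first of these elements, the Enrichment Score, can be sketched as a weighted running sum over the ranked list: the sum increases when a gene belongs to the set and decreases otherwise, and the score is the largest deviation from zero. The function below is a simplified illustration with hypothetical names and data, not the implementation used by Blast2GO or by the original GSEA software, and it leaves out the significance estimation and the multiple testing adjustment.

```python
def enrichment_score(ranked_genes, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: list of (gene_id, score), sorted by decreasing score.
    gene_set:     set of gene IDs defined a priori.
    weight:       exponent applied to |score| for genes in the set.
    """
    hits = [gene in gene_set for gene, _ in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Total weight of the genes in the set, used to normalise hit steps.
    hit_total = sum(
        abs(score) ** weight
        for (_, score), is_hit in zip(ranked_genes, hits)
        if is_hit
    )
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must cover part, but not all, of the list")

    running = 0.0
    best = 0.0
    for (_, score), is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        # Keep the deviation that is furthest from zero, with its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list (e.g. by log2 fold change) and gene set.
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es_top = enrichment_score(ranked, {"g1", "g2"})
```

A positive score indicates that the set concentrates at the top of the list and a negative score that it concentrates at the bottom. To assess significance, the score is typically compared against scores obtained from permuted data.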
Blast2GO includes the GSEA computational method. For this analysis, all of the sequences involved (though not necessarily only those), together with their annotations, must be loaded in the application. The ranked list of genes can be provided by uploading text files or ID-Value_list.b2g files containing the sequence IDs and a statistical value for each one. In the framework of differential expression analysis, the fold change (or log2 fold change) is usually used as the ranking value, since it reflects the differences in expression between the two experimental conditions. Once the analysis is completed, a result table shows the adjusted p-values and enrichment scores of each annotation that passes a given threshold. Furthermore, results can be visualized on the GO DAG and in additional charts.