Genes and Genomes
Written by Sergi Castellano

Overview
In what follows, we review the key concepts of eukaryotic gene and genome structure needed to understand the annotation of genomes. In this context, annotation refers to the description and location of genes and other biologically relevant features on a genomic sequence. Our main goal is to understand what genome annotation projects offer and, just as importantly, what they do not yet provide. In this regard, it is worth noting that current gene prediction programs, among other bioinformatics tools, systematically ignore much of the complexity of eukaryotic gene structure. This complexity comes from alternatively spliced gene structures, from non-canonical signals (that affect either splicing or translation) and from the regions that control gene transcription (promoters), which are not yet well understood. Here, we give an overview of some of the limitations and future directions of the gene prediction field.

Genomes
The genome is the genetic material of an organism, that is, the total amount of DNA in the cell. In eukaryotes, it is usually organized into a set of chromosomes, which are extremely long, highly condensed chains of DNA. In the picture below, human DNA is shown packaged into chromosome units (as seen during mitotic metaphase). Note the sister chromatids (which contain identical daughter DNA molecules), centromeres and telomeres.
Figure: Human chromosomes
It is time to introduce the three main genome browsers and check which genomes they serve. In this practical, we will stick to the Ensembl server, but feel free to browse the others later on.
Questions:

DNA
DNA molecules consist of two anti-parallel chains held together by complementary base pairs that form a double helix. This structure is of major importance for computational analysis and, to a great extent, determines how it can be performed.
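One computational consequence of this structure is that the sequence of one strand fully determines the other, so only one strand needs to be stored. The Python snippet below is a minimal sketch of that idea; the function name `reverse_complement` and the example sequence are illustrative assumptions, not part of the practical.

```python
# Complementary base pairs: A-T and C-G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand of `seq`, read 5'->3'.

    The complement gives the paired bases; reversing accounts for the
    two chains being anti-parallel.
    """
    return seq.translate(COMPLEMENT)[::-1]


forward = "ATGGCCTGA"                  # hypothetical example sequence
reverse = reverse_complement(forward)
print(forward)                         # ATGGCCTGA
print(reverse)                         # TCAGGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when moving between the forward and reverse strands.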
Figure: From DNA to chromosomes
Note that, in this course, we only analyze DNA at the primary sequence level. So, let's check it now. Please select chromosome 21 in the karyotype plot.
Questions:
Now, let's search for a specific gene across the whole genome. In the Find window, look up the "alcohol dehydrogenase" gene.
Questions:
However, before analyzing the alcohol dehydrogenase gene, we will review eukaryotic gene structure, processing and expression (see below).

Gene expression: from DNA to RNA to protein
Transcription, splicing and translation are the main processes that account for the expression of protein-coding genes. Each step is directed by sequence and structural signals. In what follows, we describe, from both the biological and the computational point of view, how these sequence motifs are used to go from DNA to RNA to the final protein product.
The scheme below highlights these processing steps:
Figure: mRNA processing pathway
In the alcohol dehydrogenase gene report, locate the "Prediction Transcript" section. Note the correspondence between levels of reported data and processing steps:
We will follow these links in the order shown, but first, move to the text below for a more precise discussion of eukaryotic gene structure.

Transcription
Transcription starts when a region upstream of the gene (the promoter region) is activated (bound) by transcription factors. This region determines whether a gene is transcribed from the forward or the reverse strand. In either case, the strand that is actually transcribed is called the template (or antisense) strand, and the other is called the coding (or sense) strand.
Figure: Promoter region
In short, transcription is the copying of DNA (template strand) into RNA (pre-mRNA). However, when analyzing mRNA, cDNA or EST data, bear in mind that the mRNA to be translated is identical in sequence to the coding strand, apart from U replacing T (coding here always refers to translation, not to transcription). That is, the mRNA is transcribed from the strand that has its complementary sequence. In conclusion, when annotating genomes, genes are annotated in relation to their coding strand.
Figure: The copy of the template strand
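The relationship between the template strand, the coding strand and the mRNA can be checked with a few lines of Python. This is a sketch under simple assumptions: sequences are plain upper-case strings written 5'->3', and the names `reverse_complement` and `transcribe` as well as the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite DNA strand, read 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(template):
    """Copy the template strand (given 5'->3') into RNA (5'->3').

    The RNA is complementary and anti-parallel to the template,
    with U in place of T.
    """
    return reverse_complement(template).replace("T", "U")


coding = "ATGGCCTGA"                   # hypothetical coding-strand sequence
template = reverse_complement(coding)  # the strand that is actually copied
mrna = transcribe(template)

print(template)  # TCAGGCCAT
print(mrna)      # AUGGCCUGA: the coding strand with U instead of T
```

This is why annotating a gene on its coding strand is convenient: the annotated sequence can be compared directly with mRNA, cDNA or EST data.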

Gene Structure
Eukaryotic genes are short DNA stretches within a genome with a peculiar and discrete structure.
Figure: Schematic representation of a two-exon eukaryotic gene on a DNA sequence
Gene prediction programs make use of this structure to find genes in a genome. The main characteristics are:
Questions:

Splicing
Splicing is an RNA-processing step in which introns are removed from the primary transcript. Splicing signals, GT (donor) at the 5' end of the intron and AG (acceptor) at its 3' end, delimit the exon-intron boundaries, so that exons (both coding and non-coding) can be joined together. In this way, the open reading frame, along with the 5' and 3' untranslated regions (UTRs), is ready to be processed by the ribosome.
Figure: The spliceosome complex splices out intron sequences
Follow the Transcript information link.
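The splicing step described above can be mimicked computationally when the exon coordinates are known. The following Python sketch is an illustration only: the function `splice`, the coordinate convention (0-based, end-exclusive) and the toy sequences are assumptions, and it accepts only the canonical GT...AG signals, which is exactly the kind of simplification that misses non-canonical introns.

```python
def splice(pre_mrna, exons):
    """Join exons after checking that each intron is a canonical GT...AG one.

    `pre_mrna` is written in the DNA alphabet (as in genomic or cDNA data).
    `exons` is a list of (start, end) pairs, 0-based and end-exclusive,
    sorted by position.
    """
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        intron = pre_mrna[end:next_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError("non-canonical intron: " + intron)
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy two-exon gene (hypothetical sequences).
exon1, intron, exon2 = "ATGGCC", "GTAAGTCTAACAG", "AAATGA"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)),
         (len(exon1) + len(intron), len(pre_mrna))]

print(splice(pre_mrna, exons))  # ATGGCCAAATGA
```

A real gene prediction program has to solve the harder inverse problem, since the exon coordinates are not given and must be inferred from the signals and the sequence content.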
Questions:

Translation
In translation, the mature mRNA sequence is decoded into a protein. Again, the ribosomal machinery is guided by several signals along the mRNA sequence to find the right open reading frame (ORF) and to know where translation should terminate.
Figure: Translation maps RNA to protein through a three-to-one letter code
On the other hand, there are three types of transcript data:
Follow the Protein information link.
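The three-to-one letter mapping used in translation is easy to sketch in Python. The example below is a simplification under stated assumptions: it uses the standard genetic code, starts at the first AUG it finds, and stops at the first in-frame stop codon. Real start-site selection depends on additional signals, and the names `CODON_TABLE` and `translate` and the example mRNA are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# (first base varies slowest); '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon.

    Returns an empty string if no AUG is present. Characters other than
    A, C, G and U inside the reading frame raise a KeyError.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "GGAUGGCCAAAUGA"   # hypothetical mRNA with a short 5' UTR
print(translate(mrna))    # MAK
```

Here the two leading bases play the role of a 5' UTR: they are part of the mRNA but lie outside the ORF, so they do not contribute to the protein.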
Questions:
Let's now try to gain deeper insight into the biological role of this gene. We will connect to a couple of web-based resources to learn more about our gene and protein. GeneCards is a database of human genes, their products and their involvement in diseases. Connect to this database and search by symbol/alias for the approved HUGO symbol of our adh gene: AKR1A1 (as shown in the Ensembl gene report).
Questions:
Finally, we will try to get all the possible identifiers for this gene across several databases. Connect to GeneLynx and do a quick search in human for the HUGO ID: AKR1A1.
Questions:
We will now browse the genomic region of the adh gene we are working with. Follow this link. There are four levels of resolution:
In general, current gene prediction programs cannot reliably predict alternative mRNAs unless transcriptional data (mRNAs, cDNAs and ESTs) are available.