An example of coding statistics: Codon usage
by Roderic Guigo, Genis Parra & Enrique Blanco
IMIM, Barcelona

From a chapter published in "Genetic Databases", M.J. Bishop ed., Academic Press, 1999
Unequal usage of codons in coding regions appears to be a universal feature of genomes across the phylogenetic spectrum. This bias arises mainly from (i) the uneven usage of amino acids in existing proteins and (ii) the uneven usage of synonymous codons. The bias in the usage of synonymous codons correlates with the abundance of the corresponding tRNAs, and the correlation is particularly strong for highly expressed genes. Codon usage is specific to the taxonomic group, and there is a correlation between taxonomic divergence and similarity of codon usage.

By comparing the frequency of codons in a region of a species' genome, read in a given frame, with the typical frequency of codons in that species' genes, it is possible to estimate the likelihood that the region codes for a protein in that frame. Regions in which codons are used with frequencies similar to the typical codon frequencies of the species are likely to be coding. This idea was first introduced by Staden and McLachlan (1982). In practice, the likelihood can be computed in a number of different ways. Here we compute it as a log-likelihood ratio. Let $F(c)$ be the frequency (probability) of codon $c$ in the genes of the species under consideration (in other words, $F$ is the codon usage table, see Table 1). Then, given a sequence of codons $C = C_1 C_2 \cdots C_m$, and assuming independence between adjacent codons,

\begin{displaymath}
P(C) = F(C_1) F(C_2) \cdots F(C_m)
\end{displaymath}

is the probability of finding the sequence of codons $C$ given that $C$ codes for a protein. For instance, if $S$ is the sequence $S = {\tt AGGACG}$, reading it in frame 1 yields the codons $C_1^1 = {\tt AGG}$, $C_2^1 = {\tt ACG}$. Then

\begin{displaymath}
P^1(S)=P(C^1) = F({\tt AGG}) F({\tt ACG})
\end{displaymath}

Substituting the appropriate values from Table 1, we obtain

\begin{displaymath}
P^1(S)=P(C^1) = 0.022 \times 0.038 = 0.000836
\end{displaymath}

On the other hand, let $F_0(c)$ be the frequency of codon $c$ in a non-coding sequence. Then

\begin{displaymath}
P_0(S) = P_0(C) = F_0(C_1) F_0(C_2) \cdots F_0(C_m)
\end{displaymath}

is the probability of finding the sequence $S$ if it is non-coding. Assuming a random model of non-coding DNA, in which all codons are equally likely, $F_0 (c) = 1/64 = 0.0156$ for all codons, and $P_0$ for the above sequence of codons $C$ would be

\begin{displaymath}
P_0(C)= 0.0156 \times 0.0156 = 0.000244
\end{displaymath}

The log-likelihood ratio for $S$ coding in frame $1$, $LP^1$, computed here with base-10 logarithms, is

\begin{displaymath}
LP^1(S) = \log_{10} (0.000836/0.000244) = \log_{10}(3.43) = 0.53
\end{displaymath}
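
Because the logarithm of a product is a sum, the same ratio can be accumulated codon by codon. As a sketch, write $C_1^j \cdots C_{m_j}^j$ for the codons of $S$ read in frame $j$ (the symbol $m_j$ is introduced here only for illustration) and take base-10 logarithms, which reproduce the value $0.53$ of the worked example:

\begin{displaymath}
LP^j(S) = \log_{10} \frac{P(C^j)}{P_0(C^j)} = \sum_{i=1}^{m_j} \log_{10} \frac{F(C_i^j)}{F_0(C_i^j)}
\end{displaymath}

Under the random model $F_0(c) = 1/64$, each codon therefore contributes $\log_{10} (64\, F(c))$, which is positive exactly when the codon is used more often in genes than expected by chance.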

The log-likelihood ratios for $S$ coding in frames $2$ and $3$ ($LP^2$ and $LP^3$) are computed in a similar way. As can be seen, in this case the log-likelihood ratio $LP$ is indeed greater than zero in the coding frame of the exon sequence, while it is smaller than zero in the non-coding frames of the exon sequence and in all frames of the intron sequence.
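
The worked example for frame 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the two codon probabilities are the values quoted in the text, the function names `codons` and `coding_probability` are hypothetical helpers introduced for illustration, and the logarithm is assumed to be base 10.

```python
import math

# Codon probabilities quoted in the worked example (not recomputed here).
F = {"AGG": 0.022, "ACG": 0.038}
# Random model: every one of the 64 codons is equally likely.
F0 = 1.0 / 64.0


def codons(seq, frame):
    """Split seq into complete codons, reading in frame 1, 2 or 3."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


def coding_probability(codon_list, table):
    """P(C): product of codon probabilities, assuming independent codons."""
    return math.prod(table[c] for c in codon_list)


S = "AGGACG"
C1 = codons(S, frame=1)                # codons of S in frame 1
p_coding = coding_probability(C1, F)   # F(AGG) * F(ACG)
p_random = F0 ** len(C1)               # (1/64) ** 2
lp1 = math.log10(p_coding / p_random)  # log-likelihood ratio, base 10

print(C1, round(lp1, 2))
```

Frames 2 and 3 would be scored by calling `codons(S, frame=2)` and `codons(S, frame=3)` with a table that covers the resulting codons; the two-entry table above is only enough for frame 1.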

In practice, the problem is usually not to determine the likelihood that a given sequence is coding, but to locate the (usually small) coding regions within large genomic sequences. The typical procedure is to compute the value of a coding statistic in successive (usually overlapping) windows (a sliding window), and to record the value of the statistic for each window. This generates a profile along the sequence in which peaks may point to coding regions and valleys to non-coding ones. In Figure 1, we plot the result of sliding a window of length 120 bp, with a distance of 10 bp between consecutive windows, computing $LP$ in the three different frames, and plotting the highest value obtained. In this case, the resulting profile reproduces fairly well the exonic structure of the human $\beta $-globin gene.
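
The sliding-window procedure can be sketched as follows. This is an illustrative outline rather than the code used to produce Figure 1: the function names are hypothetical, codons missing from the table are assumed to be uninformative (they contribute zero), window starts are reported as 0-based offsets, and the three-codon table and the repetitive test sequence are toy inputs built from Table 1 values.

```python
import math

UNIFORM = 1.0 / 64.0  # F_0(c) under the random model


def log_likelihood_ratio(seq, freq, frame=0):
    """Sum of log10(F(c) / F_0(c)) over complete codons of seq.

    frame is a 0-based offset (0, 1 or 2) into seq.
    """
    total = 0.0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # Assumption: a codon absent from freq scores as uninformative.
        total += math.log10(freq.get(codon, UNIFORM) / UNIFORM)
    return total


def lp_profile(seq, freq, window=120, step=10):
    """Return (window start, highest LP over the three frames) pairs."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        best = max(log_likelihood_ratio(chunk, freq, f) for f in range(3))
        profile.append((start, best))
    return profile


# Toy table: three codons from Table 1, converted from per-thousand values.
toy_freq = {"CTG": 0.03993, "TGC": 0.01386, "GCT": 0.02023}
toy_seq = "CTG" * 100  # 300 bp made only of a frequently used codon

profile = lp_profile(toy_seq, toy_freq)
for start, score in profile[:3]:
    print(start, round(score, 2))
```

With a complete 64-codon table and a real genomic sequence, plotting the second element of each pair against the first gives a profile of the kind described above.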


Table 1: The human codon usage and codon preference table as published by the Weizmann Institute of Science. For each codon, the table gives the amino acid, the frequency of usage of the codon (per thousand) in human coding regions (first numeric column) and the relative frequency of the codon among its synonymous codons (second numeric column).
The Human Codon Usage Table
AA   Codon  /1000  Frac    AA   Codon  /1000  Frac    AA   Codon  /1000  Frac    AA   Codon  /1000  Frac
Gly  GGG    17.08  0.23    Arg  AGG    12.09  0.22    Trp  TGG    14.74  1.00    Arg  CGG    10.40  0.19
Gly  GGA    19.31  0.26    Arg  AGA    11.73  0.21    End  TGA     2.64  0.61    Arg  CGA     5.63  0.10
Gly  GGT    13.66  0.18    Ser  AGT    10.18  0.14    Cys  TGT     9.99  0.42    Arg  CGT     5.16  0.09
Gly  GGC    24.94  0.33    Ser  AGC    18.54  0.25    Cys  TGC    13.86  0.58    Arg  CGC    10.82  0.19
Glu  GAG    38.82  0.59    Lys  AAG    33.79  0.60    End  TAG     0.73  0.17    Gln  CAG    32.95  0.73
Glu  GAA    27.51  0.41    Lys  AAA    22.32  0.40    End  TAA     0.95  0.22    Gln  CAA    11.94  0.27
Asp  GAT    21.45  0.44    Asn  AAT    16.43  0.44    Tyr  TAT    11.80  0.42    His  CAT     9.56  0.41
Asp  GAC    27.06  0.56    Asn  AAC    21.30  0.56    Tyr  TAC    16.48  0.58    His  CAC    14.00  0.59
Val  GTG    28.60  0.48    Met  ATG    21.86  1.00    Leu  TTG    11.43  0.12    Leu  CTG    39.93  0.43
Val  GTA     6.09  0.10    Ile  ATA     6.05  0.14    Leu  TTA     5.55  0.06    Leu  CTA     6.42  0.07
Val  GTT    10.30  0.17    Ile  ATT    15.03  0.35    Phe  TTT    15.36  0.43    Leu  CTT    11.24  0.12
Val  GTC    15.01  0.25    Ile  ATC    22.47  0.52    Phe  TTC    20.72  0.57    Leu  CTC    19.14  0.20
Ala  GCG     7.27  0.10    Thr  ACG     6.80  0.12    Ser  TCG     4.38  0.06    Pro  CCG     7.02  0.11
Ala  GCA    15.50  0.22    Thr  ACA    15.04  0.27    Ser  TCA    10.96  0.15    Pro  CCA    17.11  0.27
Ala  GCT    20.23  0.28    Thr  ACT    13.24  0.23    Ser  TCT    13.51  0.18    Pro  CCT    18.03  0.29
Ala  GCC    28.43  0.40    Thr  ACC    21.52  0.38    Ser  TCC    17.37  0.23    Pro  CCC    20.51  0.33
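
The relation between the two numeric columns of Table 1 can be checked for a single amino acid. The sketch below uses the four glycine codons copied from the table; the helper names `to_probability` and `relative_usage` are hypothetical and introduced only for illustration.

```python
# Per-thousand usage of the four glycine codons, copied from Table 1.
gly_per_thousand = {"GGG": 17.08, "GGA": 19.31, "GGT": 13.66, "GGC": 24.94}


def to_probability(per_thousand):
    """Convert per-thousand counts to probabilities F(c)."""
    return {codon: value / 1000.0 for codon, value in per_thousand.items()}


def relative_usage(per_thousand):
    """Fraction of each codon among the synonymous codons supplied."""
    total = sum(per_thousand.values())
    return {codon: round(value / total, 2)
            for codon, value in per_thousand.items()}


gly_prob = to_probability(gly_per_thousand)
gly_frac = relative_usage(gly_per_thousand)

print(gly_frac)
```

The fractions obtained this way agree with the second numeric column of the table for glycine (0.23, 0.26, 0.18 and 0.33).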


Figure 1: Values of the model-based coding statistic along the 2000 bp human $\beta $-globin gene sequence, computed on a sliding window of length 120 and step 10.

