DATASETS: "Guigó et al, PNAS, 100(3), 1140-1145, Feb. 4, 2003."

Genome Informatics Research Lab


SUPPLEMENTARY MATERIALS FOR

Comparison of mouse and human genomes
followed by experimental verification
yields an estimated 1,019 additional genes

R. Guigó, E. T. Dermitzakis, P. Agarwal, C. P. Ponting,
G. Parra, A. Reymond, J. F. Abril, E. Keibler,
R. Lyle, C. Ucla, S. E. Antonarakis, and Michael R. Brent*.

PNAS, 100(3):1140-1145 (Feb 4, 2003)

* To whom correspondence should be addressed.
Email: brent@cs.wustl.edu. Ph: +01 314-935-6621.

Contents

Summary.
Methods:

Gene-prediction based on two genomes comparison.

SGP2: describing the algorithm.
Twinscan: describing the algorithm.

Filtering best candidates.
RT-PCR validation of gene-predictions.
Homology search for predicted proteins.

Datasets:

Whole genome predictions on Mus musculus.
RT-PCR pools.
Predicted genes best candidates.

References.

Summary

A primary motivation for sequencing the mouse genome was to accelerate the discovery of mammalian genes by using sequence conservation between mouse and human to identify coding exons. This proved challenging due to the large proportion of the mouse and human genomes that is apparently conserved but not protein-coding. We developed two programs, SGP2 (Parra et al, 2002) and Twinscan (Korf et al, 2001; Flicek et al, 2002), which can exploit sequence conservation between genomes to identify candidate genes despite the abundance of conserved non-coding sequence. We also developed an enrichment process that selects a subset of highly reliable candidates by exploiting conservation in mouse-human exonic structure. RT-PCR amplification and direct sequencing applied to an initial sample of the predictions that do not overlap previously known genes verified 139 predictions. On average, the confirmed predictions show more restricted expression patterns than the mouse orthologues of the genes on human chromosome 21, and the majority lack both aligned mouse EST sequences and homologues in the fish genomes, demonstrating the sensitivity of SGP2 and Twinscan to hard-to-find genes. We verified 68 novel homologues of known proteins, including two homeobox proteins relevant to developmental biology and an aquaporin. We estimate that 1000 gene predictions that do not overlap known genes can be verified by this method. This is likely to constitute a significant fraction of the previously unknown, multi-exon, mammalian genes.

This page summarizes the methodology applied at each step of the described protocol, gathers the resulting datasets, and links to all the relevant documents related to the mouse companion paper.

Methods

GENE-PREDICTION BASED on TWO GENOMES COMPARISON

`SGP2`

SGP2 (/software/sgp2/) is a program that predicts genes by comparing anonymous genomic sequences from two different species. In this paper, predictions were made on the mouse genome (MGSCv3 assembly) using comparative information from the human genome (December 2001 Golden Path, equivalent to NCBI Build 28); both sets of sequences were taken from http://genome.ucsc.edu/. To make the predictions, SGP2 combines TBlastX (WU-BLAST version, http://blast.wustle.edu), a sequence similarity search program, with geneid (Guigó et al, 1992; Parra et al, 2000), an "ab initio" gene prediction program. The mouse sequences were cut into 100kb fragments to build the blast database. The masked human chromosomes were also cut into 100kb fragments, which were run against the mouse database using TBlastX with the following parameters:

       B=9000   V=9000  hspmax=500  topcomboN=100 
       W=5 matrix=blosum62mod  E=0.01  E2=0.01  
       Z=3000000000  nogaps  filter=xnu+seg  S2=80

Although these parameters increase the speed of the comparison, the whole computation took one week of CPU time using 100 Alpha processors. The resulting high-scoring segment pairs (HSPs) were processed to find the maximum scoring projection. Further information on the HSP modifications and on the general SGP2 algorithm can be found at: /software/sgp2/algorithm/index.html.
SGP2 was run in a mode in which RefSeq coordinates (taken from the Golden Path M. musculus February 2002 freeze) are supplied to the program and the predictions are built on top of these RefSeq genes. Chimeric predictions that include RefSeq genes are thereby avoided: SGP2 only predicted genes in the regions between known genes.
geneid has essentially no limit on the length of the input query sequence and deals well with chromosome sequences. Therefore, SGP2 was run on the entire chromosomal sequences (no fragmentation was needed). The predictions were made on the unmasked sequences of the mouse genome (MGSCv3). The computation took one day on a MOSIX cluster containing four PCs (dual Pentium III 500 MHz processors).
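
The "maximum scoring projection" step can be illustrated with a minimal Python sketch. It assumes one plausible reading of the term: overlapping HSPs are projected onto the mouse sequence axis, and every position keeps the highest score of any HSP covering it. The tuple layout and the function name max_scoring_projection are hypothetical and are not taken from the SGP2 source code; the algorithm page referenced above remains the authoritative description.

```python
from typing import List, Tuple

# Hypothetical HSP layout: (start, end, score), 1-based inclusive coordinates
# on the sequence being annotated (here, mouse).
HSP = Tuple[int, int, float]


def max_scoring_projection(hsps: List[HSP]) -> List[HSP]:
    """Collapse overlapping HSPs into non-overlapping segments.

    Each output segment carries the highest score among the HSPs that
    cover it. This is an assumed interpretation, not the SGP2 code.
    """
    # Every HSP start and every position just past an HSP end is a breakpoint.
    points = sorted({s for s, _, _ in hsps} | {e + 1 for _, e, _ in hsps})
    segments: List[HSP] = []
    for left, right in zip(points, points[1:]):
        # Scores of all HSPs spanning the elementary interval [left, right - 1].
        covering = [sc for s, e, sc in hsps if s <= left and e >= right - 1]
        if not covering:
            continue  # gap between HSPs: nothing to project
        best = max(covering)
        # Merge with the previous segment when contiguous and equally scored.
        if segments and segments[-1][1] == left - 1 and segments[-1][2] == best:
            segments[-1] = (segments[-1][0], right - 1, best)
        else:
            segments.append((left, right - 1, best))
    return segments


if __name__ == "__main__":
    example = [(1, 100, 50.0), (51, 150, 80.0), (200, 250, 30.0)]
    for segment in max_scoring_projection(example):
        print(segment)
```

The sketch scans all HSPs for every elementary interval, which is quadratic in the number of HSPs; a sweep-line version would be needed for whole-genome input.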

`TWINSCAN`

The Twinscan method is described in Korf et al, 2001, which is freely available online.
Twinscan (http://genes.cs.wustl.edu/) was run on the draft sequence of the mouse genome described in the submitted mouse genome paper and known as MGSCv3. Alignments were produced by comparison to the human assembly known as both NCBI Build 28 and the December 2001 Golden Path. This human sequence was downloaded from http://genome.ucsc.edu/ and processed as follows: all lowercase masking was converted to N masking, the resulting sequence was further masked with nseg using default parameters, all Ns were removed, and the result was cut into 150kb chunks (subject sequences) from which the Blast database was built. The mouse genome sequence was divided into 1 megabase chunks, which were used to query the human blast database with Blastn (WU-BLAST version, http://blast.wustle.edu) using the following parameters:

        M=1 N=-1 Q=5 R=1 Z=3000000000 Y=3000000000 B=10000 V=100
        W=8 X=20 S=15 S2=15 gapS2=30
        lcmask wordmask=seg wordmask=dust topcomboN=3

Twinscan was run using these alignments. The target genome parameters were identical to Genscan parameters (Human.iso) and the conservation parameters were the ones we identify as "68-set-ortholog" (available upon request).
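
The preparation of the human subject sequences described above can be sketched in Python as follows. The function names are hypothetical, the nseg masking step is only marked by a comment because it relies on an external program, and the sketch works on an in-memory string rather than on chromosome-sized files.

```python
import re
from typing import List


def mask_lowercase(seq: str) -> str:
    """Convert soft masking (lowercase bases) into N masking."""
    return re.sub(r"[a-z]", "N", seq)


def strip_n(seq: str) -> str:
    """Remove every masked position from the sequence."""
    return seq.replace("N", "")


def chunk(seq: str, size: int) -> List[str]:
    """Cut a sequence into consecutive pieces of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def prepare_subject_chunks(seq: str, size: int = 150_000) -> List[str]:
    """Lowercase -> N, (external nseg masking), drop Ns, cut into chunks."""
    masked = mask_lowercase(seq)
    # The nseg low-complexity masking described in the text would run here;
    # it is an external program and is not reproduced in this sketch.
    return chunk(strip_n(masked), size)


if __name__ == "__main__":
    # Tiny chunk size so the effect is visible on a toy sequence.
    print(prepare_subject_chunks("ACGTacgtNNACGT", size=4))
```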

FILTERING BEST CANDIDATES

In this section we describe the protocol used to generate the set of SGP2 and Twinscan mouse predictions tested by RT-PCR. The goal of this approach is to obtain a more reliable set of predicted genes. The procedure, performed independently for SGP2 and Twinscan, consists of the following steps:

Identify human mouse orthologous pairs.
Mouse predicted amino acid sequences were compared with the human predicted amino acid sequences using Blastp (Altschul et al, 1997). Orthologous pairs were assigned where sequence pairs were aligned with Expect values lower than 1x10e-6.
Discard known genes.
Mouse predictions overlapping ENSEMBL gene coordinates were considered not novel. We used the preliminary ENSEMBL annotation generated with the RIKEN cDNA database. Moreover, to ensure the novelty classification, mouse predictions were compared against the RefSeq mRNA and ENSEMBL databases using Blastn, and predictions sharing more than 95% nucleotide identity over at least 100bp were rejected (considered not novel).
Keep only those pairs for which there is one intron position aligned over the pairwise amino acid alignment.
A global alignment was produced for every hypothetical orthologous pair. T-coffee (http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html) was run with default parameters to align each pair of amino acid sequences. Then, the exonic structure was added to the global alignment using exstral.pl (/~rcastelo/exstral.html). This program computes the relative position of the intron boundaries in aligned pairs of sequences. We considered an intron to be aligned when its boundaries appear at the same coordinates of the alignment in both sequences and the alignment has at least 50% conserved residues on both sides of the aligned intron.
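
The aligned-intron criterion of the last step can be sketched in Python. This is not the exstral.pl implementation. The sketch assumes that each intron is given as the number of amino acid residues preceding it (intron phase is ignored), that "conserved" means identical residues, and that conservation is measured in a flanking window of 10 alignment columns; the window size and all function names are hypothetical.

```python
from typing import List, Set


def intron_boundaries(aligned: str, intron_positions: List[int]) -> Set[int]:
    """Map intron positions (residues preceding the intron) to alignment
    columns. A returned value b means the intron lies between 0-based
    columns b - 1 and b of the gapped alignment."""
    wanted = set(intron_positions)
    boundaries: Set[int] = set()
    residues = 0
    for col, ch in enumerate(aligned):
        if ch != "-":
            residues += 1
            if residues in wanted:
                boundaries.add(col + 1)
    return boundaries


def conserved_fraction(a: str, b: str) -> float:
    """Fraction of columns with identical, non-gap residues."""
    if not a:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)


def aligned_introns(aln_a: str, aln_b: str,
                    introns_a: List[int], introns_b: List[int],
                    window: int = 10, min_conserved: float = 0.5) -> List[int]:
    """Return boundaries shared by both sequences whose left and right
    flanking windows both reach the conservation threshold."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    shared = intron_boundaries(aln_a, introns_a) & intron_boundaries(aln_b, introns_b)
    kept = []
    for b in sorted(shared):
        lo = max(0, b - window)
        left = conserved_fraction(aln_a[lo:b], aln_b[lo:b])
        right = conserved_fraction(aln_a[b:b + window], aln_b[b:b + window])
        if left >= min_conserved and right >= min_conserved:
            kept.append(b)
    return kept


if __name__ == "__main__":
    # Toy alignment: a gap in the first sequence shifts its residue numbering.
    print(aligned_introns("MK-TAYIAKQR", "MKSTAYIAKQR", [5], [6], window=4))
```

A pair of predictions would pass this filter when the returned list is non-empty.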

A random subset of predictions was extracted from each set (SGP2 and Twinscan), and two exons adjacent across an intron were chosen from each selected prediction for the RT-PCR test. The experimental setup required that the exons be at least 30bp long and the intron at least 1000bp long. Pairs of exons meeting these requirements were sorted by the sum of the exon scores, and the top-scoring pair was selected for the RT-PCR test.
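
The exon-pair selection rule can be sketched in Python as follows. The Exon class, its fields and the function name are hypothetical; exons are assumed to be given in forward-strand coordinates with one score each, as a gene predictor would report them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Exon:
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # exon score reported by the gene predictor

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def best_exon_pair(exons: List[Exon], min_exon: int = 30,
                   min_intron: int = 1000) -> Optional[Tuple[Exon, Exon]]:
    """Return the adjacent exon pair with the highest summed score among
    pairs whose exons and intervening intron meet the length limits."""
    ordered = sorted(exons, key=lambda e: e.start)
    candidates = []
    for upstream, downstream in zip(ordered, ordered[1:]):
        intron_length = downstream.start - upstream.end - 1
        if (upstream.length >= min_exon
                and downstream.length >= min_exon
                and intron_length >= min_intron):
            candidates.append((upstream, downstream))
    if not candidates:
        return None  # no pair satisfies the experimental constraints
    return max(candidates, key=lambda pair: pair[0].score + pair[1].score)


if __name__ == "__main__":
    gene = [Exon(100, 199, 5.0), Exon(1500, 1600, 7.0),
            Exon(1700, 1800, 9.0), Exon(3000, 3020, 20.0)]
    print(best_exon_pair(gene))
```

In the toy gene, the high-scoring last exon is not selected because it is shorter than 30bp, and the second and third exons are separated by an intron shorter than 1000bp.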

RT-PCR VALIDATION of GENE-PREDICTIONS

The expression of a subset of the mouse gene models of the HC21 genes was tested by RT-PCR. Total RNA derived from 12 different normal mouse adult tissues (brain, heart, kidney, thymus, liver, stomach, muscle, lung, testis, ovary, skin and eyes) was extracted, retro-transcribed and normalized as previously described (Reymond et al, 2002). The quality of total RNA was tested by PCR using MLH1 primers located at intronic sequences flanking exon 12 (Forward - 5' TGG TGT CTC TAG TTC TGG 3' and Reverse - 5' CAT TGT TGT AGT AGC TCT GC 3'), as an indicator of possible genomic DNA contamination. Primers for RT-PCR were designed with the Primer3 program (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi). Primers were designed from the sequence of distinct exons so that the possible amplification of genomic DNA could be distinguished from cDNA amplification. We chose a single PCR rather than a nested-PCR approach to avoid false positive results due to illegitimate transcription (Kaplan et al, 1992). Similar amounts of the 12 cDNAs (final dilution 250x) were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl of each of the primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50ºC; the annealing temperature of the next 35 cycles was 50ºC. Amplimers were separated on "Ready to Run" precast gels (Pharmacia). Positives were directly sequenced.

HOMOLOGY SEARCH for PREDICTED PROTEINS

Predicted amino acid sequences were compared with a non-redundant protein sequence database (ftp://ftp.ncbi.nih.gov/blast/db/nr) using Blastp (Altschul et al, 1997). Homologues were assigned where sequence pairs were aligned with Expect values less than 5x10e-3. These assignments were augmented by further TBlastn sequence comparisons with expressed sequence tag databases (in particular, ftp://ftp.ncbi.nih.gov/blast/db/est_others).

Datasets

WHOLE GENOME PREDICTIONS

This section summarizes all the gene predictions on the mouse genome obtained from its comparison against the human genome. We also include the reciprocal, mouse-based gene predictions on the human genome.

`SGP2`

Results on M. musculus:

Based on H.sapiens Golden Path assembly (December 22, 2001).

Based on H.sapiens Golden Path assembly (December 22, 2001)
and using RefSeq annotation as evidences.

Results on H. sapiens:

Based on M.musculus MGSC version-3 assembly (February, 2002).

Based on M.musculus Sanger Phusion assembly (November 9th, 2001).

`TWINSCAN`

Results on M. musculus: The results are available at http://genes.cs.wustl.edu/mouse .

Results on H. sapiens: Twinscan was also run on the human sequence in exactly the same way except that repeats were not removed from the mouse blast database. The results are available at http://genes.cs.wustl.edu/human.

RT-PCR POOLS

In this section you can access supplementary data for the fraction of genes selected for the RT-PCR experiment, collected in different files linked from the table below. The figure of 1019 novel genes given in the paper is an extrapolation from the success rates observed in a random sample from each pool. Similarly, the total number of non-redundant predictions in each pool is not the direct sum of the number of predictions by SGP2 and Twinscan, because SGP2 and Twinscan often predict overlapping genes: these have been counted only once in the pools given in the paper. The following table includes the sequences and coordinates of all the SGP2 and Twinscan genes, and a table with the corresponding pairs of overlapping genes.

Pool	Whole Prediction Set	RT-PCR Tested Set	RT-PCR Positive Set

Enriched	1428	214	133
Similar	2125	38	4
Other	3659	63	2

Files available for each pool:
Whole Prediction Set: Protein, CDS, Coordinates, Table.
RT-PCR Tested Set and RT-PCR Positive Set: Protein, CDS, Coordinates, Selected exon, Selected exon coords.

A table for the whole pool of RT-PCR positive genes is also available. It contains their Geneva Code and the corresponding Gene-Prediction Identifier, together with a link to each gene's Summary Datasheet.

PREDICTED GENES BEST CANDIDATES

The following links will open a browser window displaying a summary table with all the 476 genes that were submitted for the RT-PCR validation test, classified by mouse chromosome:

Chr1	Chr2	Chr3	Chr4	Chr5
Chr6	Chr7	Chr8	Chr9	Chr10
Chr11	Chr12	Chr13	Chr14	Chr15
Chr16	Chr17	Chr18	Chr19	ChrX
		ChrUN

References

	Altschul et al, 1997		Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17):3389-3402, 1997.
	Flicek et al, 2002		Flicek P, Keibler E, Hu P, Korf I and Brent MR. Leveraging the Mouse Genome for Gene Prediction in Human: From Whole-Genome Shotgun Reads to a Global Synteny Map. Genome Research 13(1):46-54, 2003.
	Guigó et al, 1992		Guigó R, Knudsen S, Drake N and Smith T. Prediction of gene structure. Journal of Molecular Biology 226:141-157, 1992.
	Kaplan et al, 1992		Kaplan JC, Kahn A and Chelly J. Illegitimate transcription: its use in the study of inherited disease. Human Mutation 1(5):357-360, 1992.
	Korf et al, 2001		Korf I, Flicek P, Duan D and Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics 17(Suppl 1):S140-148, 2001.
	Parra et al, 2000		Parra G, Blanco E and Guigó R. GeneID in Drosophila. Genome Research 10(4):511-515, 2000.
	Parra et al, 2002		Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW and Guigó R. Comparative gene prediction in human and mouse. Genome Research 13(1):108-117, 2003.
	Reymond et al, 2002		Reymond A, Marigo V, Yaylaoglu MB, Leoni A, Ucla C, Scamuffa N, Caccioppoli C, Dermitzakis ET, Lyle R, Banfi S, Eichele G, Antonarakis SE and Ballabio A. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915):578-582, 2002.
