SUPPLEMENTARY MATERIALS FOR
Comparison of mouse and human genomes
followed by experimental verification
yields an estimated 1,019 additional genes
R. Guigó, E. T. Dermitzakis, P. Agarwal, C. P. Ponting,
G. Parra, A. Reymond, J. F. Abril, E. Keibler,
R. Lyle, C. Ucla, S. E. Antonarakis, and Michael R. Brent *.
PNAS, 100(3):1140-1145 (Feb 4, 2003)
* To whom correspondence should be addressed.
Email: brent@cs.wustl.edu. Ph: +01 314-935-6621.
Summary
A primary motivation for sequencing the mouse genome was to accelerate the discovery of mammalian genes by using sequence conservation between mouse and human to identify coding exons. This proved challenging due to the large proportion of the mouse and human genomes that is apparently conserved but not protein-coding. We developed two programs, SGP2 (Parra et al, 2002) and Twinscan (Korf et al, 2001; Flicek et al, 2002), which can exploit sequence conservation between genomes to identify candidate genes despite the abundance of conserved non-coding sequence. We also developed an enrichment process that selects a subset of highly reliable candidates by exploiting conservation in mouse-human exonic structure. RT-PCR amplification and direct sequencing applied to an initial sample of the predictions that do not overlap previously known genes verified 139 predictions. On average, the confirmed predictions show more restricted expression patterns than the mouse orthologues of the genes on human chromosome 21, and the majority lack both aligned mouse EST sequences and homologues in the fish genomes, demonstrating the sensitivity of SGP2 and Twinscan to hard-to-find genes. We verified 68 novel homologues of known proteins, including two homeobox proteins relevant to developmental biology and an aquaporin. We estimate that 1000 gene predictions that do not overlap known genes can be verified by this method. This is likely to constitute a significant fraction of the previously unknown, multi-exon, mammalian genes.
This page summarizes the methodology applied at each step of the described protocol, gathers the resulting datasets, and links to all the relevant documents related to the mouse companion paper.
Methods
GENE-PREDICTION BASED on TWO GENOMES COMPARISON
SGP2
SGP2 (/software/sgp2/) is a program to predict
genes by comparing anonymous genomic sequences from two different
species. In this paper, predictions were made on the mouse genome
(MGSCv3 assembly) using comparative information from the human genome
(December 2001 GoldenPath, equivalent to NCBI Build 28), with both sets of
sequences taken from http://genome.ucsc.edu/. To make
the predictions, SGP2 combines TBlastX
(WU-BLAST version,
http://blast.wustl.edu), a sequence similarity search program,
with geneid (Guigó et al, 1992;
Parra et al, 2000), an "ab initio" gene
prediction program. The mouse sequences were cut into 100kb fragments
to build the blast database. The masked human chromosomes were also
cut into 100kb fragments, which were run against the mouse database using
TBlastX with the following parameters:
B=9000 V=9000 hspmax=500 topcomboN=100
W=5 matrix=blosum62mod E=0.01 E2=0.01
Z=3000000000 nogaps filter=xnu+seg S2=80
Although these parameters increase the speed of the comparison, the
whole computation took one week of CPU time using 100 Alpha
processors. The resulting high-scoring segment pairs (HSPs) were
processed to find the maximum scoring projection. Further information
on the HSP modifications and on the general SGP2 algorithm
can be found at:
/software/sgp2/algorithm/index.html.
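The fragmentation step described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names are hypothetical, and it assumes consecutive, non-overlapping fragments, which the text does not state explicitly. Each fragment keeps its offset so that hit coordinates can be mapped back to the chromosome.

```python
def fragment_sequence(seq, size=100_000):
    """Split seq into consecutive, non-overlapping fragments of at most `size` bases.

    Returns (offset, fragment) pairs, where offset is the 0-based position
    of the fragment's first base in the original sequence.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [(start, seq[start:start + size]) for start in range(0, len(seq), size)]


def to_chromosome_coords(offset, hit_start, hit_end):
    """Map 1-based hit coordinates inside a fragment to 1-based chromosome coordinates."""
    return offset + hit_start, offset + hit_end
```

A hit reported at positions 1 to 4 of the fragment that starts at offset 100000 would map to chromosome positions 100001 to 100004.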
SGP2 was used in a mode in which RefSeq coordinates (taken
from the
Golden Path M. musculus February 2002 freeze)
are given to SGP2, and the predictions are built on top of
these RefSeqs. Chimeric predictions that include RefSeq genes are
thereby avoided: SGP2 only predicted genes in the regions between
known genes.
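The idea of restricting predictions to the regions between known genes can be illustrated with a small interval computation. This is a sketch under stated assumptions, not SGP2's internal implementation: `intergenic_regions` is a hypothetical helper, and coordinates are assumed to be 1-based and inclusive.

```python
def intergenic_regions(chrom_length, genes):
    """Return 1-based inclusive (start, end) intervals not covered by any gene.

    `genes` is an iterable of 1-based inclusive (start, end) pairs, in any
    order and possibly overlapping.
    """
    regions = []
    next_free = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > next_free:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= chrom_length:
        regions.append((next_free, chrom_length))
    return regions
```

Overlapping known genes are handled by tracking the furthest covered position, so two overlapping RefSeq entries never produce a spurious gap between them.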
geneid has essentially no limit on the length of the input
query sequence, and deals well with chromosome sequences. Therefore,
SGP2 was run on the entire chromosome sequences (no
fragmentation was needed). The predictions were made on the unmasked
sequences of the mouse genome (MGSCv3). The computation took one day
on a MOSIX cluster containing four PCs (dual Pentium III 500 MHz
processors).
TWINSCAN
The Twinscan method is described in Korf et al, 2001. This paper is freely available online.
Twinscan (http://genes.cs.wustl.edu/) was run on the draft sequence of the mouse genome
described in the submitted mouse genome paper and known as
MGSCv3. Alignments were produced by comparison to the human assembly
known as both NCBI Build 28 and the December, 2001 Golden Path. This
human sequence was downloaded from http://genome.ucsc.edu/ , all
lowercase masking was converted to N masking, the resulting sequence
was further masked with nseg using default parameters, all Ns were
removed, and the result was cut into 150kb chunks (subject sequences)
from which the Blast database was built. The mouse genome sequence was
divided into 1 megabase chunks which were used to query the human
blast database with Blastn (WU-BLAST version, http://blast.wustl.edu) using the
following parameters:
M=1 N=-1 Q=5 R=1 Z=3000000000 Y=3000000000 B=10000 V=100
W=8 X=20 S=15 S2=15 gapS2=30
lcmask wordmask=seg wordmask=dust topcomboN=3
Twinscan was run using these alignments. The target genome
parameters were identical to Genscan parameters
(Human.iso) and the conservation parameters were the ones we
identify as "68-set-ortholog" (available upon request).
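The preparation of the human subject sequences described above (lowercase masking converted to N, Ns removed, result cut into 150kb chunks) can be sketched as below. The function names are hypothetical. The nseg masking step is an external program and is only marked with a comment, not reproduced.

```python
import re


def lowercase_to_n(seq):
    """Replace every lowercase (soft-masked) base with 'N'."""
    return re.sub(r"[a-z]", "N", seq)


def strip_n(seq):
    """Remove all N characters from the sequence."""
    return seq.replace("N", "")


def chunk(seq, size=150_000):
    """Cut seq into consecutive chunks of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def prepare_subject_chunks(seq, size=150_000):
    """Apply the masking, N-removal and chunking steps in order."""
    masked = lowercase_to_n(seq)
    # The nseg low-complexity masking described in the text would run here,
    # on `masked`; it is an external tool and is omitted from this sketch.
    return chunk(strip_n(masked), size)
```

Removing Ns shifts the coordinates of everything downstream of a masked region. The text does not say how coordinates were tracked, so this sketch makes no attempt to do so.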
FILTERING BEST CANDIDATES
In this section we describe the protocol used to generate a set of
SGP2 and Twinscan mouse predictions to be tested by
RT-PCR experiments. The goal of this approach is to generate a more
reliable set of predicted genes. The procedure, applied independently
to the SGP2 and Twinscan predictions, consists of
the following steps:
- Identify human-mouse orthologous pairs.
Mouse predicted amino acid sequences were compared with the human
predicted amino acid sequences using Blastp
(Altschul et al, 1997). Orthologous pairs
were assigned where sequence pairs were aligned with Expect values
lower than 1x10^-6.
- Discard known genes.
Mouse predictions overlapping ENSEMBL gene coordinates were considered
not novel. We used the preliminary ENSEMBL annotation generated with
the RIKEN cDNA database. Moreover, to confirm the novelty classification,
mouse predictions were compared against the RefSeq mRNA and ENSEMBL
databases using Blastn, and predictions sharing more than 95%
nucleotide identity over at least 100bp were rejected (considered not
novel).
- Keep only those pairs for which an intron position is
aligned in the pairwise amino acid alignment.
A global alignment
was produced for every hypothetical orthologous pair. T-coffee (http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html) was run with default parameters to align each pair of amino acid
sequences. Then, the exonic structure was added to the global
alignment using exstral.pl (/~rcastelo/exstral.html). This program computes
the relative position of the intron boundaries in aligned pairs of
sequences. We considered an intron to be aligned when its boundaries
appear at the same coordinates of the alignment in both
sequences and the alignment has at least 50% conserved residues on both
sides of the aligned intron.
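The intron-alignment criterion in the last step can be illustrated with the following sketch. It is not exstral.pl and makes several assumptions that the text leaves open: an intron is represented by the number of residues that precede it in the ungapped protein, "conserved" is taken to mean identical residues, and conservation is measured in a flanking window whose size (`flank`) is a hypothetical parameter.

```python
def intron_columns(aligned, intron_positions):
    """Map intron positions to alignment columns.

    `intron_positions` holds, for each intron, the number of residues that
    precede it in the ungapped protein. The returned column is the index of
    the alignment column holding the last residue before the intron.
    """
    wanted = set(intron_positions)
    columns = set()
    residues_seen = 0
    for col, char in enumerate(aligned):
        if char != "-":
            residues_seen += 1
            if residues_seen in wanted:
                columns.add(col)
    return columns


def fraction_identical(a, b):
    """Fraction of columns in which both sequences carry the same residue."""
    if not a:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)


def aligned_introns(aln_a, aln_b, introns_a, introns_b, flank=10, min_conserved=0.5):
    """Return alignment columns where introns coincide and both flanks are conserved."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    shared = intron_columns(aln_a, introns_a) & intron_columns(aln_b, introns_b)
    kept = []
    for col in sorted(shared):
        left = slice(max(0, col + 1 - flank), col + 1)   # columns up to the intron
        right = slice(col + 1, col + 1 + flank)          # columns after the intron
        if (fraction_identical(aln_a[left], aln_b[left]) >= min_conserved
                and fraction_identical(aln_a[right], aln_b[right]) >= min_conserved):
            kept.append(col)
    return kept
```

Under these assumptions, a pair of predictions would pass the filter when `aligned_introns` returns a non-empty list.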
A random subset of predictions was extracted from each set
(SGP2 and Twinscan), and two adjacent exons flanking
an intron were chosen from each selected prediction for the RT-PCR
test. The experimental setup required that the exons be at least
30bp long and the intron at least 1000bp long. Pairs of exons
meeting these requirements were sorted by the sum of the scores of
the exons, and the top-scoring pair was selected for the RT-PCR test.
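The exon-pair selection rule above can be sketched as follows. The `Exon` type and `best_exon_pair` function are hypothetical names, and coordinates are assumed to be 1-based, inclusive genomic positions on the forward strand.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # exon score assigned by the gene predictor


def best_exon_pair(exons, min_exon=30, min_intron=1000):
    """Pick the adjacent exon pair with the highest summed score.

    Only pairs in which both exons are at least `min_exon` bp long and the
    intervening intron is at least `min_intron` bp long are considered.
    Returns None when no pair qualifies.
    """
    ordered = sorted(exons, key=lambda e: e.start)
    candidates = []
    for left, right in zip(ordered, ordered[1:]):
        intron_len = right.start - left.end - 1
        exons_ok = all(e.end - e.start + 1 >= min_exon for e in (left, right))
        if exons_ok and intron_len >= min_intron:
            candidates.append((left, right))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0].score + pair[1].score)
```

A prediction for which `best_exon_pair` returns None has no exon pair that satisfies the experimental constraints and could not be tested with this design.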
RT-PCR VALIDATION of GENE-PREDICTIONS
The expression of a subset of the mouse gene
models of the HC21 genes was tested by RT-PCR. Total RNA derived from
12 different normal mouse adult tissues (brain, heart, kidney, thymus,
liver, stomach, muscle, lung, testis, ovary, skin and eyes) was
extracted, reverse-transcribed and normalized as previously described
(Reymond et al, 2002). The quality of total RNA was tested by PCR
using MLH1 primers located at intronic sequences flanking exon 12
(Forward - 5' TGG TGT CTC TAG TTC TGG 3' and Reverse - 5' CAT TGT TGT
AGT AGC TCT GC 3'), as an indicator of possible genomic DNA
contamination. Primers for RT-PCR were designed with the Primer3 program
(http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi). Primers
were designed from the sequence of distinct exons so that the possible
amplification of genomic DNA could be distinguished from cDNA
amplification. We chose a single PCR rather than a nested-PCR approach
to avoid false positive results due to illegitimate transcription
(Kaplan et al, 1992). Similar amounts of the 12
cDNAs (final dilution 250x)
were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl of each
of the primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The
first ten cycles of PCR amplification were performed with touchdown
annealing temperatures decreasing from 60 to 50°C; the next 35 cycles
used an annealing temperature of 50°C. Amplimers
were separated on "Ready to Run" precast gels (Pharmacia). Positives
were directly sequenced.
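The annealing-temperature profile of the touchdown PCR described above can be written out per cycle. The text gives only the start and end temperatures and the cycle counts. The even decrement per touchdown cycle used here is an assumption, and `annealing_schedule` is a hypothetical name.

```python
def annealing_schedule(start=60.0, end=50.0, touchdown_cycles=10, plateau_cycles=35):
    """Return the annealing temperature (degrees C) for each PCR cycle.

    Assumes an even decrement per touchdown cycle (an assumption, not stated
    in the protocol), followed by a plateau at the end temperature.
    """
    step = (start - end) / touchdown_cycles
    touchdown = [start - i * step for i in range(touchdown_cycles)]
    return touchdown + [end] * plateau_cycles
```

With the default values this gives 45 cycles in total: ten touchdown cycles followed by 35 cycles at 50°C.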
HOMOLOGY SEARCH for PREDICTED PROTEINS
Predicted amino acid sequences were compared with a non-redundant
protein sequence database
(ftp://ftp.ncbi.nih.gov/blast/db/nr) using Blastp (Altschul et al, 1997). Homologues were assigned
where sequence pairs were aligned with Expect values less than
5x10^-3. These assignments were augmented by further
TBlastn sequence comparisons with expressed sequence tag
databases (in particular,
ftp://ftp.ncbi.nih.gov/blast/db/est_others).
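Assigning homologues by an Expect-value cutoff amounts to a simple filter over the search hits. The sketch below assumes the hits have already been parsed into (query, subject, E-value) tuples. The parsing itself and the name `homologues_by_query` are not part of the original protocol.

```python
from collections import defaultdict


def homologues_by_query(hits, max_evalue=5e-3):
    """Group subjects by query, keeping only hits with E-value below the cutoff.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples. Subjects
    are returned per query, ordered from most to least significant.
    """
    grouped = defaultdict(list)
    for query, subject, evalue in hits:
        if evalue < max_evalue:
            grouped[query].append((evalue, subject))
    return {q: [s for _, s in sorted(pairs)] for q, pairs in grouped.items()}
```

The same filter with `max_evalue=1e-6` corresponds to the stricter cutoff used when assigning human-mouse orthologous pairs.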
Datasets
WHOLE GENOME PREDICTIONS
This section summarizes all the gene predictions on the mouse genome obtained from its comparison against the human genome. We include the reciprocal human mouse-based gene predictions.
SGP2
- Results on M. musculus:
- Results on H. sapiens:
TWINSCAN
RT-PCR POOLS
In this section you can access supplementary
data for a fraction of the genes selected for the RT-PCR experiment,
collected in different files linked from the table below. The number
of 1019 novel genes given in the paper is an extrapolation from the
success rates observed in a random sample from each pool. Similarly,
the total number of non-redundant predictions in each pool is not the
direct sum of the number of predictions by SGP2 and Twinscan, because
SGP2 and Twinscan often predict overlapping genes: these have been
counted only once in the pools given in the paper. In the following
table we have included the sequences and coordinates of all the SGP2
and Twinscan genes and a table with the corresponding pairs of
overlapping genes.
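The two counting rules in this paragraph can be sketched as follows. Both functions are illustrative and use hypothetical names. Treating a group of mutually overlapping predictions as a single gene (connected components over the overlap table) is one reasonable reading of "counted only once", not a documented procedure. The per-pool extrapolation simply scales each pool by its observed success rate.

```python
def count_non_redundant(sgp2_ids, twinscan_ids, overlapping_pairs):
    """Count predictions, merging each group of overlapping predictions into one.

    `overlapping_pairs` is an iterable of (sgp2_id, twinscan_id) pairs taken
    from the overlap table.
    """
    parent = {("sgp2", i): ("sgp2", i) for i in sgp2_ids}
    parent.update({("twinscan", i): ("twinscan", i) for i in twinscan_ids})

    def find(node):
        # Follow parent links to the group representative (with path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for s, t in overlapping_pairs:
        parent[find(("sgp2", s))] = find(("twinscan", t))
    return len({find(node) for node in parent})


def extrapolate_verified(pools):
    """Estimate verifiable predictions from per-pool samples.

    `pools` is an iterable of (pool_size, n_tested, n_positive) tuples; each
    pool contributes pool_size * n_positive / n_tested to the estimate.
    """
    return sum(size * positive / tested for size, tested, positive in pools)
```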
A table for the whole pool of RT-PCR positive genes is also available. It contains their Geneva Code and the corresponding Gene-Prediction Identifier, together with a link to each gene's Summary Datasheet.
PREDICTED GENES BEST CANDIDATES
The following links will open a browser window displaying a summary table with all the 476 genes that were submitted for the RT-PCR validation test, classified by mouse chromosome:
References
Altschul et al, 1997
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res 25(17):3389-402, Sep 1, 1997.

Flicek et al, 2002
Flicek P, Keibler E, Hu P, Korf I and Brent MR.
Leveraging the Mouse Genome for Gene Prediction in Human: From Whole-Genome Shotgun Reads to a Global Synteny Map.
Genome Research 13(1):46-54, 2003.

Guigó et al, 1992
Guigó R, Knudsen S, Drake N and Smith T.
Prediction of gene structure.
Journal of Molecular Biology 226:141-157, 1992.

Kaplan et al, 1992
Kaplan JC, Kahn A and Chelly J.
Illegitimate transcription: its use in the study of inherited disease.
Human Mutation 1(5):357-360, 1992.

Korf et al, 2001
Korf I, Flicek P, Duan D and Brent MR.
Integrating genomic homology into gene structure prediction.
Bioinformatics 17(Suppl 1): S140-148, 2001.

Parra et al, 2000
Parra G, Blanco E and Guigó R.
GeneID in Drosophila.
Genome Research 10(4):511-515, 2000.

Parra et al, 2002
Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW and Guigó R.
Comparative gene prediction in human and mouse.
Genome Research 13(1):108-117, 2003.

Reymond et al, 2002
Reymond A, Marigo V, Yaylaoglu MB, Leoni A, Ucla C, Scamuffa N, Caccioppoli C, Dermitzakis ET, Lyle R, Banfi S, Eichele G, Antonarakis SE and Ballabio A.
Human chromosome 21 gene expression atlas in the mouse.
Nature 420(6915):578-582, 2002.