Gene Prediction Software: geneid
The group is involved in the ongoing development of the gene
prediction program geneid. geneid (Guigó et al., 1992) was
one of the first programs to predict full exonic structures of
vertebrate genes in anonymous DNA sequences. geneid was designed with
a hierarchical structure: first, gene-defining signals (splice sites,
start and stop codons) were predicted along the query DNA
sequence. Next, potential exons were constructed from these sites, and
finally the optimal scoring gene prediction was assembled from the
exons. In the original geneid the scoring function to optimize was
rather heuristic: the sequence sites were predicted and scored using
Position Weight Matrices (PWMs), a number of coding statistics were computed
on the predicted exons, and each exon was scored as a function of the
scores of the exon-defining sites and of the coding statistics. A
neural network was used to estimate the coefficients of this function.
An exhaustive search of the space of possible gene assemblies was
performed to rank predicted genes according to a score obtained
through a complex function of the scores of the assembled exons.
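
The signal-to-exon step of this hierarchy can be illustrated with a minimal sketch. It is not geneid's code: the function names (`find_sites`, `internal_exons`) and the `min_len` parameter are hypothetical, signals are reduced to the bare AG (acceptor) and GT (donor) dinucleotides with no scoring, and only internal exons on the forward strand are built.

```python
from typing import List, Tuple


def find_sites(seq: str, motif: str) -> List[int]:
    """Return the 0-based start position of every occurrence of motif."""
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def internal_exons(seq: str, min_len: int = 3) -> List[Tuple[int, int]]:
    """Build candidate internal exons from predicted splice sites.

    Step 1 (signals): locate acceptor (AG) and donor (GT) dinucleotides.
    Step 2 (exons): pair every acceptor with every downstream donor.

    Each exon is a half-open interval [start, end): it starts right after
    the AG and ends right before the GT. Simplification: in-frame stop
    codons and reading frames are not checked here.
    """
    acceptors = find_sites(seq, "AG")
    donors = find_sites(seq, "GT")
    exons = []
    for a in acceptors:
        start = a + 2  # first base after the acceptor dinucleotide
        for d in donors:
            if d - start >= min_len:
                exons.append((start, d))
    return exons


if __name__ == "__main__":
    toy = "CCAGATGCCCGTAA"
    for start, end in internal_exons(toy):
        print(start, end, toy[start:end])
```

In a real predictor each site and each exon would also carry a score, which the final assembly step uses to choose among the many overlapping candidates.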
During the nineties geneid had some usage, mostly through a now
nonfunctional e-mail server at Boston University (geneid@darwin.bu.edu)
and through a WWW server at the Institut Municipal d'Investigació Mèdica
(/geneid.html).
During this period, however, there
have been substantial developments in the field of computational gene
identification, and the original geneid had become clearly inferior to
other existing tools. Therefore, some time ago we started developing
an improved version of the geneid program, at least as accurate as
other existing tools, but much more efficient at handling very large
genomic sequences, in terms of both speed and memory usage.

[Figure: geneid prediction on the Adh region of the fly genome compared with the actual gene structure of the region.]

This new version maintains the hierarchical structure (signal to exon
to gene) of the original geneid, but we have simplified
the scoring schema and given it a probabilistic meaning:
scores for both exon-defining signals and protein-coding potential
are computed as log-likelihood ratios, which for a given predicted
exon are summed into the exon score, which is consequently also a
log-likelihood ratio. Then, a dynamic programming algorithm
(Guigó, 1998) is used to search the space of predicted exons and
assemble the gene structure (in the general case, multiple genes on
both strands) that maximizes the sum of the scores of the assembled exons,
which can also be assumed to be a log-likelihood ratio.
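
The additive scoring and the assembly step can be sketched as follows. This is a simplified stand-in, not the algorithm of Guigó (1998): it chains non-overlapping exons on a single strand with a standard weighted-interval dynamic program, ignores reading-frame compatibility and exon types, and runs in O(n log n) rather than linear time. The `Exon` fields and the function `best_chain` are hypothetical names chosen for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int         # inclusive
    end: int           # exclusive
    site_llr: float    # summed log-likelihood ratios of the defining signals
    coding_llr: float  # log-likelihood ratio of the coding potential

    @property
    def score(self) -> float:
        # Exon score is the sum of its component log-likelihood ratios.
        return self.site_llr + self.coding_llr


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximum-score set of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)    # best[i]: best score using the first i exons
    take = [False] * (n + 1)  # whether exon i is part of best[i]
    prev = [0] * (n + 1)      # index to continue from when exon i is taken
    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end at or before this exon's start.
        p = bisect_right(ends, exon.start, 0, i - 1)
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]
    # Trace back the chosen exons.
    chain = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain
```

Because exon scores are log-likelihood ratios, an exon with a negative score is never added by this sketch: including it would lower the total, which is the quantity being maximized.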
Execution time in this new version of geneid grows
linearly with the size of the input sequence, currently at about two
megabases per minute on a Pentium III (500 MHz) running Linux. The
amount of memory required is also linear in the length of the
sequence: about one megabyte per megabase, plus a constant overhead of
about 15 megabytes that does not depend on the length of the sequence. In
practice, therefore, geneid is able to analyze sequences of
virtually any length, for instance chromosome-size sequences.
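
As a rough worked example of these figures, the small helper below turns a sequence length into an estimated run time and memory footprint. The function name and its parameters are hypothetical; the default constants are simply the approximate values quoted above for the stated hardware, so the output is a back-of-the-envelope estimate only.

```python
from typing import Tuple


def estimate_resources(
    length_bp: int,
    megabases_per_minute: float = 2.0,   # approximate throughput quoted above
    mbytes_per_megabase: float = 1.0,    # approximate memory growth quoted above
    fixed_mbytes: float = 15.0,          # approximate constant overhead quoted above
) -> Tuple[float, float]:
    """Return (minutes, megabytes) estimated for a sequence of length_bp bases."""
    megabases = length_bp / 1_000_000
    minutes = megabases / megabases_per_minute
    memory = fixed_mbytes + mbytes_per_megabase * megabases
    return minutes, memory


if __name__ == "__main__":
    minutes, memory = estimate_resources(50_000_000)
    print(f"{minutes:.1f} min, {memory:.1f} MB")
```

Under these assumptions a 50-megabase sequence would take about 25 minutes and about 65 megabytes of memory.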
This new version was initially trained to predict genes in the genome
sequence of Drosophila melanogaster (Parra et al., 2000), but
versions currently exist for human, Dictyostelium discoideum,
Fugu rubripes and Tetraodon nigroviridis.
geneid is at the core of the developments in our group to
predict selenoprotein genes, and for comparative gene prediction.
- E. Blanco, G. Parra and R. Guigó.
"Using geneid to Identify Genes."
In A. D. Baxevanis and D. B. Davison, chief editors:
Current Protocols in Bioinformatics. Volume 1, Unit 4.3.
John Wiley & Sons Inc., New York, 2002. ISBN: 0-471-25093-7.
- G. Glöckner, L. Eichinger, K. Szafranski, J.A. Pachebat, A.T. Bankier, P.H. Dear, R. Lehmann, C. Baumgart, G. Parra, J.F. Abril, R. Guigó, K. Kumpf, B. Tunggal, the Dictyostelium Genome Sequencing Consortium, E. Cox, M.A. Quail, M. Platzer, A. Rosenthal and A.A. Noegel.
"Sequence and Analysis of Chromosome 2 of Dictyostelium discoideum."
Nature 418(6893):79-85 (2002)
- G. Parra, E. Blanco, and R. Guigó.
"Geneid in Drosophila."
Genome Research 10(4):511-515 (2000)
- R. Guigó, M. Burset, P. Agarwal, J.F. Abril, R.F. Smith and J.W. Fickett.
"Sequence Similarity Based Gene Prediction."
In S. Suhai, editor:
Genomics and Proteomics: Functional and Computational Aspects.
Plenum Publishing Corporation, 2000.
- R. Guigó.
"DNA composition, codon usage and exon prediction."
In M. Bishop, editor:
Genetic Databases. Pp:53-80.
Academic Press, 1999.
- R. Guigó.
"Assembling genes from predicted exons in linear time with dynamic programming."
Journal of Computational Biology, 5:681-702 (1998)
- R. Guigó.
"Computational gene identification."
Journal of Molecular Medicine, 75:389-393 (1997)
- J. W. Fickett and R. Guigó.
"Computational gene identification."
In S.R. Swindell, R.R. Miller and G. Myers, editors:
Internet for the Molecular Biologist. Pp:73-100.
Horizon Scientific Press, Oxford, United Kingdom, 1996.
- R. Guigó and J. W. Fickett.
"Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA."
Journal of Molecular Biology, 253:51-60 (1995)
- J. W. Fickett and R. Guigó.
"Estimation of protein coding density in a corpus of DNA sequence data."
Nucleic Acids Research, 20:2837-2844 (1993)
- R. Guigó, S. Knudsen, N. Drake, and T. F. Smith.
"Prediction of gene structure."
Journal of Molecular Biology, 226:141-157 (1992)