RESOURCES and DATASETS
Our group maintains the following datasets:
-  High throughput transcript discovery via array based normalisation of RACE libraries
DATASET is available from this link: /datasets/racearrays2007/ 
-  BMC Bioinformatics: Multiple Non-Collinear TF-map Alignments of Promoter Regions
Datasets and results for the human-mouse-chicken-zebrafish orthologous gene regions that were used to train and optimize the parameters of the multiple TF-map alignment. Also included are characterized real promoters and enhancers, and artificial non-collinear examples.
 DATASET is available from this link: /datasets/mmeta2006/
-  U12DB: a database of orthologous U12-type spliceosomal introns.
Database of clusters of orthologous U12 introns from 18 animal, 1 plant and 1 fungal species. 
 DATASET is available from this link: /datasets/u12/
-  PLoS Computational Biology (2006): Transcription Factor Map Alignment of Promoter Regions.
Dataset of the 40 human-mouse gene promoter pairs that were used to optimize the parameters of the TF-map alignment. Dataset of different genomic orthologous regions for these genes. Dataset and results of the TF-map alignment on the 5333 CISRED human co-expressed genes.
 DATASET is available from this link: /datasets/meta2005/
-  Nucleic Acids Research (2006): ABS: a database of Annotated regulatory Binding Sites from orthologous promoters.
A collection of 650 experimentally verified orthologous transcription factor binding sites (TFBSs), with annotations collected from the literature. The collection also includes the promoter sequences, cross-references to EntrezGene, PubMed and RefSeq, predictions by weight matrix collections, sequence alignments and graphical dotplots.
 DATASET is available from this link: /datasets/abs2005/
-  Genome Biology (2006): EGASP: the human ENCODE Genome Annotation Assessment Project.
Several evaluation programs were used to compare the accuracy of the gene predictions submitted to the GENCODE EGASP'05 workshop, held at the Sanger Center on May 6-7, 2005. The results of those evaluations are provided here, along with a discussion of the methods used to calculate the accuracy of each approach at three levels of gene structure: nucleotide, exon and transcript/gene.
 DATASET is available from this link: /datasets/egasp2005/
-  Nucleic Acids Research (2005): Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes.
Datasets of 311 putative novel human genes found using the comparative gene predictor SGP2 and the chicken genome sequence, the subset of the 50 most promising predictions tested by RT-PCR, and the GenBank accessions of the six RT-PCR positives.
 DATASET is available from this link: /datasets/ggalhsapgenes2005/
-  Genome Research (2005): Comparison of Splice Sites in Mammals and Chicken.
Datasets for the comparative analysis of splice site sequences on a large collection of human, mouse, rat and chicken introns. The analyses performed on those datasets focused on the conservation of orthologous splice sites, the evolution of the U2/U12 major intron classes and the subtype switching within those classes.
 DATASET is available from this link: /datasets/hmrg2004/
-  Bioinformatics (2004): Splice site identification by idlBNs.
Datasets of human splice sites from RefSeq-hg15 (ACCDON), internal exons from the Burset and Guigó and Rogic et al. human gene sets (BGROIEXONS) and splice, start and stop sites from RefSeq-hg16 not present in the Burset and Guigó and Rogic et al. human gene sets (NOBGRORS). 
 DATASET is available from this link: /datasets/splidlbns2004/
-  Science (2003): Selenoprotein gene prediction in Human.
All the programs and data used to identify selenoproteins in the human genome. Seven novel selenoprotein genes were found by SECIS and gene prediction, together with comparative genomics approaches. We believe the human selenoproteome to consist of 17 selenoprotein families (15kDa, DI, GPX, SelH, SelI, SelK, SelM, SelN, SelO, SelP, SelR, SelS, SelT, SelV, SelW, SPS2 and TR) and, in addition, two Cys-containing homologs (MsrA and SelU), which are selenoproteins in other organisms. 
 DATASET is available from this link: /datasets/sphuman2003/
-  PNAS (2003): Comparison of human and mouse genomes followed by experimental validation.
On this site we describe all the programs and data presented in Guigó et al., PNAS 2003. In that paper we estimated that nearly a thousand novel human genes that do not overlap known proteins can be verified experimentally. The method is based on the comparison of the human and mouse genomes to enhance the resulting gene predictions, plus a filtering step from which a sample of mouse predictions was tested by RT-PCR amplification and direct sequencing.
 DATASET is available from this link: /datasets/mouse2002/
-  Genome Research (2003): Comparative Gene Prediction in Human and Mouse.
Supplementary materials for the SGP2 paper are available from this section. SGP2 is a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions.
 DATASET is available from this link: /datasets/sgp2002/
-  EMBO reports (2001): Selenoprotein gene prediction in the Fly.
On this site we describe all the programs and data used to predict selenoproteins in the Drosophila melanogaster genome. Two novel selenoprotein families (SelK and SelH, previously named SelG and SelM) were found by coordination of gene and SECIS prediction. In addition, the fly genome is known to contain the SPS2 selenoprotein.
 DATASET is available from this link: /datasets/spdroso2001/
-  Nucleic Acids Research (2001): Canonical and non-canonical mammalian splice sites.
A database (SpliceDB) of known mammalian splice site sequences has been developed. Weight matrices were built for the major splice groups, which can be incorporated into gene prediction programs. 
 SpliceDB is available at the computational genomic Web server of the Sanger Center and has a mirror site at SoftBerry. [Burset, Seledtsov and Solovyev, Nucleic Acids Research 29(1):255-259 (2001)]
-  Genome Research (2000): Gene Prediction Programs Evaluation in Large DNA Sequences.
Given the absence of experimentally verified large genomic data sets, we constructed a semi-artificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions in order to analyze the accuracy of gene-prediction programs.
 DATASET is available from this link: /datasets/gpeval2000/
-  Genome Research (2000): geneid in Drosophila melanogaster.
A set of training sequences (exons/introns) and the resulting parameters required to run geneid on the Drosophila melanogaster genome.
 DATASET is available from this link: /datasets/Dro_me/
-  Genomics (1996): Evaluation of gene structure prediction programs.
A number of computer programs for the prediction of gene structure in DNA genomic sequences are analyzed. The programs are tested on a large set of vertebrate sequences.
 DATASET is available from this link: /datasets/genomics96/
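
Several of the entries above (EGASP, the evaluation in large DNA sequences, and the 1996 evaluation) concern measuring gene prediction accuracy at the nucleotide and exon levels. The following is a minimal sketch of how such measures are commonly computed, not the evaluation code distributed with any of these datasets. It assumes exons are given as inclusive (start, end) coordinates on a single strand, and it uses the gene-finding convention in which "specificity" means the fraction of predictions that are correct. The function names and the toy coordinates are hypothetical.

```python
def coding_positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Sensitivity: fraction of annotated coding nucleotides that are predicted.
    Specificity: fraction of predicted coding nucleotides that are annotated.
    """
    truth = coding_positions(annotated)
    guess = coding_positions(predicted)
    true_positives = len(truth & guess)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(guess) if guess else 0.0
    return sensitivity, specificity


def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both boundaries match exactly.
    """
    truth, guess = set(annotated), set(predicted)
    exact = len(truth & guess)
    sensitivity = exact / len(truth) if truth else 0.0
    specificity = exact / len(guess) if guess else 0.0
    return sensitivity, specificity


# Hypothetical toy example: the second predicted exon is shifted by 20 bases.
annotated_exons = [(100, 199), (300, 399)]
predicted_exons = [(100, 199), (320, 419)]

nt_sn, nt_sp = nucleotide_accuracy(annotated_exons, predicted_exons)
ex_sn, ex_sp = exon_accuracy(annotated_exons, predicted_exons)
print(f"nucleotide level: Sn={nt_sn:.2f} Sp={nt_sp:.2f}")
print(f"exon level:       Sn={ex_sn:.2f} Sp={ex_sp:.2f}")
```

In this toy case the shifted exon still shares most of its bases with the annotation, so it scores well at the nucleotide level while counting as a miss at the exon level. This is one reason accuracy is usually reported at more than one level.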