Promoter prediction: practical exercise

Regulation of Human obese protein gene

Enrique Blanco - eblanco@imim.es

Abstract: In this exercise, the previously annotated promoter region of the leptin gene (obese protein gene) is used to test different methods for predicting regulatory elements. First, a matrix is constructed from a collection of real binding sites. Second, the TRANSFAC database is queried to extract existing matrices, and the promoter sequence is scanned for promoter motifs. Finally, because this scan yields a large number of false positives, a phylogenetic approach is suggested: the human and mouse homologues are aligned to locate the coordinates of the actual binding sites.

A. Description of the gene

Step 1. Retrieve the annotation and the sequence of the gene (EMBL database)

Go to EMBL database at EBI

mRNA sequence: Type U43653 in Nucleotide sequences

At the top, click on the EMBL:HS436531 entry

Have a look at the description: IDs, references, attributes, sequences

Find the coding sequence feature (FT CDS). Click on it and check that the ORF is correct: do the beginning and the end of the sequence correspond to the start and stop codons, respectively?
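
The ORF check above can also be done programmatically. The following is a minimal sketch in Python, assuming the CDS has been copied out of the EMBL record as a plain DNA string; the function name check_orf and the toy sequence are illustrative only and are not part of the original exercise.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_orf(cds):
    """Run basic sanity checks on a coding sequence given as a DNA string."""
    seq = "".join(cds.split()).upper()  # drop whitespace and line breaks
    # Split into complete codons, reading frame starting at the first base.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return {
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": seq[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(seq) % 3 == 0,
        # A stop codon anywhere before the last codon truncates the ORF.
        "internal_stop": any(c in STOP_CODONS for c in codons[:-1]),
    }

# Toy sequence for illustration only (this is NOT the leptin CDS).
toy_cds = "atggcttgttaa"
report = check_orf(toy_cds)
print(report)
```

A correct ORF should report True for the first three checks and False for internal_stop.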

Step 2. Learn more about the Leptin gene

Using a genome browser

Go back to the initial screen that contained the result of your first query.

On the left, you will find the Display Options box.

Select the FastaSeqs view and press the button Apply Display Options

Open the UCSC genome browser

Select the alignment program Blat (human genome)

Paste the Fasta sequence of the Leptin gene and submit the query

Browse the first hit in the list of matches

Have a look at the different display options. We recommend zooming out 10x from the initial picture to explore the genomic landscape around the gene. For instance, try to:

obtain the RefSeq gene sequence
check the presence of a CpG island in the promoter
examine the mRNAs supporting the gene annotation
evaluate the conservation between orthologues

Task1: What do you have to do if you want to see the computationally predicted transcription factor binding sites?

Task2: Try to locate the sequence in other genomes using BLAT (e.g. mouse)

Using the LocusLink database

Go to LocusLink database at NCBI

Type U43653 in Query

Click on the entry LEP (leptin)

Identify main fields in the entry: functional description, NM and NP annotations

Step 3. PROMOTER information: sequence and experimental annotation

Go to EMBL database at EBI and type U43589 in Nucleotide sequences

Promoter sequence:
promoter (FASTA sequence, 1000 bp upstream of the TSS) [Entry: U43589]

Publication:
Mason MM, He Y, Chen H, Quon MJ, Reitman M. Regulation of leptin promoter function by Sp1, C/EBP, and a novel factor. Endocrinology. 1998 Mar;139(3):1013-22.

Promoter annotation in GFF format (see more about GFF format here)

Figure 1. Graphical representation of the three regulatory elements annotated in the promoter U43589 (500 bp upstream of the TSS)
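
To work with a promoter annotation in GFF format outside a viewer, the records can be read with a few lines of Python. This is a minimal sketch, assuming the classic tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame). The records below are hypothetical placeholders: their coordinates and source label are invented for illustration and are not the published annotation of U43589.

```python
from typing import List, NamedTuple

class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str

def parse_gff(text: str) -> List[GffFeature]:
    """Parse tab-separated GFF records, skipping comments and blank lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        features.append(
            GffFeature(f[0], f[1], f[2], int(f[3]), int(f[4]), f[5], f[6], f[7])
        )
    return features

# Hypothetical records: coordinates are placeholders, not real annotations.
example_gff = (
    "# seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\n"
    "U43589\texample\tCEBP\t100\t110\t.\t+\t.\n"
    "U43589\texample\tSP1\t200\t209\t.\t+\t.\n"
    "U43589\texample\tTATA\t300\t306\t.\t+\t.\n"
)

features = parse_gff(example_gff)
for feat in features:
    print(feat.feature, feat.start, feat.end, feat.end - feat.start + 1)
```

GFF coordinates are 1-based and inclusive, which is why the length is computed as end - start + 1.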

B. Building representations of binding sites

Step 4. Accessing Transfac database

Go to TRANSFAC database

In TRANSFAC 6.0: choose Search action

Select the table of Factor

Enter the factor name TBP (TATA-binding protein)

Set Factor Name (FA) as the search field and submit the query

Select (T00794): you will find a description of the factor in human

(On the left) Find these fields: (BS) for binding sites, (MX) for matrices

Select one of the sites for inspection
Note: TRANSFAC is free for users from non-profit organizations but requires registration

Step 5. Building a model from a set of actual sites

This is a collection of real TBP sites extracted from TRANSFAC. Observe the different characteristics and the conservation of the core

Open the CLUSTALW webserver at EBI

Paste the collection of 23 TBP sites

Switch on the boxes:

ALIGNMENT = fast
COLOR ALIGNMENT = yes
OUTPUT FORMAT = aln wo/numbers

Press the Run button

Open the WebLogo webserver

Paste the CLUSTAL alignment into the corresponding box

Activate DNA/RNA in the Sequence type box

Submit the query (Create logo) to obtain a representation of the collection of TBP sites like the one below. Notice the highlighted core of the binding site (TATAAAA)

Figure 2. Graphical representation of the alignment of 23 real TATA binding sites
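
What the logo displays can be reproduced by hand: count the bases in each column of the alignment, take the most frequent base as the consensus, and compute the information content of each column. The sketch below assumes an ungapped alignment of equal-length sites and a uniform background, and omits the small-sample correction that logo software may apply. The four toy sites are invented for illustration; they are not the 23 TRANSFAC sites used in the exercise.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Return one Counter of base counts per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i].upper() for s in sites) for i in range(length)]

def consensus(counts):
    """Most frequent base per column (ties go to the first base in ACGT order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in counts)

def information_content(col):
    """Column information in bits: 2 minus the Shannon entropy."""
    total = sum(col[b] for b in BASES)
    entropy = -sum(
        (col[b] / total) * math.log2(col[b] / total) for b in BASES if col[b] > 0
    )
    return 2.0 - entropy

# Toy sites, invented for illustration only.
toy_sites = ["TATAAAA", "TATAAAT", "TATATAA", "CATAAAA"]
counts = count_matrix(toy_sites)
print(consensus(counts))
print([round(information_content(col), 2) for col in counts])
```

Fully conserved columns reach the maximum of 2 bits, which is why the core of the site stands out in the logo.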

Step 6. Obtaining the TRANSFAC position weight matrices

Go to TRANSFAC database

In TRANSFAC 6.0: choose Search action

Select the table of Matrix

Enter the factor name TATA

Set Factor Name (FA) as the search field and submit the query

There are two entries: M00252 and M00216

Select M00252 matrix

Repeat the procedure to retrieve the SP1 (M00008) and C/EBP (M00159) matrices

Keep the windows containing the three matrices open
Alternative solution: PROMO is a database of pre-computed matrices that allows you to select the species or group of species from which a new weight matrix will be constructed for a given factor, using TRANSFAC binding sites.
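
If you want to reuse a retrieved matrix in your own scripts, its count rows can be extracted from the flat-file text. The sketch below assumes the usual TRANSFAC matrix layout, in which a header line starting with P0 (or PO) names the four bases and each following row starts with a two-digit position number. The record shown is a made-up toy entry (accession M99999 does not refer to a real matrix), so the counts are placeholders.

```python
def parse_transfac_matrix(text):
    """Extract per-position base counts from a TRANSFAC-style matrix record."""
    rows = []
    bases = []
    in_matrix = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("P0", "PO"):
            # Header row: the next four fields name the columns (A C G T).
            bases = fields[1:5]
            in_matrix = True
            continue
        if in_matrix:
            if not fields[0].isdigit():
                break  # first non-numeric line ends the matrix block
            rows.append({b: float(v) for b, v in zip(bases, fields[1:5])})
    return rows

# Toy record with placeholder values (not a real TRANSFAC entry).
toy_record = """AC  M99999
XX
ID  TOY_MATRIX
XX
P0      A      C      G      T
01      0      1      0      3      T
02      4      0      0      0      A
03      0      0      0      4      T
04      4      0      0      0      A
XX
//
"""

matrix_rows = parse_transfac_matrix(toy_record)
for position, row in enumerate(matrix_rows, start=1):
    print(position, row)
```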

C. Computational prediction of regulatory elements (binding sites)

Step 7. Searching for the annotated regulatory elements with current matrices

Open RSA tools webserver

On the left frame, click on Pattern matching - patser (matrices)

Paste the Human obese protein gene promoter (1000 bp)

Select transfac as Matrix Format and paste the Transfac TATA matrix (including matrix header)

Set Origin to start (of the sequence) and press GO

Check the results: one of these two putative TATA sites is the real one (use the annotations)

To obtain a graphical representation of predictions, press feature map

Set as Display limits from 0 to 1000 and press GO

Repeat the procedure using the SP1 and C/EBP matrices, trying to find the real sites among the predictions. Notice the number of false positives predicted when using only one matrix
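
The idea behind matrix-based pattern matching can be sketched in a few lines: convert the counts into log-odds weights, slide the matrix along the sequence, and report every window whose score reaches a threshold. This is a simplified illustration and not the patser implementation; the pseudocount, the uniform background, the threshold and the toy data are all assumptions, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for col in counts:
        total = sum(col.get(b, 0) for b in BASES) + pseudocount
        matrix.append({
            b: math.log2(((col.get(b, 0) + pseudocount / 4) / total) / background)
            for b in BASES
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (1-based start, window, score) for forward-strand hits."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][ch] for i, ch in enumerate(window))
        if score >= threshold:
            hits.append((start + 1, window, score))
    return hits

# Toy counts for a 4-bp motif and a toy sequence, for illustration only.
toy_counts = [{"T": 4}, {"A": 4}, {"T": 4}, {"A": 4}]
weights = log_odds_matrix(toy_counts)
hits = scan("GGTATACC", weights, threshold=5.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

Lowering the threshold increases sensitivity but also the number of spurious hits, which is the trade-off observed in this step.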

Step 8. Ab initio promoter prediction

Go to TRANSFAC applications

Choose the program Match to scan promoter sequences searching for sites using the complete library of TRANSFAC matrices

Paste the Human obese protein gene promoter in the text area

Set cut-offs: 0.75 (matrix similarity) and 0.85 (core similarity)

Submit the query

Find the real annotations (e.g. TBP and CEBP) in this text output. Notice the huge number of false positive predictions

Figure 3. Graphical representation of predicted binding sites using MATCH + TRANSFAC in the promoter sequence U43589 (not all of the predictions are shown)
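
To get a feeling for what a similarity cut-off such as 0.75 means, a window's score can be rescaled between the worst and the best score the matrix can produce. The sketch below is a deliberately simplified, unweighted version of such a normalized score. It is not the Match algorithm, which as far as we know also weights positions by their information content and computes a separate core score. The toy counts are invented for illustration.

```python
BASES = "ACGT"

def similarity(window, counts):
    """Rescale the summed counts of a window to the range 0..1."""
    window = window.upper()
    current = sum(col.get(ch, 0) for ch, col in zip(window, counts))
    lowest = sum(min(col.get(b, 0) for b in BASES) for col in counts)
    highest = sum(max(col.get(b, 0) for b in BASES) for col in counts)
    if highest == lowest:
        return 0.0  # uninformative matrix: every window scores the same
    return (current - lowest) / (highest - lowest)

def passes(window, counts, cutoff=0.75):
    """Apply a matrix-similarity style cut-off to one window."""
    return similarity(window, counts) >= cutoff

# Toy counts for a 4-bp motif (placeholders, not a TRANSFAC matrix).
toy_counts = [{"T": 4}, {"A": 4}, {"T": 3, "A": 1}, {"A": 4}]
for candidate in ("TATA", "TAAA", "GATA"):
    print(candidate, round(similarity(candidate, toy_counts), 3),
          passes(candidate, toy_counts))
```

With a permissive cut-off, windows that differ from the consensus at one position still pass, which helps explain the large number of predictions.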

D. Comparative promoter prediction (human/mouse)

Step 9. Human-Mouse comparisons

We have obtained the homologous gene promoter (FASTA, 1000 bp upstream of the TSS) in mouse [Entry: U36238]

Now, these are the annotations (promoter elements) in both sequences (human and mouse)

This is a graphical comparison of both promoter annotations. Observe the phylogenetic footprinting or conservation in the regulatory elements

Figure 4. Graphical comparison of the annotations in the human promoter U43589 and its homologue in mouse (500 bp upstream of the TSS)

Step 10. Locating short conserved regulatory elements

Connect to Blast 2 Sequences web server

Paste both sequences [human promoter and mouse promoter] in the corresponding text boxes

To detect short conserved stretches of DNA, set the following parameters:

Mismatch = -5
Gap extension = 0

Notice that there are some short, very well conserved HSPs (BLAST fragments) at the end of the sequence. Check the annotations to verify whether or not they correspond to real binding sites

Figure 5. Graphical comparison of blastn alignment of human promoter U43589 and its homologue U36238 in mouse
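
The search for short conserved stretches can be imitated with a much simpler stand-in for blastn: index all k-mers of one sequence, look them up in the other, and extend each seed to a maximal exact match. Unlike BLAST, this sketch tolerates no mismatches or gaps and only compares the forward strands; the seed length k and the two toy sequences are assumptions made for illustration.

```python
from collections import defaultdict

def exact_matches(a, b, k=8):
    """Return (1-based start in a, 1-based start in b, matched text)
    for every maximal exact match of length >= k between a and b."""
    a, b = a.upper(), b.upper()
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    matches = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            # A seed preceded by an identical base pair belongs to a match
            # that starts further left and has already been reported.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = k
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i + 1, j + 1, a[i:i + length]))
    return matches

# Toy sequences, invented for illustration (not the real promoters).
toy_human = "CCCGGGTATAAAAGCTT"
toy_mouse = "AATTGTATAAAAGAAC"
conserved = exact_matches(toy_human, toy_mouse, k=6)
print(conserved)
```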

Now, ab initio promoter prediction searches can be performed again, but only on those interesting regions, using RSA tools or TRANSFAC
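
Instead of rerunning the searches, an equivalent shortcut is to keep only those predictions that overlap a conserved region. The sketch below assumes that predicted sites and conserved regions are available as 1-based inclusive coordinates on the human promoter; all names and coordinates shown are hypothetical placeholders, not results of the exercise.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two inclusive intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a

def keep_conserved(sites, regions):
    """Keep predicted sites (name, start, end) that overlap any conserved region."""
    return [
        site for site in sites
        if any(overlaps(site[1], site[2], r[0], r[1]) for r in regions)
    ]

# Hypothetical predictions and conserved regions (placeholder coordinates).
predicted_sites = [("TATA", 970, 976), ("SP1", 900, 909), ("CEBP", 120, 130)]
conserved_regions = [(890, 1000)]

filtered = keep_conserved(predicted_sites, conserved_regions)
print(filtered)
```

Predictions outside the conserved regions are discarded, which removes many false positives at the cost of missing sites that are not conserved between the two species.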

When more than two genomes are available, a multiple local alignment can be performed with programs such as MEME or AlignACE

E. Results

Here you can find the solutions to every exercise:

Gene annotation: EMBL record
Gene annotation: EMBL record (plain text)
FASTA sequence of the entry U43653
Gene annotation: Locus link
Promoter annotation: PubMed record
Promoter annotation: NCBI entry U43589

TBP site
Multiple alignment of TBPs
TBP sequence logo
TATA box matrix
SP1 matrix
cEBP matrix

Putative TATA boxes (text)
Putative SP1 sites (text)
Putative cEBP sites (text)
Putative TATA boxes (plot)
Putative SP1 sites (plot)
Putative cEBP sites (plot)
Match-TRANSFAC prediction

Promoter annotation: NCBI entry U36238 (mouse)
Blast2seq alignment

F. Bibliography

J.F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics 16:743-744 (2000).

Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research 31:374-378 (2003).

van Helden J. Regulatory sequence analysis tools. Nucleic Acids Res. 31:3593-3596 (2003).

JD Thompson, DG Higgins, and TJ Gibson. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680 (1994).

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215:403-410 (1990).

Timothy L. Bailey and Charles Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California (1994).

Roth, FP, Hughes, JD, Estep, PW & Church, GM. Finding DNA Regulatory Motifs within Unaligned Non-Coding Sequences Clustered by Whole-Genome mRNA Quantitation. Nature Biotechnology 16:939-945 (1998).

X. Messeguer, R. Escudero, D. Farré, O. Núñez, J. Martínez and M.Mar Albà. PROMO: detection of known transcription regulatory elements using species-tailored searches. Bioinformatics 18:333-334 (2002).

Mason MM, He Y, Chen H, Quon MJ, Reitman M. Regulation of leptin promoter function by Sp1, C/EBP, and a novel factor. Endocrinology. 139:1013-1022 (1998).