ScoreExons.c  geneid v 1.2 source documentation 
Description: 
This module is implemented to score (to give a measure of reliability) and
filter predicted exons. There are 3 different scoring sources which make
up the final value for a given exon: score from signals (sites), score from
protein coding potential probability and score from provided homology
information. Statistical parameters for every type of exon are extracted
from parameters file. For protein coding potential, a Markov model of order
5 is employed, supporting different isochores usage for predictions on
different G+C content sequences. G+C frequencies and Markov transition
scores are computed by using the accumulated sum technique, with a linear
cost instead of the usual quadratic value.

Briefing: 
int SelectIsochore(float percent, gparam** isochores) 
Given a float value (between 0 and 1) representing the G+C value in a DNA
region, returns the identifier of the isochore whose coding potential
Markov model is adapted to work under this range. Isochores are not supposed
to be sorted and range is not verified anywhere so it is strongly recommended
to be careful when parameter file is modified.

float ComputeGC(packGC* GCInfo, long inigc, long endgc) 
G+C content (percentage): computing step. For every region, subsequence of
the original sequence or fragment, the percent of G+C is quickly computed
by using the accumulated sum technique: instead of scanning the sequence
whenever is necessary to count the G+C percentage in a subsequence of the
original input (i.e. exons), it is much more efficient to write down the frequency
of a given nucleotide until every position, and then, the absolute frequency
for that nucleotide between 2 positions is the rest of both accumulated
values (Linear time versus quadratic cost). Unknown nucleotides (N) are not
taken into account because sequences sometimes might contain an important
amount of them.

void CGScan(char* s, packGC* GCInfo, long l1, long l2) 
G+C content (percentage): preprocessing step. Scan the whole sequence, counting
how many C/Gs or Ns are in. Then, resting the accumulated values stored
in any two positions (i.e. start and end of a subsequence) divided by the
rest of these values (i.e. length of the subsequence) is the G+C content
of the corresponding subsequence of the original input, between those
two positions.

long OligoToInt(char* s, int ls) 
Translation from a string into a numerical value according to the
function f such that f(A) = 0, f(C) = 1, f(G) = 2, f(T) = 3, and f(N) = 4.
It is used to index arrays using olinucleotides by translating them into
integers.

void MarkovScan(char* sequence, gparam* gp, float* OligoDistIni[3], float* OligoDistTran[3], long l1, long l2) 
Score exons: preprocessing step. Exons are scored by using a Markov model:
initial matrix and transition matrices. Score of a given exon is: score
assigned for the first pentanucleotide (initial value) plus score computed
for the hexanucleotides content derived from codon bias (transition values).
To compute the transtion value, the accumulated values of scores for every
possible (3) subsequence into the original input are computed. Then, to score
an exon, the rest between the accumulated values for its 2 ends or delimiting
positions must be computed. In this way, scoring one exon is executed with
a constant cost instead of scanning the whole exon (linear), for every exon
(linear time versus quadratic).

void HSPScan(packExternalInformation* external, packHSP* hsp, int Strand, long l1, long l2) 
[Optional]. If homology information about the input sequence is provided,
projection of HSPs overlapping current fragment of DNA is performed into
an array of l2  l1 +1 positions.

void HSPScan2(packExternalInformation* external, packHSP* hsp, int Strand, long l1, long l2) 
[Optional]. If homology information about the input sequence is provided,
the array containing HSP projection is preprocessed to save precomputed
sums of every subset of positions in the current sequence fragment.

float ScoreHSPexon(exonGFF* exon, int Strand, packExternalInformation* external, long l1, long l2) 
[Optional]. If homology information about the input sequence is provided,
exons supported (total or partial intersection) by homology regions increase their
score proportionally. HSPs are similarity to protein regions (projections
over the sequence of blast Highscoring Segment Pairs in which, the
best score for every position is recorded). .

long Score(exonGFF *Exons, long nExons, long l1, long l2, int Strand, packExternalInformation* external, packHSP* hsp, gparam** isochores, packGC* GCInfo) 
Score exons: computing step. For every exon from the input, computing the
coding potential score (Markov chain) by using a specific isochore
according to its G+C content. Homology to protein score is computed from
a provided list of HSPs. The final exon score result of a weighted
combination (site, exon and homology factors) between coding potential
score and score from both signals, plus homology score. There are
cutoff points for the coding potential and final scores. It returns
the number of exons overcoming the filter.

void ScoreExons(char *Sequence, packExons* allExons, long l1, long l2, int Strand, packExternalInformation* external, packHSP* hsp, gparam** isochores, int nIsochores, packGC* GCInfo) 
Main exon scoring routine. control the data flow.

Enrique Blanco Garcia © 2003