geneid source docs

Description:

This module scores predicted exons (assigning each a measure of reliability) and filters them. Three scoring sources make up the final value for a given exon: the score from signals (sites), the score from protein coding potential, and the score from homology information, when provided. Statistical parameters for every type of exon are read from the parameter file. For protein coding potential, a Markov model of order 5 is used, with support for several isochores so that predictions can be made on sequences of different G+C content. G+C frequencies and Markov transition scores are computed with the accumulated sum technique, which gives a linear overall cost instead of the usual quadratic one.

Briefing:

int SelectIsochore(float percent, gparam** isochores)

Given a float value (between 0 and 1) representing the G+C content of a DNA region, returns the identifier of the isochore whose coding potential Markov model is adapted to that range. Isochores are not assumed to be sorted and their ranges are not verified anywhere, so take great care when modifying the parameter file.
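The lookup described above can be sketched as a linear search over the isochore ranges. This is only an illustration: IsochoreRange and its fields are hypothetical stand-ins for whatever gparam stores, and the explicit count and the -1 return value are additions for safety that the documented signature does not have.

```c
/* Hypothetical stand-in for the G+C range held by each isochore's parameters. */
typedef struct {
    float low;   /* inclusive lower bound of the G+C fraction */
    float high;  /* inclusive upper bound of the G+C fraction */
} IsochoreRange;

/* Returns the index of the first isochore whose range contains gc,
   or -1 if no range matches. The ranges do not need to be sorted. */
int select_isochore(float gc, const IsochoreRange *iso, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (gc >= iso[i].low && gc <= iso[i].high)
            return i;
    }
    return -1;
}
```

Because the first matching range wins, overlapping or gapped ranges in the parameter file silently change which model is selected, which is why edits to that file need care.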

float ComputeGC(packGC* GCInfo, long inigc, long endgc)

G+C content (percentage): computing step. For any region (a subsequence of the original sequence or fragment), the G+C percentage is computed quickly with the accumulated sum technique. Instead of rescanning the sequence every time the G+C percentage of a subsequence (e.g. an exon) is needed, it is much more efficient to record the accumulated count of a given nucleotide up to every position; the absolute frequency of that nucleotide between 2 positions is then the difference between the two accumulated values (linear total cost instead of quadratic). Unknown nucleotides (N) are not taken into account, because sequences sometimes contain a significant number of them.

void CGScan(char* s, packGC* GCInfo, long l1, long l2)

G+C content (percentage): pre-processing step. Scans the whole sequence, counting how many C/Gs and Ns occur up to each position. The G+C content of any subsequence of the original input is then the difference between the accumulated values stored at its two ends (i.e. start and end of the subsequence), divided by the length of the subsequence.
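The two G+C steps (one linear scan, then constant-time queries) can be sketched as follows. GCSums and the function names are hypothetical; the real packGC layout is not shown in this document. Excluding Ns from the denominator is an assumption based on the note that unknown nucleotides are not taken into account.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the accumulated counts. */
typedef struct {
    long *gc;   /* gc[i] = number of G/C in s[0..i-1] */
    long *n;    /* n[i]  = number of N   in s[0..i-1] */
    long len;   /* sequence length */
} GCSums;

/* Pre-processing: a single linear pass. Returns 0 on success, -1 on failure. */
int gc_sums_build(const char *s, long len, GCSums *out)
{
    long i;

    out->gc = malloc((size_t)(len + 1) * sizeof *out->gc);
    out->n = malloc((size_t)(len + 1) * sizeof *out->n);
    if (out->gc == NULL || out->n == NULL) {
        free(out->gc);
        free(out->n);
        return -1;
    }
    out->len = len;
    out->gc[0] = 0;
    out->n[0] = 0;
    for (i = 0; i < len; i++) {
        char c = s[i];
        out->gc[i + 1] = out->gc[i] + (c == 'G' || c == 'C' || c == 'g' || c == 'c');
        out->n[i + 1] = out->n[i] + (c == 'N' || c == 'n');
    }
    return 0;
}

/* Computing: G+C fraction of s[ini..end] (inclusive) in constant time.
   Ns are excluded from the denominator; an all-N region yields 0. */
float gc_fraction(const GCSums *p, long ini, long end)
{
    long gc, known;

    assert(ini >= 0 && ini <= end && end < p->len);
    gc = p->gc[end + 1] - p->gc[ini];
    known = (end - ini + 1) - (p->n[end + 1] - p->n[ini]);
    return known > 0 ? (float)gc / (float)known : 0.0f;
}

void gc_sums_free(GCSums *p)
{
    free(p->gc);
    free(p->n);
    p->gc = NULL;
    p->n = NULL;
    p->len = 0;
}
```

Building the sums costs one pass over the sequence; every later query is two subtractions and one division, regardless of the length of the region.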

long OligoToInt(char* s, int ls)

Translates a string into a numerical value according to the function f such that f(A) = 0, f(C) = 1, f(G) = 2, f(T) = 3, and f(N) = 4. It is used to index arrays by oligonucleotide, translating each oligonucleotide into an integer.
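A minimal sketch of such a translation, reading the oligonucleotide as a base-4 number. How the real function folds f(N) = 4 into the index is not described here, so this sketch makes an assumption: any oligonucleotide containing N (or another unknown character) is rejected with -1 rather than indexed. The names base_code and oligo_to_int are hypothetical.

```c
/* Maps one nucleotide to its code: A=0, C=1, G=2, T=3, anything else (N included) = 4. */
static int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Base-4 index of the first ls characters of s,
   or -1 if any of them is not A, C, G or T (assumption of this sketch). */
long oligo_to_int(const char *s, int ls)
{
    long index = 0;
    int i;

    for (i = 0; i < ls; i++) {
        int code = base_code(s[i]);
        if (code == 4)
            return -1;
        index = index * 4 + code;
    }
    return index;
}
```

With this encoding a pentanucleotide maps to 0..1023 and a hexanucleotide to 0..4095, which are the table sizes an order-5 model needs.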

void MarkovScan(char* sequence,
                gparam* gp,
                float* OligoDistIni[3], 
                float* OligoDistTran[3],
                long l1, long l2)

Score exons: pre-processing step. Exons are scored with a Markov model consisting of an initial matrix and transition matrices. The score of a given exon is the score assigned to its first pentanucleotide (initial value) plus the score computed for its hexanucleotide content, derived from codon bias (transition values). To compute the transition value, accumulated scores are computed over the original input for each of the 3 possible frames. The score of an exon is then the difference between the accumulated values at its 2 delimiting positions. In this way each exon is scored at constant cost instead of being scanned in full (linear), so scoring all exons takes linear rather than quadratic time.
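The transition part of this scheme can be sketched with one accumulated array per frame. Everything below is illustrative: the flat table layout tran[pos * N_HEX + hexamer], the function names, the frame bookkeeping (i + f) % 3 and the choice to score hexanucleotides containing N as 0 are assumptions, and the initial pentanucleotide score is left out.

```c
#include <assert.h>

#define HEX_LEN 6      /* order 5: five bases of context plus one emitted base */
#define N_HEX   4096   /* 4^6 possible hexanucleotides */

/* Base-4 index of the hexanucleotide starting at s, or -1 if it has a non-ACGT base. */
static long hex_index(const char *s)
{
    long idx = 0;
    int i;

    for (i = 0; i < HEX_LEN; i++) {
        int code;
        switch (s[i]) {
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        case 'T': code = 3; break;
        default:  return -1;
        }
        idx = idx * 4 + code;
    }
    return idx;
}

/* Pre-processing. tran holds 3 * N_HEX scores (one table per codon position);
   each acc[f] must have room for len + 1 floats.
   acc[f][i + 1] - acc[f][i] is the score of the hexanucleotide starting at i,
   taken from the table for codon position (i + f) % 3. */
void markov_accumulate(const char *s, long len, const float *tran, float *acc[3])
{
    int f;
    long i;

    for (f = 0; f < 3; f++) {
        acc[f][0] = 0.0f;
        for (i = 0; i < len; i++) {
            float t = 0.0f;   /* incomplete or unknown hexanucleotides add nothing */
            if (i + HEX_LEN <= len) {
                long h = hex_index(s + i);
                if (h >= 0)
                    t = tran[((i + f) % 3) * N_HEX + h];
            }
            acc[f][i + 1] = acc[f][i] + t;
        }
    }
}

/* Computing: summed transition score of the hexanucleotides starting in
   [ini, end] for frame f, obtained in constant time. */
float markov_range_score(float *acc[3], int f, long ini, long end)
{
    assert(f >= 0 && f < 3 && ini >= 0 && ini <= end);
    return acc[f][end + 1] - acc[f][ini];
}
```

The scan costs three passes over the sequence; after that, scoring each candidate exon is a single subtraction, however long the exon is.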

void HSPScan(packExternalInformation* external,
             packHSP* hsp, 
             int Strand, 
             long l1, long l2)

[Optional]. If homology information about the input sequence is provided, the HSPs overlapping the current DNA fragment are projected onto an array of l2 - l1 + 1 positions.

void HSPScan2(packExternalInformation* external,
              packHSP* hsp, 
              int Strand, 
              long l1, long l2)

[Optional]. If homology information about the input sequence is provided, the array containing the HSP projection is preprocessed to store precomputed sums, so that the sum over any range of positions in the current sequence fragment is readily available.
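The two HSP steps can be pictured as a projection followed by an in-place accumulation. The Hsp record and the function names below are hypothetical, and keeping the best score per position is taken from the description of HSPs in this document; the real packHSP and packExternalInformation structures are not shown here.

```c
#include <assert.h>

/* Hypothetical minimal HSP record: inclusive coordinates on the sequence plus a score. */
typedef struct {
    long start;
    long end;
    float score;
} Hsp;

/* Projection: proj[p - l1] receives the best score among the HSPs covering
   position p (l1 <= p <= l2), or 0 where no HSP overlaps. */
void hsp_project(const Hsp *hsps, int n, long l1, long l2, float *proj)
{
    long p;
    int k;

    assert(l1 <= l2);
    for (p = 0; p <= l2 - l1; p++)
        proj[p] = 0.0f;
    for (k = 0; k < n; k++) {
        /* Clip each HSP to the current fragment. */
        long a = hsps[k].start < l1 ? l1 : hsps[k].start;
        long b = hsps[k].end > l2 ? l2 : hsps[k].end;
        for (p = a; p <= b; p++) {
            if (hsps[k].score > proj[p - l1])
                proj[p - l1] = hsps[k].score;
        }
    }
}

/* Accumulation in place: proj[i] becomes the sum of the projected values at 0..i. */
void hsp_accumulate(float *proj, long n)
{
    long i;

    for (i = 1; i < n; i++)
        proj[i] += proj[i - 1];
}

/* Sum of the projected values over [a, b] (offsets relative to l1), constant time. */
float hsp_range_sum(const float *acc, long a, long b)
{
    assert(a >= 0 && a <= b);
    return acc[b] - (a > 0 ? acc[a - 1] : 0.0f);
}
```

An exon's homology support can then be read off with one subtraction over its coordinates instead of a walk over every position it covers.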

float ScoreHSPexon(exonGFF* exon, 
                   int Strand, 
                   packExternalInformation* external, 
                   long l1, long l2)

[Optional]. If homology information about the input sequence is provided, exons supported by homology regions (total or partial intersection) have their score increased proportionally. HSPs are regions of similarity to proteins: projections onto the sequence of BLAST High-scoring Segment Pairs, in which the best score for every position is recorded.

long Score(exonGFF *Exons,
           long nExons,
           long l1,
           long l2,
           int Strand,
           packExternalInformation* external,
           packHSP* hsp,
           gparam** isochores,
           packGC* GCInfo)

Score exons: computing step. For every exon in the input, computes the coding potential score (Markov chain) using the isochore that matches its G+C content. The homology-to-protein score is computed from a provided list of HSPs. The final exon score is a weighted combination (site, exon and homology factors) of the coding potential score and the scores of both signals, plus the homology score. Cutoff points apply to the coding potential score and to the final score. Returns the number of exons that pass the filter.
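The combine-and-filter step might look roughly like the sketch below. The exact formula is not given in this document, so the linear combination, the structure fields and all names here are assumptions chosen only to illustrate weighting, the two cutoffs and the returned count.

```c
/* Hypothetical per-exon scores; the real exonGFF structure is not shown here. */
typedef struct {
    float signal_start;  /* score of the site that opens the exon */
    float signal_end;    /* score of the site that closes the exon */
    float coding;        /* coding potential (Markov) score */
    float homology;      /* score derived from HSPs, 0 if none */
    float score;         /* final score, filled in by score_and_filter */
} ExonSketch;

/* Hypothetical weights and cutoffs, as would be read from the parameter file. */
typedef struct {
    float site_factor;
    float exon_factor;
    float hsp_factor;
    float coding_cutoff;
    float score_cutoff;
} ScoreParams;

/* Scores every exon, keeps those that pass both cutoffs (compacted to the
   front of the array, original order preserved) and returns how many remain. */
long score_and_filter(ExonSketch *exons, long n, const ScoreParams *p)
{
    long i;
    long kept = 0;

    for (i = 0; i < n; i++) {
        ExonSketch e = exons[i];

        if (e.coding < p->coding_cutoff)
            continue;                       /* fails the coding potential cutoff */
        e.score = p->site_factor * (e.signal_start + e.signal_end)
                + p->exon_factor * e.coding
                + p->hsp_factor * e.homology;
        if (e.score < p->score_cutoff)
            continue;                       /* fails the final score cutoff */
        exons[kept++] = e;
    }
    return kept;
}
```

Checking the coding potential cutoff before computing the final score avoids work on exons that would be discarded anyway.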

void ScoreExons(char *Sequence, 
                packExons* allExons, 
                long l1,
                long l2,
                int Strand,
                packExternalInformation* external,
                packHSP* hsp,
                gparam** isochores,
                int nIsochores,
                packGC* GCInfo)

Main exon scoring routine. Controls the data flow.