GENSCAN


Accuracy vs Exon Probability



-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-
 



An important feature of GENSCAN is that, because it is based on a probabilistic model of genomic sequence composition and gene structure, it can assign meaningful probabilities to particular events, e.g., the event E that a particular exon is correct. This probability, P(E), is defined as the sum of the probabilities under the model of all possible "parses" (gene structure descriptions) which contain the exact exon E in the correct reading frame. Though the number of terms in this sum is typically far too large to evaluate by exhaustive enumeration, the sum can be calculated in a reasonable amount of time using an approach called the "forward-backward" procedure (see Rabiner, L., 1989 Proc. IEEE 77, 257-285 for a general discussion of this method or my thesis for a description of the streamlined method used in the context of GENSCAN). The probability of each predicted exon calculated in this fashion is displayed in the second-to-last column of the text output (headed by the letter "P"). Such probabilities provide a useful quantitative guide to the likelihood that a given exon is correct. This was demonstrated by partitioning exons predicted in the Burset & Guigó (1996) set of 570 vertebrate gene sequences on the basis of the exon probability and then determining accuracy statistics for each group separately. These data are shown in the table below (see also Burge, C. & Karlin, S. (1997) J. Mol. Biol. 268, 78-94).
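The idea of summing over all parses that contain a given segment can be illustrated on a toy model. The sketch below is not GENSCAN's model or its streamlined algorithm: it assumes a plain two-state hidden Markov model (N = noncoding, C = coding) with made-up parameters, and every name in it (STATES, START, TRANS, EMIT, segment_posterior, brute_force_segment_posterior) is hypothetical. It computes the posterior probability that positions i..j form exactly one coding segment flanked by noncoding positions, once with the forward-backward procedure and once by enumerating every state path, to show that the two agree.

```python
from itertools import product

# Hypothetical two-state toy model; these are NOT GENSCAN's states or parameters.
STATES = ("N", "C")  # N = noncoding, C = coding
START = {"N": 0.8, "C": 0.2}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.2, "C": 0.8}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def forward(obs):
    """f[t][s] = P(obs[0..t], state at position t is s)."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][x] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f


def backward(obs):
    """b[t][s] = P(obs[t+1..] | state at position t is s)."""
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][x] * nxt[r] for r in STATES)
                     for s in STATES})
    return b


def segment_posterior(obs, i, j):
    """P(positions i..j are one coding run flanked by N | obs).

    Requires 1 <= i <= j <= len(obs) - 2 (0-based, inclusive).
    """
    f, b = forward(obs), backward(obs)
    total = sum(f[-1].values())          # P(obs), summed over all paths
    p = f[i - 1]["N"] * TRANS["N"]["C"]  # everything up to the segment start
    for k in range(i, j + 1):
        p *= EMIT["C"][obs[k]]
        if k < j:
            p *= TRANS["C"]["C"]
    p *= TRANS["C"]["N"] * EMIT["N"][obs[j + 1]] * b[j + 1]["N"]
    return p / total


def brute_force_segment_posterior(obs, i, j):
    """Same quantity by exhaustive enumeration of all 2**len(obs) paths."""
    total = hit = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
        total += p
        if (path[i - 1] == "N" and path[j + 1] == "N"
                and all(path[k] == "C" for k in range(i, j + 1))):
            hit += p
    return hit / total


if __name__ == "__main__":
    seq = "ATGCGCGTA"
    print(segment_posterior(seq, 2, 6))
    print(brute_force_segment_posterior(seq, 2, 6))
```

The enumeration grows exponentially with sequence length, while the forward and backward passes grow linearly, which is why the latter is practical for genomic sequences.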


Probability   Predicted   Accuracy Class
Range         Exons       Exactly Correct   Partially Correct   Overlapping   Wrong
0.00 - 0.50     248       29.8%             27.8%               4.0%          38.3%
0.50 - 0.75     362       54.1%             26.2%               2.2%          17.4%
0.75 - 0.90     337       74.8%             16.0%               1.2%           8.0%
0.90 - 0.95     263       87.8%              6.1%               0.4%           5.7%
0.95 - 0.99     551       92.4%              3.4%               0.2%           4.0%
0.99 - 1.00     917       97.7%              0.9%               0.0%           1.4%
Total         2,678       80.6%              9.7%               0.9%           8.8%


Legend. A predicted exon is said to be exactly correct if it matches a true (annotated) exon precisely, i.e. both endpoints correct; partially correct if one endpoint is correct; overlapping if neither endpoint is correct, but it overlaps one or more true exons; and wrong if it does not overlap a true exon.
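The four accuracy classes in the legend can be expressed as a small function. This is a minimal sketch under stated assumptions, not the evaluation code used to produce the table: exons are represented as (start, end) pairs with inclusive coordinates, strand and reading frame are ignored, and an endpoint counts as "correct" when the predicted start equals some true exon's start or the predicted end equals some true exon's end. The name classify_exon is hypothetical.

```python
def classify_exon(pred, true_exons):
    """Assign a predicted exon to one of the four accuracy classes.

    pred and each entry of true_exons are (start, end) tuples with
    inclusive coordinates (an assumption of this sketch).
    """
    ps, pe = pred
    # Both endpoints match the same annotated exon.
    if any((ps, pe) == (ts, te) for ts, te in true_exons):
        return "exactly correct"
    # One endpoint matches the corresponding endpoint of an annotated exon.
    if any(ps == ts or pe == te for ts, te in true_exons):
        return "partially correct"
    # No endpoint matches, but the intervals share at least one position.
    if any(ps <= te and ts <= pe for ts, te in true_exons):
        return "overlapping"
    return "wrong"


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400)]
    for predicted in [(100, 200), (100, 180), (120, 180), (210, 290)]:
        print(predicted, classify_exon(predicted, annotated))
```

Tallying these labels separately for each probability range would give the per-class percentages of the kind reported in the table.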




Some implications of the data shown in the table above are as follows.



-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-

Please address any comments/questions/suggestions to: cburge@mit.edu


The graphic above came from the POV-Ray archive.