GENSCAN Accuracy for Invertebrate and Plant Sequences





-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-

This page summarizes the accuracy of GENSCAN for invertebrate (Drosophila) and plant (maize and Arabidopsis) genomic sequences. The first table gives accuracy statistics for the vertebrate version of the program on datasets of sequences from these organisms (see table legend for details). The nucleotide-level accuracy statistics for Drosophila and maize are generally comparable to the high levels achieved for vertebrate sequences, but the exon-level accuracy is lower to varying degrees. This version of the program is therefore recommended for Drosophila sequences and may be used for maize, but it is not recommended for Arabidopsis sequences because of the unacceptably low nucleotide-level sensitivity (0.81) and the high proportion of missed exons (0.25). Organism-specific versions of GENSCAN that perform better on maize and Arabidopsis sequences are discussed below.

                       Accuracy per bp          Accuracy per exon
Organism     No. Seq   Sn    Sp    AC    CC     Sn    Sp    (Sn+Sp)/2  ME    WE
Drosophila   202       0.96  0.92  0.89  0.90   0.68  0.68  0.68       0.11  0.10
Maize        42        0.94  0.93  0.90  0.90   0.67  0.71  0.69       0.09  0.08
Arabidopsis  120       0.81  0.93  0.78  0.84   0.57  0.72  0.66       0.25  0.04

Accuracy statistics for gene prediction programs are described here.
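
As a rough illustration of the per-bp columns, the sketch below computes Sn, Sp, AC and CC from per-base counts. It assumes the definitions conventionally used in gene-prediction benchmarks (Sn = TP/(TP+FN), Sp = TP/(TP+FP), AC as the rescaled average conditional probability, CC as the correlation coefficient). This page does not reproduce those formulas, so the function name, argument names and example counts here are hypothetical.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, AC, CC) from per-base counts.

    Assumed meaning of the counts (not defined on this page):
      tp: coding bases predicted as coding
      fp: noncoding bases predicted as coding
      tn: noncoding bases predicted as noncoding
      fn: coding bases predicted as noncoding
    No guard against zero denominators is included in this sketch.
    """
    sn = tp / (tp + fn)  # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are true

    # Average conditional probability, then rescaled to the range [-1, 1].
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )
    ac = (acp - 0.5) * 2

    # Correlation coefficient between predicted and actual coding status.
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, ac, cc


# Hypothetical counts for a 1000 bp sequence, for illustration only.
sn, sp, ac, cc = nucleotide_accuracy(tp=90, fp=10, tn=880, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} AC={ac:.2f} CC={cc:.2f}")
```

Note that Sp in this convention is the proportion of predicted coding bases that are correct, not the true-negative rate used in some other fields.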

Legend:

The vertebrate version of GENSCAN was tested on the sequence sets described below. The set of 202 Drosophila melanogaster GenBank loci used was constructed by D. Kulp (U. C. Santa Cruz) and M. G. Reese (Lawrence Berkeley National Laboratory) on 12 Dec. 1996 as a standard for training/testing of gene prediction methods and is available by anonymous ftp. The set of 41 Zea mays GenBank loci was constructed by V. Brendel at Stanford University and is available by email on request. The set of 120 Arabidopsis thaliana GenBank loci was also constructed by V. Brendel and is described in:

Kleffe, J., Hermann, K., Vahrson, W., Wittig, B. and Brendel, V. (1996) Nucl. Acids Res. 24, 4718-4728.

All three sets consist exclusively of nonredundant complete genes whose annotation appears reliable by a variety of tests.

-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-

Organism-Specific Versions of GENSCAN


Certain changes (to be described elsewhere) have been made to the GENSCAN parameters that significantly improve the accuracy of the program for maize and Arabidopsis sequences. The results of using these organism-specific parameters on the previously described sets of maize and Arabidopsis sequences are shown in the table below. For maize, these changes actually resulted in slightly lower nucleotide-level accuracy relative to the vertebrate parameter set, but this is more than compensated for by the greatly improved exon-level accuracy numbers. For Arabidopsis, the improvement is more even, with higher accuracy seen in almost every category.


                       Accuracy per bp          Accuracy per exon
Organism     No. Seq   Sn    Sp    AC    CC     Sn    Sp    (Sn+Sp)/2  ME    WE
Maize        42        0.86  0.96  0.86  0.90   0.78  0.87  0.84       0.15  0.04
Arabidopsis  120       0.91  0.93  0.86  0.89   0.67  0.69  0.69       0.11  0.08
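
The per-exon columns can be illustrated with a small sketch as well. It assumes the usual benchmark conventions: an exon counts as correct only if both boundaries match exactly, ME is the fraction of annotated exons that overlap no predicted exon, and WE is the fraction of predicted exons that overlap no annotated exon. These definitions are not spelled out on this page, and the function name and coordinates below are hypothetical. How values are pooled or averaged across the sequences of a test set is also not specified here, so the sketch handles a single sequence only.

```python
def exon_accuracy(actual, predicted):
    """Return exon-level Sn, Sp, (Sn+Sp)/2, ME and WE for one sequence.

    actual, predicted: iterables of (start, end) pairs with inclusive
    coordinates on the same strand (an assumed input format).
    """
    actual = set(actual)
    predicted = set(predicted)

    def overlaps(a, b):
        # Two inclusive intervals overlap if each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    # Correct exons: predicted with both boundaries exactly right.
    correct = actual & predicted
    # Missed exons: annotated exons touched by no prediction at all.
    missed = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    # Wrong exons: predictions that touch no annotated exon.
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]

    sn = len(correct) / len(actual)
    sp = len(correct) / len(predicted)
    return {
        "Sn": sn,
        "Sp": sp,
        "(Sn+Sp)/2": (sn + sp) / 2,
        "ME": len(missed) / len(actual),
        "WE": len(wrong) / len(predicted),
    }


# Hypothetical annotation and prediction, for illustration only.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
called = [(100, 200), (300, 410), (900, 950)]
result = exon_accuracy(annotated, called)
print(result)
```

In this example the prediction (300, 410) overlaps an annotated exon but misses one boundary, so it counts as neither correct nor wrong. This is why Sn and ME, or Sp and WE, need not sum to 1 in the tables above.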



The graphic above came from the POV-Ray archive.