Filtering Out Repetitive Elements
-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .-. .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-' `-
The presence of certain types of repetitive elements in a
sequence may sometimes distort the results of GENSCAN.
In particular, L1 elements are often predicted as genes.
To avoid this potential problem,
you may wish to pre-screen for repetitive elements
with a program such as
RepeatMasker
or censor, which replace sequence segments
matching any of a set of elements common to your organism
(e.g., Alu or L1) with the same number of asterisks or Ns.
(To get instructions for the censor email server, send mail to
censor@charon.lpi.org
with the word "help" in the body of the message.)
Note that GENSCAN does accept sequences containing Ns or asterisks
and that long stretches of such symbols are interpreted as probable
repetitive elements (i.e. non-coding DNA).
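
The masking step described above can be illustrated with a minimal
Python sketch. It assumes the repeat coordinates are already known
(for example, parsed from the output of a repeat-finding program) and
are given as 0-based, half-open intervals. The function name
mask_intervals and the interval format are illustrative assumptions,
not part of GENSCAN, RepeatMasker or censor.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Return a copy of sequence with each (start, end) interval masked.

    Intervals are 0-based and half-open. Each masked segment is replaced
    by the same number of mask characters, so the sequence length and
    all downstream coordinates are unchanged.
    """
    if len(mask_char) != 1:
        raise ValueError("mask_char must be a single character")
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical example: mask positions 2..4 of a short sequence.
    print(mask_intervals("ACGTACGTAC", [(2, 5)]))
```

Because the replacement preserves length, coordinates reported for the
masked sequence still apply to the original sequence.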
For large-scale sequencing efforts, other repeat-screening methods
are also available, e.g., masking repeats detected by BLASTN or
TBLASTN using the XBLAST procedure (Claverie, J.-M. (1994)
In Automated DNA Sequencing and Analysis Techniques,
M. D. Adams, C. Fields and J. C. Venter, eds., pp. 267-279.)
Another option is to
filter out repeats after running GENSCAN, for example by
screening the GENSCAN-predicted peptides against a database of
repeat sequences translated in all six frames.
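
As a rough illustration of the six-frame translation mentioned above,
the following Python sketch translates a DNA sequence in the three
forward and three reverse-complement frames using the standard genetic
code. The function names are illustrative assumptions. Comparing the
predicted peptides against the translated repeats would normally be
done with a separate sequence-search program, which is not shown here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with ambiguous bases give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translate("ATGCAGCAG").items():
        print(frame, peptide)
```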
The accuracy statistics reported in the paper
and on the associated web pages are for GENSCAN alone
without any sort of pre- (or post-) screening for repetitive elements.
Be aware also that
some programs such as RepeatMasker may (optionally)
also screen out
simple sequence regions (microsatellites) such as
ATATATATATATATATATAT, CAGCAGCAGCAGCAGCAGCAGCAG, etc.
It is probably best not to screen out this type of repeat
before running GENSCAN since such regions may sometimes be
contained in coding exons (e.g., the CAG repeats which
code for poly-glutamine in certain triplet repeat
disease genes).
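
If your masking program has already been run, one way to act on this
advice is to identify which regions are simple repeats so they can be
left unmasked. The Python sketch below is a simple regular-expression
heuristic, assuming repeat units of 1 to 6 bases and a minimum of four
tandem copies. These thresholds and the function name are illustrative
assumptions and do not reproduce the method used by RepeatMasker.

```python
import re

# A unit of 1-6 bases (shortest unit preferred) followed by at least
# three more exact copies of itself, i.e. four or more tandem copies.
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def find_simple_repeats(sequence):
    """Return (start, end, unit) for each perfect tandem repeat found.

    Coordinates are 0-based and half-open. Only exact copies are
    detected; interrupted or degenerate repeats are missed.
    """
    return [
        (match.start(), match.end(), match.group(1))
        for match in SIMPLE_REPEAT.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Hypothetical example: a CAG tract flanked by unique sequence.
    example = "GGTT" + "CAG" * 8 + "TTGACA"
    print(find_simple_repeats(example))
```

Intervals reported by such a check could be excluded from the list of
regions to mask, so that tracts like the CAG repeats remain visible to
GENSCAN.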
Please address any comments/questions/suggestions to:
Chris Burge
(cburge@mit.edu)
The graphic above came from the
POV-Ray archive.