Filtering Out Repetitive Elements




-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .-. .-.   .
||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|||X|||\ /|
|/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X|||/ \|||X||
'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-'   `-' `-





Certain types of repetitive elements in a sequence can distort the results of GENSCAN. In particular, L1 elements are often predicted as genes. To avoid this problem, you may wish to pre-screen for repetitive elements with a program such as RepeatMasker or censor. These programs replace sequence segments matching any of a set of elements common to your organism (e.g., Alu, L1) with the same number of asterisks or Ns. (To get instructions for the censor email server, send mail to censor@charon.lpi.org with the word "help" in the body of the message.) Note that GENSCAN accepts sequences containing Ns or asterisks and interprets long stretches of such symbols as probable repetitive elements (i.e., non-coding DNA).

For large-scale sequencing efforts, other repeat-screening methods are also available, e.g., masking repeats detected by BLASTN or TBLASTN using the XBLAST procedure (Claverie, J.-M. (1994) In Automated DNA Sequencing and Analysis Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., pp. 267-279). Another option is to filter out repeats after running GENSCAN, e.g., by screening GENSCAN-predicted peptides against a database of repeat sequences translated in all six frames.

The accuracy statistics reported in the paper and on the associated web pages are for GENSCAN alone, without any pre- or post-screening for repetitive elements.

Be aware also that some programs, such as RepeatMasker, can optionally screen out simple sequence regions (microsatellites) such as ATATATATATATATATATAT or CAGCAGCAGCAGCAGCAGCAGCAG. It is probably best not to screen out this type of repeat before running GENSCAN, since such regions are sometimes contained in coding exons (e.g., the CAG repeats which code for poly-glutamine in certain triplet repeat disease genes).
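The masking step described above, including the advice to leave simple sequence regions unmasked, can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name mask_repeats, the (start, end, class) interval format, and the class labels used here are hypothetical, and this is not the actual output format or behavior of RepeatMasker or censor. Converting a real repeat-finder's output into such intervals is left out.

```python
def mask_repeats(sequence, repeats, mask_char="N", skip_classes=("Simple_repeat",)):
    """Return `sequence` with annotated repeat intervals replaced by `mask_char`.

    `repeats` is an iterable of (start, end, repeat_class) tuples using
    0-based, half-open coordinates. This interval format is an assumption
    made for this sketch, not the output format of any particular tool.
    Intervals whose class is listed in `skip_classes` are left unmasked,
    since simple repeats may lie inside coding exons.
    """
    masked = list(sequence)
    for start, end, repeat_class in repeats:
        if repeat_class in skip_classes:
            continue  # keep microsatellite-like regions as they are
        # Clip the interval to the sequence so bad coordinates cannot overrun.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = mask_char  # same number of symbols as the original segment
    return "".join(masked)


# Hypothetical example: one interspersed element and one CAG repeat.
seq = "ACGTACGTAACAGCAGCAGCAGTTGGCCAATT"
annotations = [(2, 8, "LINE/L1"), (10, 22, "Simple_repeat")]
masked = mask_repeats(seq, annotations)
print(masked)
```

Because each masked segment is replaced by the same number of symbols, the coordinates GENSCAN reports for the masked sequence still refer to the same positions in the original sequence.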



Please address any comments/questions/suggestions to: Chris Burge (cburge@mit.edu)



The graphic above came from the POV-Ray archive.