This directory contains the following 8 files as supplementary data to
the manuscript titled "Computational Inference of Homologous Gene
Structures in the Human Genome" by Ru-Fang Yeh, Lee P. Lim and
Chris B. Burge, 2001. Genome Res. 11: 803-816.

README.GenomeScan2000 -- this file
LIB_GenomeScan2000 -- peptides predicted by GenomeScan in the
    Sep 5, 2000 version of the GoldenPath (oo18) assembled by J. Kent
    [http://genome.ucsc.org]
LIB_FinishGene.fa.masked -- FASTA file of the 194 masked sequences of
    the FinishGene dataset
LIB_DraftGene.fa.masked -- FASTA file of the 361 masked
    exon-containing sequences of the DraftGene dataset
LIB_DraftGene.noncoding.fa.masked -- FASTA file of the 677 masked
    intronic and extragenic sequences of the DraftGene dataset
LIB_FinishGene.gff -- the mRNAvsGen cDNA/FinishGene alignment in GFF
    format
LIB_DraftGene.gff -- the mRNAvsGen cDNA/DraftGene alignment in GFF
    format
LST_FinishDraftGene -- the relationship between FinishGene and
    DraftGene

If there are any questions about these data, please let me know
(cburge@mit.edu).

GenomeScan is a new gene prediction program which integrates the
results of sequence similarity searches, in this case BLASTX versus the
nonredundant protein database, into a model of gene structure and
exon-intron composition similar to that of GenScan (hence the
similarity in names). All of the predicted genes in this set have at
least weak similarity to a known protein. For better or worse, no
mRNA/cDNA or EST data were directly used in the construction of these
datasets.

The GenomeScan algorithm is described in the manuscript cited above
(Yeh, Lim and Burge, 2001. Genome Res. 11: 803-816).

These peptide sets were created as follows.

0. Each finished sequence or GoldenPath sequence was chopped into one
   or more ungapped segments referred to uncreatively as 'contigs'.
   Sequences were chopped at gaps (for our purposes any run of 100 or
   more consecutive Ns represents a gap) or to adhere to our maximum
   practical size limit of 500,000 bp.

A script called 'GenomeScriptQ' was then run on each contig. This
script does the following:

1. Mask sequence with RepeatMasker (A. Smit and P. Green). The
   GoldenPath sequences were masked by V. Pollara with the 09.09.2000
   version of RepeatMasker (in sensitive mode).

2. Run GenScan (C. Burge and S. Karlin) on masked sequence on both
   strands together (default) and on forward and reverse strands
   separately. (A new version of GenScan allows prediction of genes on
   each DNA strand separately - this increases sensitivity a bit.)

3. BLAST all GenScan predicted peptides against the nr database
   (09.21.2000 version) using BLASTP version 2.0.14 with parameters
   -e 1e-5 -v 100 -b 100 (i.e. take top 100 BLAST hits with
   E-value < 0.00001)

4. BLAST unmasked genomic sequence against peptides hit in step 3
   using BLASTX version 2.0.14 with parameters -G 20 -E 3 -e 0.05
   (i.e. with increased gap penalties and reduced stringency)

5. Convert the BLASTX output from step 4 to Genoa format (a Perl
   script provided with the GenomeScan distribution does this - Genoa
   is a simple one-line summary format for input to GenomeScan)

6. Run GenomeScan on masked genomic sequence on both strands with
   parameters -r 10 -i 10 -start 1e6 -stop 1e6 (these parameters will
   be explained in the GenomeScan documentation, available soon)

A sample LIB_GenomeScan2000 peptide is shown below:

>Chr1:ctg12483.Ctg322|82370_bp|35.26_GC|GenomeScan_predicted_peptide_1|CMG|Known:NM_005478|2_ex|135_aa:gi|4885435|ref|NP_005469.1|+:59..135:E=3e-51
MKGSIFTLFLFSVLFAISEVRSKESVRLCGLEYIRTVIYICASSRWRRHLEGIPQAQQAE
TGNSFQLPHKREFSEENPAQNLPKVDASGEDRLWGGQMPTEELWKSKKHSVMSRQDLQTL
CCTDGCSMTDLSALC

In this example, the first field Chr1:ctg12483.Ctg322 indicates that
this peptide comes from ctg12483 (GoldenPath identifier) on chromosome
1, and Ctg322 is our own designation indicating that this peptide comes
from the 322nd contig generated by our chopping procedure (step 0
above).

82370 is the length of the contig in base pairs.

35.26 is the C+G% composition of the contig.

GenomeScan_predicted_peptide_1 is our own designation indicating that
this is the first predicted peptide in the contig.

CMG indicates that this is a 'Complete Multi-exon Gene'. The other
possible entries for this field are:

    SEG = Single-Exon Gene
    IGF = Initial Gene Fragment (initial exon + 0 or more internal
          exons)
    NGF = Internal Gene Fragment (1 or more internal exons - no start
          or stop)
    TGF = Terminal Gene Fragment (0 or more internal exons + terminal
          exon)

Known:NM_005478 indicates that this peptide has a strong blastN hit
(longer than 100 bp with >= 98% identity) to the RefSeq cDNA with
accession number NM_005478, and is therefore listed as a "Known" gene.
The other possible entries for this field are:

    ExprHomol = expressed homologous gene (genes that have strong
                blastN hits longer than 100 bp with >= 97% identity to
                the Oct 2000 version of the human dbEST database)
    Homol     = homologous gene (genes with only protein homology)

2_ex indicates that the predicted gene has 2 exons.

135_aa means that the predicted peptide is 135 amino acids long.

gi|4885435|ref|NP_005469.1|+:59..135:E=3e-51 indicates that the
strongest hit to this predicted gene in the second BLAST (step 4 above)
matched amino acids 59..135 of the target protein (GI number 4885435)
to the genomic contig with E-value 3e-51. In this example NP_005469.1
is the accession number of this protein - the plus sign is our
designation that this protein sequence begins with M. A minus sign in
this location means that the protein does not begin with M (suggesting
that it is not a complete protein).

Caveats
-------

When using these data, please keep in mind that:

- the genome is not complete and may be contaminated by vector, etc.
- the assembly may contain artifactual duplications
- genes may be missed or split into pieces by the prediction process
- genes may be fused together by the prediction process
- some predicted genes may represent artifacts or pseudogenes
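
The header fields of a LIB_GenomeScan2000 entry, as described above, can
be pulled apart with a few string splits. The Python sketch below is not
part of the GenomeScan distribution. It assumes that every header has
exactly the 12 '|'-separated fields of the sample entry, that an
evidence class may appear with or without an accession after the colon,
and that an E-value may be written without a leading digit (e.g.
"e-100"). None of these assumptions has been checked against the full
file. The names PeptideHeader and parse_header are made up for this
example.

```python
from typing import NamedTuple

# Sample header copied from the entry shown in this README.
SAMPLE_HEADER = (
    ">Chr1:ctg12483.Ctg322|82370_bp|35.26_GC|GenomeScan_predicted_peptide_1"
    "|CMG|Known:NM_005478|2_ex|135_aa:gi|4885435|ref|NP_005469.1"
    "|+:59..135:E=3e-51"
)


class PeptideHeader(NamedTuple):
    chromosome: str            # e.g. "Chr1"
    goldenpath_contig: str     # e.g. "ctg12483"
    local_contig: str          # e.g. "Ctg322" (from the chopping step)
    contig_length_bp: int
    gc_percent: float
    peptide_name: str
    gene_type: str             # CMG, SEG, IGF, NGF or TGF
    evidence: str              # Known, ExprHomol or Homol
    evidence_accession: str    # "" if no accession follows the class
    exon_count: int
    peptide_length_aa: int
    hit_gi: str
    hit_accession: str
    hit_starts_with_met: bool  # True for "+", False for "-"
    hit_start: int
    hit_end: int
    hit_evalue: float


def parse_header(line: str) -> PeptideHeader:
    """Parse one header line (hypothetical helper, layout assumed)."""
    fields = line.strip().lstrip(">").split("|")
    if len(fields) != 12:
        raise ValueError(f"expected 12 '|'-separated fields, got {len(fields)}")

    chromosome, _, contig_ids = fields[0].partition(":")
    goldenpath_contig, _, local_contig = contig_ids.partition(".")
    evidence, _, evidence_accession = fields[5].partition(":")
    length_aa, _, _ = fields[7].partition(":")       # "135_aa:gi"
    met_flag, hit_range, evalue = fields[11].split(":")
    hit_start, hit_end = hit_range.split("..")

    evalue_text = evalue.partition("=")[2]           # "E=3e-51" -> "3e-51"
    if evalue_text.startswith("e"):                  # assumed "e-100" style
        evalue_text = "1" + evalue_text

    return PeptideHeader(
        chromosome=chromosome,
        goldenpath_contig=goldenpath_contig,
        local_contig=local_contig,
        contig_length_bp=int(fields[1].split("_")[0]),
        gc_percent=float(fields[2].split("_")[0]),
        peptide_name=fields[3],
        gene_type=fields[4],
        evidence=evidence,
        evidence_accession=evidence_accession,
        exon_count=int(fields[6].split("_")[0]),
        peptide_length_aa=int(length_aa.split("_")[0]),
        hit_gi=fields[8],
        hit_accession=fields[10],
        hit_starts_with_met=(met_flag == "+"),
        hit_start=int(hit_start),
        hit_end=int(hit_end),
        hit_evalue=float(evalue_text),
    )


if __name__ == "__main__":
    print(parse_header(SAMPLE_HEADER))
```

Raising an error on an unexpected field count, rather than guessing, is
deliberate: headers that deviate from the sample layout should be
inspected by hand before being trusted.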
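
The chopping procedure of step 0 can be sketched in a few lines of
Python. This is an illustration only, not the code that produced these
datasets. The gap definition (100 or more consecutive Ns) and the
500,000 bp limit are taken from the description above. How over-long
ungapped stretches were actually divided is not stated in this README,
so cutting them into consecutive fixed-size windows is an assumption, as
are the function name chop_into_contigs and the choice to treat
lower-case n as a gap character.

```python
import re
from typing import List, Tuple


def chop_into_contigs(
    sequence: str,
    min_gap_run: int = 100,          # run of Ns that counts as a gap
    max_contig_len: int = 500_000,   # maximum practical contig size in bp
) -> List[Tuple[int, str]]:
    """Return (0-based offset, subsequence) pairs for each contig.

    Hypothetical helper: splits at every run of at least min_gap_run
    N characters, then cuts any piece longer than max_contig_len into
    consecutive windows (the windowing rule is an assumption).
    """
    gap_pattern = re.compile("[Nn]{%d,}" % min_gap_run)

    # First pass: coordinates of the ungapped pieces between gaps.
    pieces = []
    start = 0
    for gap in gap_pattern.finditer(sequence):
        pieces.append((start, gap.start()))
        start = gap.end()
    pieces.append((start, len(sequence)))

    # Second pass: enforce the size limit; empty pieces yield nothing.
    contigs = []
    for begin, end in pieces:
        for offset in range(begin, end, max_contig_len):
            stop = min(offset + max_contig_len, end)
            contigs.append((offset, sequence[offset:stop]))
    return contigs


if __name__ == "__main__":
    demo = "ACGT" * 10 + "N" * 100 + "G" * 10 + "N" * 99 + "TT"
    for offset, contig in chop_into_contigs(demo):
        print(offset, len(contig))
```

Runs of fewer than 100 Ns stay inside a contig, matching the gap
definition given in step 0.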