Copyright 1998 The New York Times Company
                            March 10, 1998

Interpreting Your Genes' Instructions

The lumbering caterpillar, the strutting peacock, the little girl on a swing -- like all living creatures they are the antithesis of a machine. Yet at the core of their bodies' cells is an instruction tape, written in cold code, as precise and elegant as that in any machine tool or computer.

The task of interpreting the human version of that tape is known as annotation, a process that depends heavily on computer programs with names like BLAST, RepeatMasker and Genscan. The programs apply the best current knowledge about how the human genome is constructed and between them help pinpoint the location of the human genes and their probable roles.

Two curious features of evolution's design are those of non-coding DNA and split genes. Some 97 percent of DNA does not code for genes. It consists of repeated sequences of no known function. The genes that do occur, small islands of sense in an ocean of meaninglessness, are not continuous runs of DNA but are split into an archipelago of separate units known as exons. (The cell trims out the sequence between the exons after it has copied the DNA.)

When the annotators at the Genome Sequencing Center in St. Louis receive a new fragment of DNA, they first run a program called BLAST, which checks to see if any foreign DNA, like that of bacteria, has sneaked into the human sequence. Next comes RepeatMasker, a program that consults a data bank of the known repetitive sequences in human DNA and masks them out. The purpose is to save the gene-hunting program that comes next from wasting time fishing in the non-coding ocean.
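The masking step can be pictured with a toy sketch. This is not how RepeatMasker itself works, and the function name and the exact-match search are assumptions made for illustration; the real program consults a curated data bank and tolerates inexact matches. The sketch only shows the idea of blanking out known repeats so that a later gene-hunting pass ignores them.

```python
def mask_repeats(sequence, known_repeats):
    """Replace every exact occurrence of a known repeat with 'N'.

    Hypothetical helper for illustration only: real repeat masking
    uses approximate matching against a library of repeat families.
    """
    masked = list(sequence)
    for repeat in known_repeats:
        if not repeat:
            continue  # an empty pattern would match everywhere; skip it
        start = sequence.find(repeat)
        while start != -1:
            # Blank out this occurrence, including any overlap with
            # an occurrence that was masked earlier.
            for i in range(start, start + len(repeat)):
                masked[i] = "N"
            start = sequence.find(repeat, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    # Made-up fragment and repeat, not real genomic data.
    print(mask_repeats("ACGTTTTTACG", ["TTTT"]))
```

The masked positions keep their place in the sequence, so coordinates reported by later steps still refer to the original fragment.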

The gene-hunter, known as Genscan, is a program that looks for exons, the component units of genes. However, Genscan finds far more exons than actually exist, so the annotators try to improve its score with a different source of information known as ESTs.

The EST trick is based on the fact that even though genes can hide from Genscan, they cannot evade the processing machinery of the human cell, which knows very well how to find and transcribe the copies it needs. The copies are made in the form of messenger-RNA. The messenger-RNAs can be extracted from living cells and copied back into DNA by the researcher. This complementary DNA, as it is called, can be sequenced in the form of short fragments that are called ESTs, for expressed sequence tags.

The ESTs are immensely valuable in analyzing the genome because as fragments of real genes they can be used to make the actual genes reveal their positions on the DNA. So many ESTs have now been sequenced that most, though probably not all, human genes have been sampled.

Thus if Genscan predicts a set of exons for which no ESTs are known, the annotators generally assume it is a spurious hit. Exons whose sequence matches ESTs are almost certainly part of a true gene.
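The rule of thumb just described amounts to a simple filter, sketched below under stated assumptions. The function name, the representation of exons and EST matches as (start, end) positions on the DNA fragment, and the "any overlap counts as support" test are all hypothetical simplifications; the article does not say how the St. Louis annotators score a match.

```python
def split_by_est_support(predicted_exons, est_hits):
    """Separate predicted exons into EST-supported and unsupported.

    Both arguments are lists of (start, end) positions, with `end`
    exclusive. An exon counts as supported if it overlaps at least
    one EST match. This overlap rule is an assumption for the sketch.
    """
    supported, unsupported = [], []
    for exon_start, exon_end in predicted_exons:
        has_support = any(
            exon_start < hit_end and hit_start < exon_end
            for hit_start, hit_end in est_hits
        )
        if has_support:
            supported.append((exon_start, exon_end))
        else:
            unsupported.append((exon_start, exon_end))
    return supported, unsupported


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    exons = [(100, 200), (500, 600), (900, 950)]
    ests = [(150, 180), (940, 1000)]
    likely_real, likely_spurious = split_by_est_support(exons, ests)
    print("likely real:", likely_real)
    print("likely spurious:", likely_spurious)
```

Because not every human gene has been sampled by an EST, the "unsupported" list in such a scheme would hold probable false alarms rather than proven ones.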

When likely genes are located, the annotators again run BLAST, which searches the data bases for any similar genes in other species. Because of the unity of evolution, most human genes have counterparts in lower organisms, many of which have now been sequenced and their roles ascertained.

Warren Gish, chief annotator at the St. Louis center and a leading author of BLAST, recounted how the center had recently located the human equivalent of the hox genes, master genes that were found in fruit flies to guide the development of the embryo. One set of human hox genes lies in a cluster on chromosome 7, the St. Louis center's chief target.

If the annotators have the most fun, the mappers who work at the initial stage of the sequencing process probably have the most frustration, since theirs is still the least settled of the operations.

The first step in genome analysis is to extract the entire DNA from a source such as sperm. Then the material is broken into reproducible fragments with a class of bacterial enzymes that cleave DNA at specific sequences. Each fragment is amplified by being inserted into a bacterium that multiplies many times.

Each batch of bacteria, with its amplified human DNA fragment, is called a clone. The mappers must select a set of clones with overlapping fragments that between them span the entire length of the target human chromosome. How to tell one clone from another? Hundreds of distinctive short regions of DNA, known as markers, must be generated all along the length of the chromosome. With the markers, the mappers can in principle arrange the clones in the right order, producing what is called a sequence-ready map, a set of clones ready to hand over to the sequencers.
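The ordering problem can be illustrated with a small sketch. It assumes, for simplicity, that the order of the markers along the chromosome is already known and that each clone has been tested to see which markers it carries; the function name and data layout are hypothetical, and real mapping must cope with errors and ambiguous markers. Clones that share a marker are treated as overlapping, and any stretch that no clone bridges is reported as a gap.

```python
def order_clones(clone_markers, marker_order):
    """Order clones along a chromosome and report uncovered stretches.

    clone_markers: dict mapping clone name -> set of marker names
                   (each clone must carry at least one marker).
    marker_order:  list of marker names in chromosome order
                   (assumed known for this sketch).
    Returns (ordered clone names, list of (left, right) marker pairs
    between which no clone provides an overlap).
    """
    position = {marker: i for i, marker in enumerate(marker_order)}

    def span(clone):
        indexes = [position[m] for m in clone_markers[clone]]
        return min(indexes), max(indexes)

    ordered = sorted(clone_markers, key=span)

    gaps = []
    reach = None  # rightmost marker index covered so far
    for clone in ordered:
        low, high = span(clone)
        if reach is not None and low > reach:
            # No shared marker with anything to the left: a gap.
            gaps.append((marker_order[reach], marker_order[low]))
        reach = high if reach is None else max(reach, high)
    return ordered, gaps


if __name__ == "__main__":
    # Made-up clones and markers, for illustration only.
    clones = {"C": {"M5", "M6"}, "A": {"M1", "M2"}, "B": {"M2", "M3"}}
    markers = ["M1", "M2", "M3", "M4", "M5", "M6"]
    print(order_clones(clones, markers))
```

In this made-up example, clones A and B overlap at marker M2, but nothing links B to C, so the stretch between M3 and M5 is reported as a gap.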

In practice, there are many gaps, some caused by a lack of appropriate markers, and some by stretches of DNA that failed to be cloned. Dr. John McPherson, the center's head mapper, said there were 300 gaps in his present map of chromosome 7. He is confident he can close all the gaps in time.

"Every gap we have tried to close we have closed," he said.