As the Human Genome Project continues to decipher the secrets of
          our genetic makeup, scientists around the world are gaining new insight into
          the biology of health and of disease.
          Computational tools are crucial in making the discoveries possible. To streamline the
          retrieval of key information from the ever-growing banks of data, the Center for
          Bioinformatics and Computational Genomics, or CBCG, has been created in the Lab's National
          Energy Research Scientific Computing Division (NERSC). 
          Mapping and sequencing the human genome is an international effort among two dozen
          large centers, including the Department of Energy's Joint Genome Institute. The current
          overall sequencing rate is 150 million base pairs per year. However, to
          fully sequence the three billion base pairs in the human genome by 2003, the participating
          centers will need to sequence two million base pairs per day.
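The gap between those two rates can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the 365-day year and the simplification of starting from zero (ignoring sequence already completed) are assumptions made for illustration.

```python
# Figures quoted in the article.
GENOME_BASE_PAIRS = 3_000_000_000      # size of the human genome
CURRENT_RATE_PER_YEAR = 150_000_000    # current overall sequencing rate
REQUIRED_RATE_PER_DAY = 2_000_000      # rate needed to finish by 2003

# Assumption for this sketch: a 365-day sequencing year.
DAYS_PER_YEAR = 365


def current_rate_per_day():
    """Convert the current yearly rate into base pairs per day."""
    return CURRENT_RATE_PER_YEAR / DAYS_PER_YEAR


def required_speedup():
    """How many times faster the centers must sequence than they do today."""
    return REQUIRED_RATE_PER_DAY / current_rate_per_day()


def days_to_finish(rate_per_day=REQUIRED_RATE_PER_DAY):
    """Days needed to sequence the whole genome at a given daily rate.

    Simplification: starts from zero and ignores sequence already completed.
    """
    return GENOME_BASE_PAIRS / rate_per_day


if __name__ == "__main__":
    print(f"current rate: {current_rate_per_day():,.0f} base pairs per day")
    print(f"required speedup: {required_speedup():.2f}x")
    print(f"days at required rate: {days_to_finish():,.0f}")
    print(f"years at current rate: {days_to_finish(current_rate_per_day()) / DAYS_PER_YEAR:.0f}")
```

Under these assumptions, the centers would need to work nearly five times faster than the current pace, which is the scale of data growth the new center is meant to address.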
          "When we looked at the genomics community, we saw that there will be an explosion
          of data in the next few years," said NERSC Division Director Horst Simon. "We
          also noticed that no one is really ready to handle that amount of data."
          The information contained in the genome data is of immeasurable value to medical
          research, biotechnology, pharmaceuticals, and researchers in fields ranging from
          microorganism metabolism to structural biology. At expected data rates, the sequences
          generated each day for the next five years will represent hundreds of new genes and
          proteins.
          "There have been dramatic changes in the tools used in both biology and
          computation, and we see our job as getting those tools to work in synch so that you can
          use the tools in one area to work for the other," said Manfred Zorn, co-leader of the
          new center. Zorn's Bioinformatics Group was already developing computational tools for
          this work in support of the Human Genome Grand Challenge. That work includes developing
          specialized software modules, designing databases and devising methods for indexing
          genomic information.
          "One of the problems we see is that once we have all this information, each center
          will have warehoused the data in their own computer system, following their own
          scheme," Zorn said. "But for it to be truly useful, we have to have it in an
          accessible format."
          To make the information widely available, scientists at Oak Ridge National Laboratory
          and Berkeley Lab established "The Genome Channel," a web-based resource that
          provides a graphical interface using standard annotation for all genome sequences
          completed to date. The Genome Channel allows users to zoom in on a particular chromosome
          and see how much of that chromosome has been sequenced. Users can then drill down into
          each individual sequence and the accompanying annotation.
          However, knowing the sequence of the DNA does not reveal the function of the genes,
          specifically the actions of their protein products: where, when, why, and how the
          proteins act, which is the essence of the biological knowledge required.
          Implicit in the DNA sequence is a protein's three-dimensional topography, which in turn
          determines function. Uncovering this sequence-structure-function relationship is the core
          goal of modern structural biology. Where a gene is present, analyzing the folded
          structure of the accompanying protein can give scientists a clue as to the gene's
          characteristics.
          Inna Dubchak, a computer scientist at CBCG, has developed statistically based models to
          predict fold classes of proteins. Such predictions are useful because they reduce the
          number of possible protein structures, helping scientists zero in faster on those
          of most relevance.
          In developing her method, Dubchak used the Structural Classification of Proteins, or
          SCOP. She then took the physical and chemical properties of the amino acids that make up
          proteins and represented each amino acid by a vector of its properties.
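As a rough illustration of what representing amino acids by property vectors can look like, the sketch below encodes each residue by its membership in one of three hydrophobicity groups and averages those vectors into a fixed-length description of a whole sequence. This is not Dubchak's actual encoding: the three-way grouping, the single property used, and the function names `residue_vector` and `sequence_vector` are all assumptions chosen to keep the example small.

```python
# Illustrative grouping of the 20 standard amino acids (one-letter codes)
# by hydrophobicity. Assumption for this sketch, not taken from the article.
GROUPS = ("polar", "neutral", "hydrophobic")
GROUP_MEMBERS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CVLIMFW",
}


def residue_vector(aa):
    """Encode one amino acid as a one-hot vector over GROUPS."""
    aa = aa.upper()
    if len(aa) != 1:
        raise ValueError(f"expected a single residue, got {aa!r}")
    vec = [1.0 if aa in GROUP_MEMBERS[group] else 0.0 for group in GROUPS]
    if sum(vec) != 1.0:
        raise ValueError(f"unknown residue: {aa!r}")
    return vec


def sequence_vector(sequence):
    """Average the per-residue vectors into one fixed-length vector.

    The result has the same length for any sequence, so it could be fed
    to a classifier that expects a fixed number of inputs.
    """
    if not sequence:
        raise ValueError("empty sequence")
    vectors = [residue_vector(aa) for aa in sequence]
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


if __name__ == "__main__":
    # Fractions of polar, neutral and hydrophobic residues in a short peptide.
    print(sequence_vector("KGAV"))
```

A real encoding would combine several physical and chemical properties and richer statistics than simple averages, but the principle is the same: a sequence of letters becomes a numeric vector a neural network can learn from.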
          The data were then fed into a neural network, which Dubchak trained to recognize SCOP
          fold classes. With training, the neural network learns to recognize correlations between
          sequence and shape. As a result, Dubchak's method can be used to classify protein sequences
          according to certain protein topologies, or folds (i.e., to assign a fold class to a protein
          sequence). Computational biologists can then take these potential folding structures and
          use computers to see how closely they fit with actual proteins, thereby moving closer to
          discoveries to improve health, preserve the environment and increase our understanding of
          genetic makeup.
          Quantitative descriptions of the three-dimensional structures of proteins and other
          biological macromolecules hold significant promise for the pharmaceutical and
          biotechnology industries in the search for effective new drugs with few or no side effects
          and in the effort to understand the mystery of human disease.
          Zorn's group also provides bioinformatics support to the Life Sciences Division and
          works with a group at the UC San Francisco Cancer Center. Initial funding for the center
          will be through the Laboratory Directed Research and Development Program.
          "We want to become the premier provider of bioinformatics expertise and services
          for biologists at Berkeley Lab," said Zorn, who was recently named as the Lab's
          representative on the UC systemwide Life Sciences Informatics Task Force. "We also
          plan to develop partnerships, carrying out research in bioinformatics and working with the
          bioinformatics community in areas such as education, training and standards
          development."
          The center staff also includes Sylvia Spengler, Frank Olken and Donn Davy.
          