As the Human Genome Project continues to decipher the secrets of
our genetic makeup, scientists around the world are gaining new insight into the
biology of health and disease.
Computational tools are crucial to making these discoveries possible. To streamline the
retrieval of key information from the ever-growing banks of data, the Center for
Bioinformatics and Computational Genomics, or CBCG, has been created in the Lab's National
Energy Research Scientific Computing Division (NERSC).
Mapping and sequencing the human genome is an international effort among two dozen
large centers, including the Department of Energy's Joint Genome Institute. The current
overall sequencing rate is 150 million base pairs per year. However, to fully sequence
the three billion base pairs in the human genome by 2003, the participating centers will
need to sequence two million base pairs per day.
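A quick calculation puts those figures side by side. The sketch below uses only the numbers quoted above and assumes, for simplicity, that sequencing runs 365 days a year and that the whole genome remains to be sequenced; the variable names are illustrative.

```python
# Figures quoted in the article.
CURRENT_RATE_PER_YEAR = 150_000_000  # base pairs per year, all centers combined
REQUIRED_RATE_PER_DAY = 2_000_000    # base pairs per day needed to finish by 2003
GENOME_SIZE = 3_000_000_000          # base pairs in the human genome

# Assumption (not from the article): sequencing runs every day of the year.
DAYS_PER_YEAR = 365

# Convert the current yearly rate to a daily rate so the two can be compared.
current_rate_per_day = CURRENT_RATE_PER_YEAR / DAYS_PER_YEAR

# How many times faster the centers would have to work.
speedup_needed = REQUIRED_RATE_PER_DAY / current_rate_per_day

# Time to cover the full genome at each rate, ignoring work already done.
years_at_current_rate = GENOME_SIZE / CURRENT_RATE_PER_YEAR
years_at_required_rate = GENOME_SIZE / (REQUIRED_RATE_PER_DAY * DAYS_PER_YEAR)

print(f"Current rate: about {current_rate_per_day:,.0f} base pairs per day")
print(f"Speedup needed: about {speedup_needed:.1f}x")
print(f"Years at current rate: {years_at_current_rate:.0f}")
print(f"Years at required rate: {years_at_required_rate:.1f}")
```

Under these assumptions, the required pace is roughly five times the current one, which would cut the time needed from about 20 years to about four.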
"When we looked at the genomics community, we saw that there will be an explosion
of data in the next few years," said NERSC Division Director Horst Simon. "We
also noticed that no one is really ready to handle that amount of data."
The information contained in the genome data is of immeasurable value to medical
research, biotechnology and pharmaceuticals, and to researchers in fields ranging from
microorganism metabolism to structural biology. At expected data rates, the sequences
generated each day for the next five years will represent hundreds of new genes and
proteins.
"There have been dramatic changes in the tools used in both biology and
computation, and we see our job as getting those tools to work in synch so that you can
use the tools in one area to work for the other," said Manfred Zorn, co-leader of the
new center. Zorn's Bioinformatics Group was already developing computational tools for
this work in support of the Human Genome Grand Challenge. That work includes developing
specialized software modules, designing databases and developing methods for indexing
genomic information.
"One of the problems we see is that once we have all this information, each center
will have warehoused the data in their own computer system, following their own
scheme," Zorn said. "But for it to be truly useful, we have to have it in an
accessible format."
To make the information widely available, scientists at Oak Ridge National Laboratory
and Berkeley Lab established "The Genome Channel," a web-based resource that
provides a graphical interface using standard annotation for all genome sequences
completed to date. The Genome Channel allows users to zoom in on a particular chromosome
and see how much of that chromosome has been sequenced. Users can then drill down into
each individual sequence and the accompanying annotation.
However, knowing the DNA sequence does not reveal the function of the genes,
specifically the actions of their protein products. Where, when, why, and how the
proteins act is the essence of the biological knowledge required.
Implicit in the DNA sequence is a protein's three-dimensional topography, which in turn
determines function. Uncovering this sequence-structure-function relationship is the core
goal of modern structural biology. Where a gene is present, analyzing the folded
structure of the corresponding protein can give scientists clues to the gene's
characteristics.
Inna Dubchak, a computer scientist at CBCG, has developed statistical models to
predict fold classes of proteins. Such predictions are useful because they reduce the
number of possible protein structures, helping scientists zero in faster on the most
relevant ones.
In developing her method, Dubchak used the Structural Classification of Proteins, or
SCOP. She then took the physical and chemical properties of the amino acids that make up
proteins and represented each amino acid by a vector of its properties.
The data were then fed into a neural network, which Dubchak trained to recognize SCOP
fold classes. With training, the network learns to recognize correlations between
sequence and shape. As a result, Dubchak's method can be used to classify protein
sequences according to certain protein topologies, or folds (i.e., to assign a fold class
to a protein sequence). Computational biologists can then take these potential folding
structures and use computers to see how closely they fit with actual proteins, thereby
moving closer to discoveries that improve health, preserve the environment and increase
our understanding of genetic makeup.
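To make the idea of representing each amino acid by a vector of its properties more concrete, here is a minimal sketch of that encoding step only. It is not Dubchak's implementation: the article does not say which properties she used, so the two chosen here (Kyte-Doolittle hydropathy and approximate side-chain charge at neutral pH) are assumptions, as are the names `PROPERTIES`, `encode_residues` and `summarize`. The neural network itself is not shown.

```python
# Assumed property table: amino acid -> (hydropathy, charge).
# Hydropathy follows the Kyte-Doolittle scale; charge is the approximate
# side-chain charge at neutral pH. These two properties are an illustrative
# choice, not the set used in the work described in the article.
PROPERTIES = {
    "A": (1.8, 0), "R": (-4.5, 1), "N": (-3.5, 0), "D": (-3.5, -1),
    "C": (2.5, 0), "Q": (-3.5, 0), "E": (-3.5, -1), "G": (-0.4, 0),
    "H": (-3.2, 0), "I": (4.5, 0), "L": (3.8, 0), "K": (-3.9, 1),
    "M": (1.9, 0), "F": (2.8, 0), "P": (-1.6, 0), "S": (-0.8, 0),
    "T": (-0.7, 0), "W": (-0.9, 0), "Y": (-1.3, 0), "V": (4.2, 0),
}


def encode_residues(sequence):
    """Return one property vector per amino acid in the sequence."""
    return [PROPERTIES[residue] for residue in sequence.upper()]


def summarize(sequence):
    """Average the per-residue vectors into one fixed-length feature vector.

    A fixed-length vector is what a classifier such as a neural network
    would take as input, regardless of how long the protein is.
    """
    vectors = encode_residues(sequence)
    if not vectors:
        raise ValueError("sequence is empty")
    count = len(vectors)
    width = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / count for i in range(width))


# Hypothetical short peptide, used only to show the calls.
peptide = "MKVLA"
print(encode_residues(peptide))
print(summarize(peptide))
```

Averaging is only one simple way to get a fixed-length input from sequences of different lengths; a real fold-recognition system would likely use richer descriptors.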
Quantitative descriptions of the three-dimensional structures of proteins and other
biological macromolecules hold significant promise for the pharmaceutical and
biotechnology industries in the search for effective new drugs with few or no side effects
and in the effort to understand the mystery of human disease.
Zorn's group also provides bioinformatics support to the Life Sciences Division and
works with a group at the UC San Francisco Cancer Center. Initial funding for the center
will be through the Laboratory Directed Research and Development Program.
"We want to become the premier provider of bioinformatics expertise and services
for biologists at Berkeley Lab," said Zorn, who was recently named the Lab's
representative on the UC systemwide Life Sciences Informatics Task Force. "We also
plan to develop partnerships, carrying out research in bioinformatics and working with the
bioinformatics community in areas such as education, training and standards
development."
The center staff also includes Sylvia Spengler, Frank Olken and Donn Davy.