New Center for Bioinformatics and Computational Genomics at NERSC

January 29, 1999

By Jon Bashor, jbashor@lbl.gov

As the Human Genome Project continues to decipher the secrets of our genetic makeup, scientists around the world are gaining new insight into the biology of health and disease.

Computational tools are crucial to making these discoveries possible. To streamline the retrieval of key information from the ever-growing banks of data, the Center for Bioinformatics and Computational Genomics, or CBCG, has been created in the Lab's National Energy Research Scientific Computing Division (NERSC).

Mapping and sequencing the human genome is an international effort among two dozen large centers, including the Department of Energy's Joint Genome Institute. The current overall sequencing rate is 150 million base pairs per year. However, to fully sequence the three billion base pairs in the human genome by 2003, the participating centers will need to sequence two million base pairs per day.
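The gap between those two rates can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; it assumes a 365-day year and, for simplicity, ignores any base pairs already sequenced, so the results are rough illustrations rather than official project estimates.

```python
# Figures quoted in the article.
GENOME_BP = 3_000_000_000          # base pairs in the human genome
CURRENT_BP_PER_YEAR = 150_000_000  # current overall sequencing rate
TARGET_BP_PER_DAY = 2_000_000      # rate needed to finish by 2003

# Assumption for this sketch: a 365-day year, no prior progress counted.
DAYS_PER_YEAR = 365


def years_to_finish(total_bp, bp_per_day):
    """Return the years needed to sequence total_bp at a constant daily rate."""
    return total_bp / bp_per_day / DAYS_PER_YEAR


current_bp_per_day = CURRENT_BP_PER_YEAR / DAYS_PER_YEAR
years_at_current = GENOME_BP / CURRENT_BP_PER_YEAR
years_at_target = years_to_finish(GENOME_BP, TARGET_BP_PER_DAY)
speedup = TARGET_BP_PER_DAY / current_bp_per_day

print(f"Years at current rate: {years_at_current:.1f}")
print(f"Years at target rate:  {years_at_target:.1f}")
print(f"Required speedup:      {speedup:.1f}x")
```

Under these assumptions, the current pace would take two decades, while two million base pairs per day brings the total to a little over four years, which is roughly a fivefold increase in throughput.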

"When we looked at the genomics community, we saw that there will be an explosion of data in the next few years," said NERSC Division Director Horst Simon. "We also noticed that no one is really ready to handle that amount of data."

The information contained in the genome data is of immeasurable value to medical research, biotechnology, pharmaceuticals, and researchers in fields ranging from microorganism metabolism to structural biology. At expected data rates, the sequences generated each day for the next five years will represent hundreds of new genes and proteins.

"There have been dramatic changes in the tools used in both biology and computation, and we see our job as getting those tools to work in sync so that you can use the tools in one area to work for the other," said Manfred Zorn, co-leader of the new center. Zorn's Bioinformatics Group was already developing computational tools for this work in support of the Human Genome Grand Challenge. That work includes developing specialized software modules, designing databases and devising methods for indexing genomic information.

"One of the problems we see is that once we have all this information, each center will have warehoused the data in their own computer system, following their own scheme," Zorn said. "But for it to be truly useful, we have to have it in an accessible format."

To make the information widely available, scientists at Oak Ridge National Laboratory and Berkeley Lab established "The Genome Channel," a web-based resource that provides a graphical interface using standard annotation for all genome sequences completed to date. The Genome Channel allows users to zoom in on a particular chromosome and see how much of that chromosome has been sequenced. Users can then drill down into each individual sequence and its accompanying annotation.

However, knowing the sequence of the DNA does not indicate the function of the genes - specifically, the actions of their protein products: where, when, why, and how the proteins act, the essence of the biological knowledge required.

Implicit in the DNA sequence is a protein's three-dimensional topography, which in turn determines function. Uncovering this sequence-structure-function relationship is the core goal of modern structural biology. Where a gene is present, analyzing the folded structure of the protein it encodes can give scientists a clue as to the gene's characteristics.

Inna Dubchak, a computer scientist at CBCG, has developed statistically based models to predict fold classes of proteins. Such predictions are useful because they reduce the number of possible protein structures, helping scientists zero in faster on those of most relevance.

In developing her method, Dubchak used the Structural Classification of Proteins, or SCOP. She then took the physical and chemical properties of the amino acids that make up proteins and represented each amino acid by a vector of its properties.

The data were then fed into a neural network, which Dubchak trained to recognize SCOP fold classes. Over time, the network learns to recognize correlations between sequence and shape. As a result, Dubchak's method can be used to classify protein sequences according to certain protein topologies, or folds (i.e., to assign a fold class to a protein sequence). Computational biologists can then take these potential folding structures and use computers to see how closely they fit with actual proteins, thereby moving closer to discoveries to improve health, preserve the environment and increase our understanding of genetic makeup.
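The idea of representing each amino acid by a vector of its properties can be illustrated with a short sketch. It is not Dubchak's actual feature set, which the article does not spell out. As stand-in assumptions, it uses two properties per residue (the widely published Kyte-Doolittle hydropathy index and a nominal charge at neutral pH) and averages them over a sequence to produce a fixed-length input of the kind a classifier such as a neural network could consume. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Nominal side-chain charge at neutral pH (simplifying assumption:
# every residue not listed here is treated as uncharged).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def residue_vector(aa):
    """Return an illustrative (hydropathy, charge) vector for one residue."""
    return (HYDROPATHY[aa], CHARGE.get(aa, 0))


def encode_sequence(seq):
    """Average the per-residue vectors into one fixed-length feature vector."""
    if not seq:
        raise ValueError("sequence must contain at least one residue")
    vectors = [residue_vector(aa) for aa in seq.upper()]
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(2))


# A short made-up peptide, used only to show the call.
features = encode_sequence("MKVLA")
print(features)
```

Averaging is only one simple way to turn sequences of different lengths into inputs of the same size; a real fold-recognition model would use richer descriptors and would be trained on sequences labeled with their SCOP fold classes.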

Quantitative descriptions of the three-dimensional structures of proteins and other biological macromolecules hold significant promise for the pharmaceutical and biotechnology industries in the search for effective new drugs with few or no side effects and in the effort to understand the mystery of human disease.

Zorn's group also provides bioinformatics support to the Life Sciences Division and works with a group at the UC San Francisco Cancer Center. Initial funding for the center will be through the Laboratory Directed Research and Development Program.

"We want to become the premier provider of bioinformatics expertise and services for biologists at Berkeley Lab," said Zorn, who was recently named as the Lab's representative on the UC systemwide Life Sciences Informatics Task Force. "We also plan to develop partnerships, carrying out research in bioinformatics and working with the bioinformatics community in areas such as education, training and standards development."

The center staff also includes Sylvia Spengler, Frank Olken and Donn Davy.

The staff of the new Center for Bioinformatics and Computational Genomics. Standing, left to right: David Demirjian, Inna Dubchak, Manfred Zorn, Donn Davy, Denise Wolf, Janice Mann, Sylvia Spengler. Kneeling: Igor Dralyuk, Mischelle Merritt.
Photo by Roy Kaltschmidt. (XBD9901-00199)
