Harvesting Genetic Data: LBL Computer Scientists work on Plant Genome Databases

Summer 1993

LBL computer scientists are working on plant genome databases that promise to yield a rich store of information

By Lynn Yarris, LCYarris@lbl.gov

The Human Genome Project, in which LBL researchers are playing an important role, has received a considerable amount of public attention. There is, however, another national genome project that has received relatively little attention, even though it too offers potentially enormous dividends for all of humanity. This effort is the plant genome project of the U.S. Department of Agriculture (USDA), in which LBL researchers are also playing an important role.

As the Earth's population continues to grow, the need for new varieties of crops that are hardier, higher-yielding, and more nutritious has never been greater. Creating plant varieties with desired characteristics is an age-old practice, but agricultural scientists are now beginning to replace traditional methods of selective breeding with modern genetic engineering techniques. What used to be a trial-and-error approach, involving many generations of plants over a period of years, promises to become a much more controlled and rapid procedure.

"We are seeing a real data explosion from plant genetics research," says John McCarthy, a computer scientist in LBL's Information and Computing Sciences Division. "There is a critical need to collect, organize, and integrate this wide variety of new data as soon as possible."

McCarthy has been a key player in developing databases for LBL's Human Genome Center. In the spring of 1991, he was approached by USDA about adapting LBL human genome database technology for use with plant genome data.

Using database design tools developed by LBL's data management group under Victor Markowitz, McCarthy and other members of ICSD's genome computing group began working with USDA-sponsored researchers. Their goal was to develop databases for the genomes of wheat (the world's most important food crop), soybean (the nation's largest source of vegetable oil), and forest trees.

About a half-year into the project, McCarthy and his colleagues elected to work with ACEDB (A Caenorhabditis Elegans Data Base). This is a system developed in 1991 for the international nematode genome project by an English biologist, Richard Durbin, and a French physicist, Jean Thierry-Mieg (who worked with the theoretical physics group at LBL in 1983-84).

"ACEDB has many of the capabilities our genome databases need beyond traditional relational systems," says McCarthy. "Its open architecture enables groups like ours to collaborate in it ongoing development."

The versatility of ACEDB had also been demonstrated when it was successfully adapted for a database on Arabidopsis thaliana, a type of mustard that is a model organism for plant biologists, comparable to animal models like C. elegans and the fruit fly (Drosophila). ACEDB is also being used for the joint UC Berkeley-LBL Drosophila physical mapping project.

"Durbin and Thierry-Mieg provide ACEDB's source code free and welcome collaborators," explains McCarthy. "Therefore, at LBL we've been able to make major contributions to design and development of new display modules and extensions to the database core."

Despite its origin as a genome database for a specific organism, ACEDB is a general-purpose, object-oriented, hypertext system, according to McCarthy. Information is presented in multiple window displays containing text, diagrams, and pictures. Scientists can browse through data using a computer mouse and click on components such as gene names to bring up detail windows.

"An object-oriented approach simplifies database design, maintenance and user interface development, which in turn makes it easier to collect and upload data in a rapidly changing research environment," McCarthy says. "This helps move information much more quickly from lab notebooks into public archives."

ACEDB itself runs on Unix systems, but through software such as the X Window System it can be used over networks from Macintosh and IBM PC-compatible computers as well as Unix workstations. McCarthy reports that versions of ACEDB for Macintosh and IBM PC systems are currently being developed.

"The ACEDB and plant genome projects have demonstrated how much high-capacity computer networks can facilitate rapid development and international collaboration," says McCarthy. "Computer scientists and biologists at LBL use electronic mail to instantaneously to exchange questions, ideas, data, and code on a daily basis with colleagues in England, France, Germany, Japan, and throughout the United States."

In less than a year, McCarthy and his LBL colleagues have been able to help plant genome researchers design, implement, test, and transfer individual database operations to remote computers at Iowa State University (SoyBase), Cornell University (GrainGenes), and the USDA Regional Laboratory in Albany, Calif. (Dendrome). Researchers connected to the Internet can run these databases from computers anywhere in the world. The databases have also been uploaded into the Plant Genome Database of USDA's National Agricultural Library in Beltsville, Md., which means they can now be accessed by telephone modem as well.

"It has really been enjoyable to work with the plant genome and ACEDB database projects," says McCarthy. "The strong cooperation between different collaborating groups has brought us a long way in a short amount of time."