China Clipper Project to Enhance Access to Information

July 24, 1998

By Jon Bashor, jbashor@lbl.gov

A newly funded computer research program at Berkeley Lab could revolutionize the way scientific instruments, computers and humans work together to gather, analyze and use data. Funded by the U.S. Department of Energy, the program will build on efforts made over the past 10 years to gather and store information and make it available over computer networks. The program is called "China Clipper" in reference to the 1930s commercial air service that spanned the Pacific Ocean and opened the door to today's global air travel.

"I believe that our China Clipper project epitomizes the research environment we will see in the future," says Bill Johnston, leader of the Lab's Imaging and Distributed Computing Group. "It will provide an excellent model for online scientific instrumentation. Data are fundamental to analytical science, and one of my professional goals is to greatly improve the routine access to scientific data -- especially very large datasets -- by widely distributed collaborators, and to facilitate its routine computer analysis."

The idea behind China Clipper, like the pioneering air service, is to bring diverse resources closer together. In this case, scientific instruments such as electron microscopes and accelerators would be linked by networks to data storage "caches" and computers. China Clipper will provide the "middleware" to allow these separate components, often located hundreds or thousands of miles apart, to function as a single system. Johnston is scheduled to discuss the Lab's work in this area next week at an IEEE symposium on High Performance Distributed Computing.

Data intensive computing

Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems involve aggregating and scheduling many resources. For example, data must be located and staged, and cache and network capacity must be available at the same time as computing capacity.

Every aspect of such a system is dynamic: locating and scheduling resources, adapting running application systems to availability and congestion in the middleware and infrastructure, and responding to human interaction. The technologies, services and architectures used to build useful high-speed, wide area distributed systems constitute the field of data intensive computing.
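
To make the co-scheduling problem concrete, here is a minimal sketch in Python. All names are illustrative, not part of China Clipper itself: a data-intensive job can proceed only in a time window where the cache, the network and the compute resource are all free at once.

    from dataclasses import dataclass

    @dataclass
    class Window:
        start: float  # beginning of a resource's free period (hours)
        end: float    # end of that free period (hours)

    def common_window(windows):
        """Return the overlap shared by all windows, or None if the
        resources are never simultaneously available."""
        start = max(w.start for w in windows)
        end = min(w.end for w in windows)
        return (start, end) if start < end else None

    # Each resource advertises when it is free; the job runs only
    # where all three periods overlap.
    cache_free   = Window(2.0, 8.0)
    network_free = Window(4.0, 10.0)
    compute_free = Window(0.0, 6.0)

    slot = common_window([cache_free, network_free, compute_free])
    if slot:
        print(f"co-scheduled window: {slot[0]}h to {slot[1]}h")  # 4.0h to 6.0h
    else:
        print("no common window; the job must wait or renegotiate")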

Enhancing data intensive computing will make research facilities and instruments at various DOE sites available to a wider group of users. Berkeley Lab scientists are developing China Clipper in collaboration with their counterparts at the Stanford Linear Accelerator Center, Argonne National Laboratory, and DOE's Energy Sciences Network, or ESnet.

"This will lead to a substantial increase in the capabilities of experimental facilities," predicts Johnston.

Faster turnaround of information

As an example of the benefits, Johnston cites a project called "WALDO" (Wide Area Large Data Object), which gives physicians immediate access to patients' medical images. Johnston's group -- together with Pacific Bell, Livermore's NTON optical network testbed project and others -- worked with Kaiser Permanente to produce a prototype online, distributed, high-data-rate medical imaging system. The project allowed cardio-angiography data to be collected directly from a scanner in a San Francisco hospital. The system was connected to a high-speed Bay Area network, allowing data to be collected, processed and stored at Berkeley Lab and accessed by cardiologists at the Kaiser Oakland hospital. Currently such images are processed and kept at a central office, and it can take weeks for doctors to see one or two images. With WALDO's real-time acquisition and cataloguing approach, cardiologists had access within a few hours.

Better research

This work is guided by the vision that faster access to data will allow scientists to conduct their work more efficiently and gain new insights. Research often starts with a scientific model. Scientists then conduct an experiment and compare the actual results with what the model predicted. Understanding the resulting differences is where the real science happens, Johnston says. China Clipper is expected to lead to better utilization of experimental instruments and to rapid comparison of experimental results with computational models. Streamlining this test-and-compare cycle could significantly increase the rate of scientific discovery.

Evolution of an idea

China Clipper is the culmination of a decade of research and development in high-speed, wide area, data intensive computing. The first demonstration of the project's potential, Johnston said, came during 1989 hearings held by then-Senator Al Gore on his High Performance Computing and Communications legislation. Because the Senate hearing room had no network connections at the time, the group assembled a simulated transmission of images over networks running at various speeds. The successful demonstration introduced legislators to the implications of network bandwidth.

Johnston's group continued its work, evolving from scientific visualization to the idea of operating scientific instruments online. This work is led by Bahram Parvin in collaboration with the Lab's Materials Sciences and Life Sciences Divisions. Last year, several group members patented their system, which provides automatic computerized control of microscope experiments. The system collects video data, analyzes it and then sends signals to the instruments to carry out such delicate tasks as cleaving DNA molecules and controlling the shape of growing microcrystals.
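
The paragraph above describes a closed feedback loop: acquire a frame, analyze it, send a correction to the instrument. As a rough illustration only -- the functions below are hypothetical stand-ins, not the patented system -- such a loop might look like this in Python:

    import random

    def acquire_frame():
        # Stand-in for grabbing a video frame from the microscope;
        # here it just returns a pretend feature position in [0, 1).
        return random.random()

    def analyze(frame, target=0.5):
        # Compute an error signal: how far the observed feature
        # sits from where the experiment wants it.
        return frame - target

    def actuate(error, gain=0.8):
        # Send a proportional correction to the instrument
        # (here, simply printed).
        print(f"correction sent to instrument: {-gain * error:+.3f}")

    for _ in range(3):  # three passes through the control loop
        actuate(analyze(acquire_frame()))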

One key component of successful data-intensive computing -- accessing data cached at various sites -- was developed at Berkeley Lab. Called the Distributed-Parallel Storage System, or DPSS, the technology provides an economical, high-performance and highly scalable design for caching large amounts of data for use by many different users. Brian Tierney continues this project with his team in NERSC's Future Technologies Group.
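
The core idea behind such a cache can be sketched briefly. The fragment below illustrates the general technique of striping a dataset's blocks across several cache servers and fetching them in parallel; it is not the actual DPSS interface, and the server names are invented.

    from concurrent.futures import ThreadPoolExecutor

    SERVERS = ["cache-a", "cache-b", "cache-c"]  # hypothetical cache hosts

    def fetch_block(server, block_id):
        # Stand-in for a network read of one block from one server.
        return f"data<block {block_id} @ {server}>"

    def read_object(num_blocks):
        """Fetch an object whose blocks are striped round-robin
        across the cache servers, reading blocks concurrently."""
        with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
            futures = [
                pool.submit(fetch_block, SERVERS[i % len(SERVERS)], i)
                for i in range(num_blocks)
            ]
            return [f.result() for f in futures]

    print(read_object(6))

Because each block can come from a different server, aggregate bandwidth grows with the number of caches, which is what makes the design scalable.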

In May, a team from Berkeley Lab and SLAC conducted an experiment using DPSS to support high energy physics data analysis. The team achieved a sustained data transfer rate of 57 MBytes per second, demonstrating that high-speed data storage systems could use distributed caches to make data available to systems running analysis codes.

Overcoming hurdles

With the development of the various components necessary for data intensive computing, the number of obstacles has dwindled. One of the last remaining issues -- scheduling and allocating resources over networks -- is being addressed by "differentiated services." This technology, which grew out of work by Van Jacobson's Network Research Group, marks certain data packets for priority service as they move across networks. A demonstration by Berkeley Lab in April showed that priority-marked packets arrived at eight times the speed of regular packets when sent through congested network connections. Differentiated services would let designated projects reserve sufficient network resources to guarantee they can proceed.
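
In practice, marking a packet for priority amounts to setting bits in the IP header's type-of-service byte, which an application can do through a standard socket option. The sketch below shows the idea in Python; the code point chosen is illustrative and may differ from what the 1998 tests used, and the IP_TOS option is platform-dependent (it works on Linux).

    import socket

    EF_DSCP = 46          # "Expedited Forwarding" code point (illustrative)
    tos = EF_DSCP << 2    # the DSCP occupies the top six bits of the byte

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    sock.sendto(b"priority payload", ("127.0.0.1", 9999))
    sock.close()

    # Routers configured for differentiated services read this marking
    # and queue the packet ahead of unmarked, best-effort traffic.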

The next big step, says Johnston, is to integrate the various components and technologies into a cohesive and reliable package -- a set of "middleware services" that allows applications to easily use these new capabilities.

"We see China Clipper not so much as a system, but as a coordinated collection of services that may be flexibly used for a variety of applications," says Johnston. "Once it takes off, we see it opening new routes and opportunities for scientific discovery."

For more information see http://www-itg.lbl.gov/WALDO/, http://www-itg.lbl.gov/DPSS/, and (for papers) http://www-itg.lbl.gov/~johnston/.
