A newly funded computer research program at
Berkeley Lab could revolutionize the way scientific instruments, computers and humans work
together to gather, analyze and use data. Funded by the U.S. Department of Energy, the
program will build on efforts made over the past 10 years to gather and store information
and make it available over computer networks. The program is called "China
Clipper" in reference to the 1930s commercial air service which spanned the Pacific
Ocean and opened the door to today's global air service.
"I believe that our China Clipper project epitomizes the research environment we
will see in the future," says Bill Johnston, leader of the Lab's Imaging and
Distributed Computing Group. "It will provide an excellent model for online
scientific instrumentation. Data are fundamental to analytical science, and one of my
professional goals is to greatly improve the routine access to scientific data --
especially very large datasets -- by widely distributed collaborators, and to facilitate
its routine computer analysis."
The idea behind China Clipper, like the pioneering air service, is to bring diverse
resources closer together. In this case, scientific instruments such as electron
microscopes and accelerators would be linked by networks to data storage
"caches" and computers. China Clipper will provide the "middleware" to
allow these separate components, often located hundreds or thousands of miles apart, to
function as a single system. Johnston is scheduled to discuss the Lab's work in this area
next week at an IEEE symposium on High Performance Distributed Computing.
Data intensive computing
Modern scientific computing involves organizing, moving, visualizing, and analyzing
massive amounts of data from around the world, as well as employing large-scale
computation. The distributed systems that solve large-scale problems involve aggregating
and scheduling many resources. For example, data must be located and staged, and cache and
network capacity must be available at the same time as computing capacity.
Every aspect of such a system is dynamic: locating and scheduling resources, adapting
running application systems to availability and congestion in the middleware and
infrastructure, and responding to human interaction. The technologies, services and
architectures used to build useful high-speed, wide area distributed systems constitute
the field of data intensive computing.
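To make the co-scheduling constraint concrete, the sketch below shows a toy admission check in which a job is accepted only when staged data, cache space, network bandwidth and compute time can all be held for the same window. The class names and capacity figures are illustrative assumptions, not part of any actual China Clipper interface.

```python
# Toy sketch of the co-scheduling problem: a data-intensive job runs only when
# its data are staged AND cache, network and compute capacity are all free at
# the same time. All names and numbers here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    dataset: str        # logical dataset to locate and stage
    cache_gb: int       # disk cache to reserve near the compute site
    network_mbps: int   # sustained bandwidth needed while the job runs
    cpu_hours: float    # compute time requested in the same window

class CoScheduler:
    def __init__(self, cache_gb, network_mbps, cpu_hours):
        self.free = {"cache_gb": cache_gb,
                     "network_mbps": network_mbps,
                     "cpu_hours": cpu_hours}

    def reserve(self, req: ResourceRequest) -> bool:
        need = {"cache_gb": req.cache_gb,
                "network_mbps": req.network_mbps,
                "cpu_hours": req.cpu_hours}
        # All-or-nothing: a staged dataset with no compute slot (or vice
        # versa) is wasted effort, so either every resource is available
        # together or the job waits.
        if all(self.free[k] >= v for k, v in need.items()):
            for k, v in need.items():
                self.free[k] -= v
            return True
        return False

if __name__ == "__main__":
    sched = CoScheduler(cache_gb=1000, network_mbps=622, cpu_hours=100)
    job = ResourceRequest("hep-run-42", cache_gb=500, network_mbps=155, cpu_hours=12)
    print("job admitted" if sched.reserve(job) else "job deferred")
```

A real middleware layer would also handle resource discovery, reservations in time, and adaptation to congestion, which is the ground China Clipper is meant to cover.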
Enhancing data intensive computing will make research facilities and instruments at
various DOE sites available to a wider group of users. Berkeley Lab scientists are
developing China Clipper in collaboration with their counterparts at the Stanford Linear
Accelerator Center, Argonne National Laboratory, and DOE's Energy Sciences Network, or
ESnet.
"This will lead to a substantial increase in the capabilities of experimental
facilities," predicts Johnston.
Faster turnaround of information
As an example of benefits, Johnston cites a project called "WALDO" (Wide Area
Large Data Object) which makes it possible for physicians to have immediate access to
patients' medical images. Johnston's group -- together with Pacific Bell, Livermore's NTON
optical network testbed project and others -- worked with Kaiser Permanente to produce a
prototype online, distributed, high-data-rate medical imaging system. The project allowed
cardio-angiography data to be collected directly from a scanner in a San Francisco
hospital. The system was connected to a high-speed Bay Area network, allowing the data to be
processed and stored at Berkeley Lab and accessed by cardiologists at the
Kaiser Oakland hospital. Currently such images are processed and kept at a central office,
and it can take weeks for doctors to see one or two images. With the WALDO real-time
acquisition and cataloguing approach, doctors had access within a few hours.
Better research
This work is guided by the vision that faster access to data will allow scientists to
conduct their work more efficiently and gain new insights. Research often starts out with
a scientific model. Scientists then conduct an experiment and compare the actual results
with what was expected. Understanding the resulting differences is where the real science
happens, Johnston says. China Clipper is expected to lead to better utilization of
instrumentation for experiments and faster comparisons between experimental results and
computational models. Streamlining the test-and-compare process could significantly
increase the rate of scientific discovery.
Evolution of an idea
China Clipper is the culmination of a decade of research and development of high-speed,
wide area, data intensive computing. The first demonstration of the project's potential,
Johnston said, came during 1989 hearings held by then-Senator Al Gore on his High
Performance Computing and Communications legislation. Because the Senate room had no
network connections at the time, a simulated transmission of images over a network at
various speeds was put together. The successful effort introduced legislators to the
implications of network bandwidth.
Johnston's group continued its work, evolving from scientific visualization to the idea
of operating scientific instruments online. This work is led by Bahram Parvin in
collaboration with the Lab's Material Sciences and Life Sciences Divisions. Last year,
several group members patented their system, which provides automatic computerized control
of microscopic experiments. The system collects video data, analyzes it and then sends a
signal to the instruments to carry out such delicate tasks as cleaving DNA molecules and
controlling the shape of growing microcrystals.
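Schematically, that kind of automated control is a closed loop: acquire a video frame, analyze it, and send a corrective command back to the instrument before the next frame arrives. The sketch below only illustrates the loop; every function in it is a hypothetical placeholder, not the patented system.

```python
# Schematic of a closed-loop instrument controller: acquire a frame, analyze
# it, send a correction, repeat. All functions are hypothetical placeholders.
import time

def acquire_frame():
    # Stand-in for grabbing a video frame from the microscope camera.
    return [[0.0] * 8 for _ in range(8)]

def analyze(frame):
    # Stand-in analysis: e.g. estimate how far a growing crystal's edge has
    # drifted from its target position.
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))

def send_correction(error):
    # Stand-in for commanding the instrument (stage move, beam adjustment...).
    print(f"correction sent for error={error:+.3f}")

def control_loop(cycles=5, period_s=0.1):
    for _ in range(cycles):
        error = analyze(acquire_frame())
        send_correction(error)   # the instrument reacts before the next frame
        time.sleep(period_s)

if __name__ == "__main__":
    control_loop()
```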
One key aspect of successful data-intensive computing -- accessing data cached at
various sites -- was developed by Berkeley Lab. Called Distributed-Parallel Storage
System, or DPSS, this technology successfully provided an economical, high performance and
highly scalable design for caching large amounts of data for use by many different users.
Brian Tierney continues this project with his team in NERSC's Future Technologies Group.
In May, a team from Berkeley Lab and SLAC conducted an experiment using DPSS to support
high energy physics data analysis. The team achieved a sustained data transfer rate of 57
MBytes per second, demonstrating that high-speed data storage systems could use
distributed caches to make data available to systems running analysis codes.
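The principle behind such a distributed cache can be illustrated with a short sketch: a large dataset is striped into blocks across several cache servers, and the client reads from all of them in parallel, so aggregate bandwidth grows with the number of hosts. The host names and the fetch routine below are hypothetical stand-ins and do not reflect the actual DPSS interface.

```python
# Simplified sketch of striped, parallel reads from a distributed block cache.
# Host names and fetch_block are hypothetical, not the DPSS API.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024   # bytes per block (illustrative)
SERVERS = ["cache0.example.gov", "cache1.example.gov", "cache2.example.gov"]

def fetch_block(server: str, block_id: int) -> bytes:
    # Placeholder for a network read; a real client would request block_id
    # from `server` and receive BLOCK_SIZE bytes back.
    return bytes(BLOCK_SIZE)

def read_striped(num_blocks: int) -> bytes:
    # Block i lives on server i mod N, so reads spread evenly across the
    # caches and aggregate bandwidth scales with the number of hosts.
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        futures = [pool.submit(fetch_block, SERVERS[i % len(SERVERS)], i)
                   for i in range(num_blocks)]
        return b"".join(f.result() for f in futures)

if __name__ == "__main__":
    data = read_striped(num_blocks=16)
    print(f"read {len(data)} bytes from {len(SERVERS)} caches in parallel")
```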
Overcoming hurdles
With the development of various components necessary for data intensive computing, the
number of obstacles has dwindled. One of the last remaining issues -- scheduling and
allocating resources over networks -- is being addressed by "differentiated
services." This technology, resulting from work by Van Jacobson's Network Research
Group, marks some data packets for priority service as they move across networks. A
demonstration by Berkeley Lab in April showed that priority-marked packets arrived at
eight times the speed of regular packets when sent through congested network connections.
Differentiated services would ensure that designated projects can proceed by
reserving sufficient network resources for them.
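In practice, an application marks its own traffic by setting the type-of-service (DSCP) field in the IP header of the packets it sends; routers configured for differentiated services then queue those packets ahead of best-effort traffic. The snippet below is a minimal sketch assuming a platform that exposes the IP_TOS socket option; the code point and destination address are only examples, and the marking helps only if the routers along the path honor it.

```python
# Minimal sketch: mark a socket's outgoing packets with a DiffServ code point
# so cooperating routers can give them priority. Assumes a platform that
# exposes IP_TOS; the code point and destination here are illustrative.
import socket

EF_DSCP = 46               # "expedited forwarding" code point
TOS_VALUE = EF_DSCP << 2   # DSCP occupies the top six bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)

# Every datagram sent on this socket now carries the priority marking;
# whether it is actually expedited depends on the network honoring it.
sock.sendto(b"priority data", ("203.0.113.10", 9000))
```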
The next big step, says Johnston, is to integrate the various components and
technologies into a cohesive and reliable package -- a set of "middleware
services" that allows applications to easily use these new capabilities.
"We see China Clipper not so much as a system, but as a coordinated collection of
services that may be flexibly used for a variety of applications," says Johnston.
"Once it takes off, we see it opening new routes and opportunities for scientific
discovery."
For more information see http://www-itg.lbl.gov/WALDO/,
http://www-itg.lbl.gov/DPSS/, and (for papers) http://www-itg.lbl.gov/~johnston/.