Cray Inc. Tests Red Storm Systems Software at NERSC Center

November 26, 2003

"Red Storm" is the name of a massively-parallel-processing supercomputer that Cray Inc. and the Department of Energy's Sandia National Laboratories are developing for the Advanced Simulation and Computing program (ASCI) of DOE's National Nuclear Security Administration (NNSA).

Recently a team of experts in high-performance computing from Cray and DOE's National Energy Research Scientific Computing (NERSC) Center completed initial scalability tests of Red Storm's operating system and message-passing functions. The successful effort paves the way for the next stage, testing input/output (I/O) for two potential Red Storm file systems.

Under a $90 million, multiyear contract announced last year, Cray will collaborate with Sandia to develop and deliver Red Storm, which is expected to become operational in 2004. Cray will deliver a system with a theoretical peak performance of 40 trillion calculations per second (40 teraflop/s). In late October 2003, Cray announced that it also plans to sell systems based on Red Storm to other customers.

Red Storm will be located at Sandia, a multiprogram laboratory managed by the Lockheed Martin Corporation for the NNSA. When executing actual defense problems, this new system is expected to be at least seven times more powerful than Sandia's current ASCI Red supercomputer, the first supercomputer delivered under the ASCI program.

A virtual supercomputer

For the Red Storm tests, the NERSC Center, based at Lawrence Berkeley National Laboratory, provided Cray researchers with access to "Alvarez," a 174-processor Linux cluster of 87 dual-processor nodes named in honor of Berkeley Lab physicist and Nobel Laureate Luis Alvarez.

"Red Storm's Linux-based software allows simulation of multiple virtual processors per physical processor, and using this we ran simulations of up to 1,000 processors on the Alvarez machine," explained Gail Alverson, Cray operating systems manager, when the tests were complete. "By running several Alvarez log-in processors as Red Storm log-in processors, and the Alvarez compute processors as Red Storm compute processors — using our Linux based IA32 versions of the compute-processor software — we could run layered on the existing Linux running on the machine. Not having to reboot the nodes was convenient for both Cray and NERSC."
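To give a sense of the oversubscription involved, the arithmetic behind a 1,000-processor simulation on a 174-processor cluster can be sketched as follows. This is only an illustration: it assumes all 174 Alvarez processors act as compute processors (the article notes that some served as log-in processors), and the round-robin placement and the function name are hypothetical, not a description of how Cray's software actually assigns virtual processors.

```python
from collections import Counter

PHYSICAL_PROCESSORS = 174   # from the article: 87 dual-processor nodes
VIRTUAL_PROCESSORS = 1000   # from the article: largest simulated run


def round_robin_placement(n_virtual, n_physical):
    """Hypothetical placement: virtual processor v runs on physical processor v mod n_physical."""
    return [v % n_physical for v in range(n_virtual)]


placement = round_robin_placement(VIRTUAL_PROCESSORS, PHYSICAL_PROCESSORS)

# Count how many virtual processors land on each physical processor.
load = Counter(placement)
max_load = max(load.values())
min_load = min(load.values())
average_load = VIRTUAL_PROCESSORS / PHYSICAL_PROCESSORS

print(f"virtual processors per physical processor: "
      f"min {min_load}, max {max_load}, average {average_load:.2f}")
```

Under these assumptions each physical processor hosts five or six virtual processors, which is why such runs are useful for checking software functionality at scale but not for measuring absolute performance.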

One of NERSC's motivations for obtaining the Alvarez cluster was to assess various technologies for future high-performance computing systems. "The NERSC Center has long been a leader in testing and deploying leading-edge systems and this collaborative effort with Cray is an extension of our efforts to provide the DOE scientific community with systems to advance scientific research," said Bill Kramer, general manager of the NERSC Center.

Significant advances

Using Alvarez about one day a week for two months, the Cray team reported making advances in two major areas: system software and administration, and system scalability.

In the area of system configuration and administration, the Alvarez runs were some of the first made with all of the Red Storm software in place: Yod, PCT, PBS, MySQL, RCA, and CPA.

"Consequently, it was on this platform that we developed a set of scripts for system configuration and easier job launch, as well as worked through a set of installation issues," Alverson noted. "Development on Alvarez was transferred back to the Cray internal systems and continues to be used by the system test and integration group."

In the area of system scalability, advances were made on a number of fronts:

Launch-time experiments provided initial data on launch times for programs with small and large executables. While absolute times will differ on the real Red Storm hardware, the shape of the graph was informative and exposed potential areas for future tuning.
Experiments with simple MPI programs and HPL showed software functionality scaling to 1,000 processors. "This was a significant result for us," Alverson reported.
Experiments with real applications, such as CTH and ITS, allowed the team to work with the codes and the system using smaller numbers of virtual processors.
The team found and fixed approximately 20 system bugs, from the portals layer up through MPI and system launch.

Future directions

While Cray has reached the end of its use of the Alvarez system for initial OS and MPI functional scalability testing, the I/O team is now ready to take over. Cray's I/O team is starting to work with NERSC to use the GUPFS (Global Unified Parallel File System) testbed platform, both on its own and joined with Alvarez, for scalability testing of two potential Red Storm file systems, PVFS and Lustre. The team will also collaborate with NERSC staff who are working on global file systems that will eventually be deployed within the NERSC Center.

Additional information

Established in 1974, the NERSC Center is DOE's flagship facility for unclassified supercomputing. NERSC is housed at Lawrence Berkeley National Laboratory; Berkeley Lab conducts only unclassified research and is managed by the University of California for the U.S. Department of Energy.
Cray is the premier provider of supercomputing solutions for its customers' most challenging scientific and engineering problems.
