NERSC First To Reach Goal of Seamless Shutdown, Restart of Supercomputer

August 22, 1997

By Jon Bashor, jbashor@lbl.gov

The National Energy Research Scientific Computing Center (NERSC) this month achieved a milestone in high-performance computing: successfully stopping and restarting a number of scientific computing jobs on a CRAY T3E supercomputer without any data processing loss or discontinuity.

The stop/restart procedure, called "checkpointing," was carried out twice in one week at NERSC, in what is believed to be the first time such a procedure has been accomplished on a massively parallel processor (MPP) supercomputer. Checkpointing has been a major goal in the MPP community for the 10 years since the first parallel machine was plugged in. C. William McCurdy, head of the Lab's Computing Sciences, called the procedure "a remarkable achievement."

Checkpointing involves bringing all of the programs running on the computer to the same stage and stopping them, recording all of their information, and transferring that information out of the machine so that work can be done on the system. The information is then put back in and everything is set running again--all on a machine capable of carrying out tens of billions of operations per second. A simple analogy would be to give 1,000 kindergartners each 10 crayons (which they are encouraged to share) and a picture to color in 10 stages. The teacher would then get them all to stop at the same time at the fourth stage, collect all the pictures and crayons, put them away on a shelf and then return each set of materials to its original student the next day.
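For readers who think in code, the stop/record/restore cycle described above can be sketched as a toy Python program. This is only an illustration under simplifying assumptions: the Job class and the run_until, checkpoint and restart helpers are hypothetical names invented for this sketch, the "jobs" simply add up integers, and nothing here reflects how the Cray software works on a real T3E.

```python
import pickle


class Job:
    """Toy stand-in for a running computation: adds 1, 2, 3, ... step by step."""

    def __init__(self, name, total_steps):
        self.name = name
        self.total_steps = total_steps
        self.step = 0
        self.accumulator = 0

    def done(self):
        return self.step >= self.total_steps

    def advance(self):
        # Perform one unit of work, unless the job has already finished.
        if not self.done():
            self.step += 1
            self.accumulator += self.step


def run_until(jobs, stop_step):
    """Bring every job to the same stage (or to completion, if it is shorter)."""
    for job in jobs:
        while job.step < stop_step and not job.done():
            job.advance()


def checkpoint(jobs):
    """Record all job state as bytes that could be moved off the machine."""
    return pickle.dumps([vars(job) for job in jobs])


def restart(blob):
    """Rebuild the jobs from the recorded state."""
    jobs = []
    for state in pickle.loads(blob):
        job = Job(state["name"], state["total_steps"])
        job.step = state["step"]
        job.accumulator = state["accumulator"]
        jobs.append(job)
    return jobs


def run_to_completion(jobs):
    for job in jobs:
        while not job.done():
            job.advance()
    return {job.name: job.accumulator for job in jobs}


def demo():
    # Start two jobs, stop them at stage 4, and save their state.
    running = [Job("a", 10), Job("b", 10)]
    run_until(running, 4)
    saved = checkpoint(running)
    del running  # stands in for taking the machine down for maintenance

    # Restore the jobs and let them finish as if nothing had happened.
    return run_to_completion(restart(saved))
```

In this sketch, calling demo() gives the same totals as running the two jobs without interruption, which is the property the NERSC staff describe: the restarted work continues with no loss or discontinuity.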

"As far as I know, no other MPP system is planning to do system-wide checkpoint/restart without having to reprogram applications," said Bill Kramer, deputy director of NERSC. "Therefore, this is really a momentous step for those of us in the high-performance computing community."

Successful checkpointing will allow the NERSC staff to use the 512 processors of its CRAY T3E more efficiently by moving jobs between the processors and making larger pools of processors available quickly for bigger jobs. It will also allow NERSC to make the entire 512-processor computer available to tackle a single, complex problem when necessary, as well as carry out upgrades and maintenance without disrupting the work of hundreds of researchers from around the country.

"This signifies a major milestone in Cray's and NERSC's commitment to provide robust, reliable Massively Parallel Processor computing cycles to DOE's unclassified energy research community," said Michael Declerck of the NERSC Systems Group. Declerck is the computer scientist charged with putting the T3E-900, which was delivered in mid-July, through its month-long acceptance tests.

The successful checkpointing was the result of software developed by Cray Research Inc. and close collaboration between NERSC and Cray to refine the application. The procedure was successfully demonstrated on both of NERSC's CRAY T3E supercomputers, the 512-processor T3E-900 and the 160-processor T3E-600. The checkpointing was performed once to allow scheduled maintenance and a second time to test advanced operating system features. The restarted jobs were running on clusters ranging from 16 to 256 processors.

"After we completed the downtimes, all of the user jobs on the machine were successfully restarted and the machines were put back on line," said James Craw, head of the NERSC Systems Group. "It's kind of ironic that we achieved this major milestone and none of our users noticed--which was our objective."
