NERSC First To Reach Goal of Seamless Shutdown, Restart of Supercomputer
August 22, 1997
By Jon Bashor, jbashor@lbl.gov
Called "checkpointing," the stop/restart procedure--achieved twice in one week
at NERSC--is believed to be the first time such a procedure has been
accomplished on a massively parallel processor (MPP) supercomputer.
Checkpointing has been a major goal in the MPP community for the 10 years since
the first parallel machine was plugged in. C. William McCurdy, head of the
Lab's Computing Sciences, called the procedure "a remarkable achievement."
Checkpointing involves bringing all of the programs running on the computer to
the same stage and stopping them, then recording all the information,
transferring that information out of the machine to allow work on the system,
then putting it back in and getting it all running again--on a machine capable
of carrying out tens of billions of operations per second. A simple analogy
would be to give 1,000 kindergartners each 10 crayons (which they are
encouraged to share) and a picture to color in 10 stages. The teacher would
then get them all to stop at the same time at the fourth stage, collect all the
pictures and crayons, put them away on a shelf and then return each set of
materials to its original student the next day.
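For readers who think in code, the stop, save and resume cycle described above can be sketched in a few lines of Python. This is only a toy illustration and not the Cray software: the `Job` class and the `run`, `checkpoint` and `restart` functions are hypothetical names invented here, and each "job" is just a counter that moves through 10 stages, as in the kindergarten analogy.

```python
import json


class Job:
    """A toy job that works through a fixed number of stages (hypothetical)."""

    def __init__(self, name, total_stages, stage=0, result=None):
        self.name = name
        self.total_stages = total_stages
        self.stage = stage                      # stages completed so far
        self.result = list(result or [])        # work accumulated so far

    def step(self):
        # Do one stage of work, unless the job has already finished.
        if not self.done():
            self.stage += 1
            self.result.append(f"{self.name}:stage{self.stage}")

    def done(self):
        return self.stage >= self.total_stages


def run(jobs, stop_after=None):
    """Advance all jobs in lockstep; optionally halt once every job
    has completed `stop_after` stages."""
    while not all(j.done() for j in jobs):
        if stop_after is not None and all(j.stage >= stop_after for j in jobs):
            break
        for j in jobs:
            j.step()


def checkpoint(jobs):
    """Record every job's state as text that can leave the 'machine'."""
    return json.dumps([vars(j) for j in jobs])


def restart(saved):
    """Rebuild the jobs from the recorded state."""
    return [Job(**state) for state in json.loads(saved)]


# Four jobs, ten stages each; stop everyone at the fourth stage.
jobs = [Job(f"job{i}", total_stages=10) for i in range(4)]
run(jobs, stop_after=4)

saved = checkpoint(jobs)   # state is now outside the running system
jobs = None                # the machine is free for maintenance

restored = restart(saved)  # put everything back
run(restored)              # and carry on from stage four to the end
```

The point of the sketch is that the jobs themselves contain no checkpoint logic: the stopping, saving and restoring are done entirely from outside, which loosely mirrors the goal of checkpointing without reprogramming applications. A real MPP system must also capture memory, open files and messages in flight between processors, none of which this toy attempts.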
"As far as I know, no other MPP system is planning to do system-wide
checkpoint/restart without having to reprogram applications," said Bill Kramer,
deputy director of NERSC. "Therefore, this is really a momentous step for those
of us in the high-performance computing community."
Successful checkpointing will allow the NERSC staff to use the 512 processors
of its CRAY T3E more efficiently by moving jobs between the processors and
making larger pools of processors available quickly for bigger jobs. It will
also allow NERSC to make the entire 512-processor computer available to tackle
a single, complex problem when necessary, as well as carry out upgrades and
maintenance without disrupting the work of hundreds of researchers from around
the country.
"This signifies a major milestone in Cray's and NERSC's commitment to provide
robust, reliable Massively Parallel Processor computing cycles to DOE's
unclassified energy research community," said Michael Declerck of the NERSC
Systems Group. Declerck is the computer scientist charged with putting the
T3E-900, which was delivered in mid-July, through its month-long acceptance
tests.
The successful checkpointing was the result of software developed by Cray
Research Inc. and close collaboration between NERSC and Cray to refine the
application. The procedure was successfully demonstrated on both of NERSC's
CRAY T3E supercomputers, the 512-processor T3E-900 and the 160-processor
T3E-600. The checkpointing was performed once to allow scheduled maintenance
and a second time to test advanced operating system features. The restarted
jobs were running on clusters ranging from 16 to 256 processors.
"After we completed the downtimes, all of the user jobs on the machine were
successfully restarted and the machines were put back on line," said James
Craw, head of the NERSC Systems Group. "It's kind of ironic that we achieved
this major milestone and none of our users noticed--which was our objective."