Science@Berkeley Lab
August 5, 2005
 

NetLogger: Sharpening the Search for Supernovae

The SNfactory's data pipeline transports 50 gigabytes of images a night through a high-speed network connecting wide-field telescopes used by the Jet Propulsion Laboratory's Near Earth Asteroid Tracking program (NEAT) at observatories in Hawaii and Southern California, including Mount Palomar, shown here.
The Nearby Supernova Factory (SNfactory), established at Berkeley Lab in 2002, aims to dramatically increase the discovery of nearby Type Ia supernovae by applying assembly-line efficiencies to the collection, analysis, and retrieval of large amounts of astronomical data.

As of spring 2005 the program had aided the discovery of some 150 Type Ia supernovae -- about three times the total number of these distinctive supernovae reported before the SNfactory was launched. Type Ia supernovae are important to cosmology because they are used as "standard candles" for gauging the expansion of the universe.

Unclogging the data pipeline

Contributing to the SNfactory's remarkable discovery rate is its custom-developed "data pipeline" software. The pipeline fills with up to 50 gigabytes (billion bytes) of data per night from wide-field cameras built and operated by the Jet Propulsion Laboratory's Near Earth Asteroid Tracking program (NEAT). NEAT uses remote telescopes in Southern California and Hawaii.

Around 25,000 new images are captured each day, and the goal is to complete all processing before the next day's images arrive. Image data is copied in real time from the Mt. Palomar Observatory in Southern California to a mass storage system at the National Energy Research Scientific Computing Center (NERSC), based at Berkeley Lab. There the image data is copied to a shared disk array on a 344-node cluster called the Parallel Distributed Systems Facility. The images are big — 8 megabytes (million bytes) apiece, uncompressed — and processing each one requires between five and 25 reference images, so about 0.5 terabytes (trillion bytes) of disk space is needed each day.

Supernovae are found by comparing recently acquired telescope images with older reference images. If there is a source of light in the new image that did not exist in the old one, it could be a supernova. New light sources are identified by subtracting the reference image from the recent image, a delicate procedure that entails aligning the images, matching the point-spread functions, and matching the photometry and bias, all of which require precise calibration.
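In rough Python/NumPy terms, the subtraction step can be sketched as follows. This is a simplified illustration, not the SNfactory pipeline's code, and it assumes the alignment, point-spread-function matching, and photometric calibration described above have already been done.

    import numpy as np

    def find_candidates(new_image, reference, threshold=5.0):
        """Toy image subtraction: assumes the two frames are already
        aligned, PSF-matched, and photometrically calibrated."""
        diff = new_image - reference      # a new light source leaves a positive residual
        noise = diff.std()                # crude noise estimate, for illustration only
        ys, xs = np.where(diff > threshold * noise)
        return np.column_stack([ys, xs])  # pixel coordinates of candidate sources

In the real pipeline the calibration steps dominate the effort; the subtraction itself is the simple part.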

Because of the high demand put on all the resources in the pipeline, making sure that data flow smoothly and can be analyzed quickly and correctly is critical to the SNfactory's overall success. While there are a number of tools for evaluating the performance of single systems, identifying workflow bottlenecks in a distributed system like the SNfactory requires a different kind of application.

For the past 10 years, Brian Tierney and others in the Collaborative Computing Technologies Group of Berkeley Lab's Computational Research Division (CRD) have been developing the NetLogger toolkit. Part of the Distributed Monitoring Framework project, NetLogger is a set of libraries and tools to support end-to-end monitoring of distributed applications. Recently the team worked closely with the SNfactory to help debug and tune its application.
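The core idea is that each stage of a distributed application writes precisely timestamped event records that can later be correlated end to end. A minimal sketch of that style of instrumentation, in plain Python, follows; the record layout and event names are illustrative assumptions, not NetLogger's exact API or schema.

    import time
    from datetime import datetime, timezone

    def log_event(event, **fields):
        """Emit one timestamped, name=value event record (an illustrative
        format in the spirit of NetLogger's logs, not its exact schema)."""
        ts = datetime.now(timezone.utc).isoformat()
        extras = " ".join(f"{key}={value}" for key, value in fields.items())
        print(f"ts={ts} event={event} {extras}".rstrip())

    # Bracketing each pipeline stage with start and end events lets a
    # monitoring tool reconstruct the whole workflow across hosts.
    log_event("snfactory.subtract.start", image="example.fits")
    time.sleep(0.1)  # stand-in for the real subtraction work
    log_event("snfactory.subtract.end", image="example.fits", status="ok")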

"NetLogger has been extremely useful in the debugging and commissioning of our data processing pipeline," says Stephen Bailey of the Physics Division, one of the lead developers of the SNfactory. "It has helped us identify bugs and processing bottlenecks in order to improve our efficiency and data quality. Additionally it has allowed real-time monitoring of the data processing to quickly identify problems that need immediate attention. This debugging, commissioning, and monitoring would have taken much longer without NetLogger."

NetLogging the Supernova Factory

"The first problem the SNfactory scientists asked us to solve was to figure out why some of their workflows where failing without any error messages as to the cause," Tierney says. "Even when error messages were generated, the SNfactory application produced thousands of log files, and it was very difficult to locate the log messages related to failed workflows. NetLogger was very useful for easily characterizing where the failures were occurring, so they would know where to focus debugging efforts."

A bug in SNfactory processing, shown here in the workflow on a single cluster node, went undetected for several months before analysis by NetLogger: the ascending lines should have converged to yield a setskyflat event. (Horizontal lines at bottom are CPU and network data.) The nearly vertical lines near the beginning were read as errors but in fact mark correct "dark images," essential to the calculation. Once the workflow was able to identify these as completed processes, the problem was solved.

The SNfactory application processes a group of images together, starting by uncompressing them and then performing image calibration and subtraction. The next step is to generate a skyflat image: a calibration image that is formed from a median combination of several other images. The skyflat image is used to correct other images to adjust for the sky's brightness on a given night, which can vary due to humidity, cloud cover, and other conditions. The skyflat calibration image is then applied to all images within the job.
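The median combination itself is straightforward, as the short NumPy sketch below illustrates. It is not the SNfactory's calibration code, and it assumes the correction is applied by dividing each image by the normalized skyflat.

    import numpy as np

    def make_skyflat(frames):
        """Median-combine several already-calibrated frames into a skyflat.
        The per-pixel median rejects stars and cosmic rays, leaving the
        smooth sky response for that night."""
        skyflat = np.median(np.stack(frames), axis=0)
        return skyflat / np.median(skyflat)  # normalize to unit median

    def apply_skyflat(image, skyflat):
        # Assumed form of the correction: divide out the sky response.
        return image / skyflat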

Adjusting for sky conditions in a given batch requires generating "dark images," which are only processed to a certain point, then halted. Although they need no further processing, these are essential to other calculations. SNfactory personnel assumed that under some conditions skyflat calibration wouldn't be needed, but this turned out to be a mistake.

SNfactory scientists knew there was a problem in adjusting for sky brightness, but they could not locate it, partly because of the absence of error messages. To track it down, a new NetLogger anomaly-detection tool was applied. Because its algorithm is based on whether a workflow executes its last event, the dark images, which stopped early, were flagged as anomalies even though they were completed (and necessary) tasks.
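In effect, the heuristic checks whether each workflow's event trace ends with the expected final event. The sketch below uses invented traces to show how correctly truncated dark-image workflows end up flagged alongside genuine failures.

    def flag_anomalies(traces, final_event):
        """Flag any workflow whose event trace does not end with the
        expected final event (the simple last-event heuristic)."""
        return [name for name, events in traces.items()
                if not events or events[-1] != final_event]

    # Invented example traces: the dark frame stops early on purpose,
    # yet the heuristic cannot tell it apart from a real failure.
    traces = {
        "science_frame_17": ["uncompress", "calibrate", "subtract", "setskyflat"],
        "dark_frame_03": ["uncompress", "calibrate"],     # complete, just short
        "science_frame_42": ["uncompress", "calibrate"],  # genuinely failed
    }
    print(flag_anomalies(traces, "setskyflat"))
    # ['dark_frame_03', 'science_frame_42']: the dark frame is a false positive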

Once the problem was located, it was easily solved by introducing an artificial event that all workflows could write when they came to the end of their normal processing task, no matter how truncated. Now any completed processing task appears with the keyword "done."
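Continuing the invented traces from the sketch above, the check then reduces to asking whether a workflow ever wrote "done":

    # Every workflow that finishes its normal processing, however
    # truncated, now writes an artificial "done" event.
    traces = {
        "science_frame_17": ["uncompress", "calibrate", "subtract", "setskyflat", "done"],
        "dark_frame_03": ["uncompress", "calibrate", "done"],  # short but complete
        "science_frame_42": ["uncompress", "calibrate"],       # still failed: no "done"
    }
    failed = [name for name, events in traces.items() if "done" not in events]
    print(failed)  # ['science_frame_42']: only the genuine failure is flagged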

NetLogger analysis of the Nearby Supernova Factory workflow discovered important bugs, demonstrating the importance of the anomaly-detection tool. In the future, NetLogger will be able to recognize more kinds of anomalies automatically. The programmers are also exploring database integration issues. Future enhancements, whatever their nature, will continue to be based on the philosophy of simplicity first.

Additional information

  • "Scalable analysis of distributed workflow traces" was presented by Brian Tierney, Stephen Bailey, and Dan Gunter of CRD's Collaborative Computing Technologies Group at the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications in Las Vegas, Nevada.
 