Cosmic Ray Showers Crash Supercomputers. Here’s What To Do About It
The Cray-1 supercomputer, the world’s fastest back in the 1970s, does not look like a supercomputer. It looks like a mod version of that carnival ride The Round Up, the one where you stand, strapped in, as it spins you dizzy. It’s surrounded by a padded bench that conceals its power supplies, like a cake donut, if the hole were capable of providing insights about nuclear weapons.
After Seymour Cray first built this computer, he gave Los Alamos National Laboratory a six-month free trial. But during that half-year, a funny thing happened: The computer experienced 152 unattributable memory errors. Later, researchers would learn that cosmic-ray neutrons can slam into processor parts, corrupting their data. The higher you are, and the bigger your computers, the more significant a problem this is. And Los Alamos—7,300 feet up and home to some of the world’s swankiest processors—is a prime target.
The world has changed a lot since then, and so have computers. But space has not. And so Los Alamos has had to adapt—having its engineers account for space particles in its hard- and software. “This is not really a problem we’re having,” explains Nathan DeBardeleben of the High Performance Computing Design group. “It’s a problem we’re keeping at bay.”
For modern supercomputers, starting with one called Q, this is a big deal. Installed in 2003, Q was much quicker than the Cray-1, and it churned through calculations on the country’s nest-egg of nuclear weapons. But it crashed more than expected—the first failures that caused Los Alamos scientists to really worry about cosmic rays, charged particles that come from outer space. They collide with the chemicals in the atmosphere, and the whole mess breaks apart into smaller particles. “They literally make these showers that just rain down on us,” says Sean Blanchard, from the High Performance Computing Design group. And some of the raindrops are neutrons—which are bad news.
“They can cause computer memory to flip bits,” says DeBardeleben, “a 0 to 1 or 1 to 0.” That doesn’t much matter for your home computer. But Los Alamos has big number-crunchers. The early-aughts’ Q, for instance, called to mind grocery-store aisles. And today, the facility has racks of computers the size of a football field, and all the computers in that football field may be working on solving the same problem. Just as a football field sees a larger volume of rain than a back yard, supercomputers see more cosmic ray neutrons than your MacBook.
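To make the idea of a flipped bit concrete, here is a minimal Python sketch. It is an illustration only, not anything the lab runs: the helper name `flip_bit` is hypothetical, and it assumes the number in question is stored as a standard 64-bit floating-point value.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit encoding inverted.

    Bit 0 is the least significant bit; bit 63 is the sign bit.
    (Hypothetical helper, written for illustration.)
    """
    # Reinterpret the float's 8 bytes as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert exactly one bit, as a stray neutron might.
    raw ^= 1 << bit
    # Reinterpret the modified bytes as a float again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw))
    return flipped


if __name__ == "__main__":
    for bit in (0, 62, 63):
        print(bit, flip_bit(1.0, bit))
```

Which bit gets hit matters a great deal. Flipping the lowest bit of 1.0 nudges it by about two parts in ten quadrillion, flipping the sign bit turns it into -1.0, and flipping the top exponent bit turns it into infinity.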
After Q, the lab’s engineers truly understood that neutrons are not neutral parties, so now they try to preempt problems. Before Los Alamos installs new equipment, like its Trinity machine, engineers perform a kind of cosmic stress-test, placing the electronics in a beam of neutrons, many more than cascade from the sky at any given time, and watching what happens. “We take parts and make them radioactive and make them crash,” explains Blanchard. They will also soon place neutron detectors inside the supercomputing center, to measure the strength of the storm. If you know how many neutrons you’re getting, and you know how they make computer parts behave, “you can predict the lifetime of your electronics,” says Suzanne Nowicki, a physicist in the lab’s space science and applications group.
Supercomputers are usually smart enough to know if something has gone wrong, to feel that flipped bit like you’d feel someone tugging on a single strand of hair. And when that happens, the system usually just reports the error and rights itself. But sometimes, says Blanchard, the computer is more pessimistic. “I have an error. Too many bits flipped,” he mimics. “I can’t fix it, but I wanted you to know it happened.”
When that occurs at Los Alamos, they crash the computers, intentionally. It’s like falling down on purpose when you’re skiing, because that will hurt less than whatever else is about to happen. But you don’t have to walk back to the top of the slope and start all over again: The engineers have created “checkpoints” throughout the quest for answers. It’s like the save-spots in video games: If you die, you don’t have to start all over. You start at the last spot where you saved your progress. Supercomputers can do the same kind of save.