Because the various web apps were Vital to the Protection of our Country, no one could afford to actually test the failover systems, because that would require taking the entire set offline for, possibly, a few minutes. Which couldn't be tolerated. Because, you know, it might let the next 9/11 happen. (Cue ominous music...)
So one fine day the fire marshal is going through the (unclassified part of the) building and, as part of his inspection, hits the Big Red Switch. All the UPSes start beeping wildly. The generator... Does nothing. The failover switch... Fails. The backup site... Doesn't come up.
For added fun, when the fire marshal tried to turn the building power back on, it wouldn't.
And about 15 minutes later all the web-based software used by the FBI, nationwide, dropped offline.
The system I was working on at the time started out as a small demo project that Management took one look at and said "We need that NOW!", after which it promptly went online without any real design. Then it spent several years growing, like a fungus, until it was too unwieldy to maintain anymore. It was never designed, and therefore was never designed to shut down. There was no procedure for bringing it up from a hard crash like that.
Took 3 days to get everything back online, and we never did figure out if there was any data loss.
The reason the generator didn't start? A $15 relay failed. No idea why the COOP site didn't pick up, but they did have a new contractor running the place shortly after.