Testing the Big Red Switch
By wiredog (Wed Feb 05, 2014 at 06:57:28 AM EST)
Mentioned that I thought I had told this story in a reply to a post on Google Plus. But if I have, I can't find it. So here follows the story of a test of the Big Red Switch that fubared the FBI...

This happened a few years ago... The Bureau has a facility out in Chantilly, along with a bunch of other spooky agencies. The basement is full of servers hosting a suite of classified web apps including classified, classified, classified, and doesn't exist. Since it's under the landing pattern for Dulles Airport there's a coop (continuity of operations) plan to fail over to another site located in classified with hot backups running continuously. There's also a large generator (50 kW or so) designed to start automatically if the power fails, with a switch that automatically switches the building to run on the generator instead of mains power.

Because the various web apps were Vital to the Protection of our Country no one could afford to actually do a test of the failover systems, because that would require the entire set being offline for, possibly, a few minutes. Which couldn't be tolerated. Because, you know, might let the next 9/11 happen. (Cue ominous music...)

So one fine day the fire marshal is going through (the unclassified part of) the building and, as part of his inspection, hits the Big Red Switch. All the UPSes start beeping wildly. The generator... Does nothing. The failover switch... Fails. The backup site... Doesn't come up.

For added fun, when the fire marshal tried to turn the building power back on, it wouldn't.

And about 15 minutes later all the web-based software used by the FBI, nationwide, dropped offline.

The system I was working on at the time started out as a small demo project that Management took one look at, said "We need that NOW!", after which it promptly went online without any real design. Then it spent several years growing, like a fungus, until it was too unwieldy to maintain anymore. It was never designed, and therefore was never designed to shut down. There was no procedure for bringing it up from a hard crash like that.

Took 3 days to get everything back online, and we never did figure out if there was any data loss.

The reason the generator didn't start? A $15 relay failed. No idea why the coop site didn't pick up, but they did have a new contractor running the place shortly after.

Testing the Big Red Switch | 6 comments (6 topical, 0 hidden) | Trackback
poor form... by belldandi (2.00 / 0) #1 Wed Feb 05, 2014 at 07:43:14 AM EST
Story of, but there is no reason to put any locations at all here. May I suggest this story be obliterated/edited?

The office locations weren't classified by wiredog (4.00 / 1) #2 Wed Feb 05, 2014 at 08:27:45 AM EST
And they've since moved.

Earth First!
(We can strip mine the rest later.)

Also by gmd (2.00 / 0) #6 Thu Feb 06, 2014 at 12:11:06 AM EST
 Just because something is classified, doesn't mean your enemies don't know all about it...

gmd - HuSi's second most dimwitted overprivileged user.
Years ago... by ana (4.00 / 1) #3 Wed Feb 05, 2014 at 08:45:56 AM EST
when we talked nasa into letting us operate the satellite from here (instead of a nasa center, in, say, beauteous huntsville, alabama) we set up a pair of redundant leased T1 lines, through different telecom companies, to get us data and commanding from jpl on the west coast.

a few years later, business being what it is, the telecom companies had merged, and it turned out both T1 lines went through the same switch building in denver. there was a fire. we lost them both. stuff was down for a good long while. i don't think we lost any data (the dsn folks recorded it for us), comma, but.

I now know what the noise that is usually spelled "lolwhut" sounds like. --Kellnerin

clearly all the fire marshal's fault -nt- by clover kicker (4.00 / 1) #4 Wed Feb 05, 2014 at 09:15:56 AM EST

Good lord. by technician (4.00 / 2) #5 Wed Feb 05, 2014 at 09:29:29 AM EST
Many years ago, I had a hand in designing a redundant hot backup site for a 911 call center. Two physical ends of the same county, both systems on the same FASTAR-backed DS3 net with three DS3s carved out for their use (one active, one passive, one oh shit).

Power was supplied by different parts of the same supply grid, so each location had huge generators. The design was very cool; initially my part was just getting two Windows servers to be redundant (DEC Alphas with Vinca standby cards connected to one another via the FASTAR DS3 net), but I was eventually in every meeting about the topology of power, cooling, and data.

Our final test was scheduled (thankfully well before the sites went live), and the engineering dev team pulled the plug on site A. Site B monitored site A using five different methods. Up to three methods could be "problematic" and the monitor would still work; if it alarmed on four of five sensors, the hot site would kick on, and calls would be routed immediately.
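
The four-of-five rule described above can be sketched roughly like this. It is only an illustration: the function name, the threshold constant, and the idea of passing the five check results in as booleans are assumptions for the example, not details of the real monitor.

```python
from typing import Sequence

# Assumed from the description above: fail over only when at least
# four of the five independent checks report a problem.
FAILOVER_THRESHOLD = 4


def should_fail_over(check_failed: Sequence[bool],
                     threshold: int = FAILOVER_THRESHOLD) -> bool:
    """Return True when enough independent checks say the primary is down."""
    # True counts as 1 and False as 0, so sum() gives the number of failures.
    return sum(check_failed) >= threshold


# Three "problematic" checks are tolerated; the primary stays active.
print(should_fail_over([True, True, True, False, False]))  # False
# Four failing checks trip the failover to the hot site.
print(should_fail_over([True, True, True, True, False]))   # True
```

Requiring agreement between several independent checks keeps one flaky sensor from triggering a failover, at the cost of needing most of them to notice a real outage.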

So, they pull the plug, and the hot site kicks on. My servers, already on and duplicated continuously, are handling the call queues: new calls are being queued and show up on my servers (since they're duplicated). However, the voice calls themselves are still going to the dead site.

The telco switch talked to two Ascend MAX units. Those things were very, very tricky on a good day. Ascend had helped program them. Calls were routed to both simultaneously. They both "took" calls, and when a call was completed (i.e. someone picked up a line), the neighbor switch would ack, then clear its line. However, there was a typo in the config: both switches were getting calls as designed, but the backup site switch was immediately clearing the line once the dead site switch acknowledged that the call existed at all (i.e. "there's a call on line 6" was being seen as "I've got the call on line 6"). When the dead site switch was completely unplugged, the neighbor switch would get a call, ack it, and wait for the other switch to respond, forever... and never allow the call to complete. Ever. Both switches did this: they'd patiently wait while someone died or gave up.
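
A toy model of that mix-up, for anyone trying to picture it. This is not the Ascend MAX config or its real protocol: the message names, the `handle_call` function, and the "answer locally when the neighbor is silent" behavior for the correct case are all assumptions made up for the sketch.

```python
from enum import Enum, auto
from typing import Optional


class PeerMsg(Enum):
    SEEN = auto()      # "there's a call on line 6"
    ANSWERED = auto()  # "I've got the call on line 6"


def handle_call(peer_msg: Optional[PeerMsg], *, buggy: bool) -> str:
    """Decide what one switch does with a ringing line.

    peer_msg is the latest message from the neighbor switch, or None if
    the neighbor is silent (for example, unplugged).
    """
    if peer_msg is PeerMsg.ANSWERED:
        return "clear"   # the neighbor really has the call
    if peer_msg is PeerMsg.SEEN:
        # The misconfiguration: "seen" is treated as "answered".
        return "clear" if buggy else "hold"
    # Neighbor is silent.
    if buggy:
        return "wait"    # waits for a reply that never comes
    return "answer"      # assumed intended behavior: complete the call here


print(handle_call(PeerMsg.SEEN, buggy=True))   # clear
print(handle_call(None, buggy=True))           # wait
print(handle_call(None, buggy=False))          # answer
```

With both sites up, the buggy branch drops the line too early but the other site still has the call, so nothing looks wrong. Only with the neighbor unplugged does the "wait" branch show itself, which is why it took a pull-the-plug test to find.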

So: a typo (possibly several) in a config that was HUGE, custom, dense, barely human readable, and produced with the help of the manufacturer. Took two full months to chase down precisely what was happening.

From what I understand (I'd moved to Texas by then), the telco took full control of the data and voice path all the way to the jacks in the sites. They had quite a bit more experience with that sort of thing than the Baby Bell contractor did.
