Phone System Squabbles
By Phil the Canuck (Tue Apr 21, 2009 at 01:36:30 PM EST)
When servers fight over territory, everyone loses.

In theory, our phone system is designed with decent fault tolerance given our size.  Two servers, primary and backup, housed at separate hub sites.  When the backup is unable to communicate with the primary it takes over.  When communication is restored, the two are supposed to negotiate which one will remain the live server.  In theory the primary holds a trump card that will always cause it to win any such negotiation.  What I learned today is, "oh yeah, that's a bit buggy."

A bit buggy.

A fiber transceiver died at our backup site.  Not our equipment, in the fiber box, so I got to have a nice chat with a fiber tech as he went to work.  At first he thought it was a patch cable, which would be a rare thing but made sense given what we were seeing.  He replaced it, everything came back up.  Then as he was packing up it dropped again for a few seconds.  That's when he ended up finding the transceiver problem.

While this was happening, our two servers got a chance to catch up on old times.  They started negotiations, and the backup won.  This is where the bug comes in.  The primary gave in to the backup. 
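
For reference, here is a minimal sketch of the rule as it is supposed to work: when the link comes back, the designated primary holds the trump card and stays live. This is not the vendor's code. The `Server` class, the `is_primary` flag, and `negotiate` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    is_primary: bool    # hypothetical flag standing in for the "trump card"
    live: bool = False  # True while this server is handling calls


def negotiate(a: Server, b: Server) -> Server:
    """Decide which server stays live once the two can talk again.

    Intended rule only: the designated primary always wins. If neither
    server is flagged as primary, this sketch falls back to `b`.
    """
    winner = a if a.is_primary else b
    loser = b if winner is a else a
    winner.live = True
    loser.live = False
    return winner


primary = Server("primary", is_primary=True)
backup = Server("backup", is_primary=False, live=True)  # took over during the outage
print(negotiate(primary, backup).name)
```

What actually happened was the opposite outcome: the backup came out of the negotiation live and the primary stood down.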

Here's another bug that's just neat-o.  When one of our servers gives up primary status, it can't accept a switchover until it reboots.  (OK, to be fair, I could probably just restart some services, but since restarting the appropriate services takes about 95% of the server's boot time, I've never bothered to check.)  So the primary gave in to the backup, and the fiber tech started monkeying with the patch and transceiver.  Up-down, up-down, up-down.

I had to power down the backup and reboot the primary so it would take charge again.  I just got back from powering the backup back on, because with my luck the primary would crash tonight if I didn't.

I'm still answering questions.

Also, a stump-dumb program director has asked for something three times in the last week, despite being told each time that it was impossible.  We had figured it was because she's special, but it turns out she runs to The Boss each time we say no, and he assures her it's no problem.  Thanks, douche.  The best part is that I've had two conversations with him where we agreed this is not possible.  He just can't say no to people, or rather has learned that if he says yes he can leave without being drawn into a conversation.  It's all good, because it makes work for us and not him.

patch cable, which would be a rare thing by wiredog (2.00 / 0) #1 Tue Apr 21, 2009 at 01:45:59 PM EST
I dunno, I always check the cables first. They've probably had more mechanical stress, and they're easy to test.

Switch settings, too. Another easy fix.

I dunno by Phil the Canuck (2.00 / 0) #2 Tue Apr 21, 2009 at 02:19:20 PM EST
A fiber patch, run through a protective (SERIES OF) tube from one stationary object to another.

Good point by wiredog (4.00 / 1) #4 Tue Apr 21, 2009 at 02:21:44 PM EST
Still the easiest thing to test...

Heh by Gedvondur (4.00 / 1) #7 Tue Apr 21, 2009 at 04:24:35 PM EST
Always gotta check the easy shit first. 

Been bit in the ass too many times not to check all of the easy/stupid stuff first. 

Then it's the Fonz method of percussive maintenance.


Is by sasquatchan (4.00 / 1) #3 Tue Apr 21, 2009 at 02:20:57 PM EST

I can't tell, it's too dark to see by georgeha (4.00 / 1) #5 Tue Apr 21, 2009 at 02:35:29 PM EST

You are likely to be eaten by a grue. by gzt (4.00 / 2) #6 Tue Apr 21, 2009 at 03:03:01 PM EST

I'm reminded of the Morris worm. by dark nowhere (2.00 / 0) #8 Tue Apr 21, 2009 at 08:18:35 PM EST
It would have gone undetected (and probably been innocuous) for far longer if it hadn't had bugs in a similar place. (At least your phone system hasn't shut down the internet at large.)

IIRC by bobdole (2.00 / 0) #9 Wed Apr 22, 2009 at 02:58:20 AM EST
The Morris worm was designed to be a harmless "exploring the size of the network" exercise rather than the denial of service it turned out to be...

That is correct. by dark nowhere (2.00 / 0) #10 Wed Apr 22, 2009 at 05:06:39 PM EST
Unfortunately, it couldn't keep its population down as intended.

