A Day in the Life
Running Around in Circles

A customer with locations in the US and Europe had chronic speed problems at the European sites. E-Mail was slow, adding attachments resulted in time-outs, and queries could take half an hour when they didn't time out altogether.

Inside I will make you say "LOL WHATTF".

x-posted from brogspot



Technical stuff:
No Web proxy
Available bandwidth: US has a dedicated T1, Germany has 2MB DSL, UK has 3MB "Internet Access"
US and Germany have their own IPs; the UK connects through them
Germany's Trace Route:
2 * * * Request timed out.
3 * * * Request timed out.
4 * * * Request timed out.

etc.

Germany's Ping:
Reply from x.x.x.132: bytes=128 time=205ms TTL=226
Reply from x.x.x.132: bytes=128 time=201ms TTL=226
Request timed out.
Request timed out.
...
Request timed out.
Request timed out.
Reply from x.x.x.132: bytes=128 time=202ms TTL=226
Reply from x.x.x.132: bytes=128 time=173ms TTL=235
Request timed out.
...
Request timed out.
Request timed out.
Reply from x.x.x.132: bytes=128 time=404ms TTL=234
Reply from x.x.x.132: bytes=128 time=398ms TTL=234
Reply from x.x.x.132: bytes=128 time=404ms TTL=234
Reply from x.x.x.132: bytes=128 time=398ms TTL=234
...
Reply from x.x.x.250: bytes=56 (sent 128) time=3570ms TTL=228
Reply from x.x.x.250: bytes=56 (sent 128) time=191ms TTL=228
Reply from x.x.x.250: bytes=56 (sent 128) time=192ms TTL=228

That's some serious delay going on, and it got worse as soon as the US came online. Whatever the Europeans wanted to get done, they needed to finish by around 2:00 p.m.
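
Eyeballing a ping log like the one above only gets you so far. Here's a minimal Python sketch (an illustration, not the tooling used on this case) that tallies replies and time-outs and reports the loss percentage and latency spread. The name `summarize_ping` and the sample lines are assumptions for the example; it accepts both the German and the English wording of Windows ping output. The log above is elided, so what the script prints describes only the sample lines, not the real loss rate.

```python
import re
import statistics

# Round-trip time in a reply line: German "Zeit=205ms" or English "time=205ms".
RTT_RE = re.compile(r"(?:Zeit|time)[=<](\d+)\s*ms", re.IGNORECASE)
# Time-out wording in German and English Windows ping output.
TIMEOUT_MARKERS = ("Zeitüberschreitung der Anforderung", "Request timed out")


def summarize_ping(lines):
    """Return loss and latency statistics for an iterable of ping output lines."""
    rtts = []
    timeouts = 0
    for line in lines:
        match = RTT_RE.search(line)
        if match:
            rtts.append(int(match.group(1)))
        elif any(marker in line for marker in TIMEOUT_MARKERS):
            timeouts += 1
        # Anything else (blank lines, "...", headers) is ignored.
    total = len(rtts) + timeouts
    return {
        "sent": total,
        "lost": timeouts,
        "loss_pct": 100.0 * timeouts / total if total else 0.0,
        "min_ms": min(rtts) if rtts else None,
        "median_ms": statistics.median(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
    }


# A handful of lines in the style of the log above (illustrative only).
sample = [
    "Antwort von x.x.x.132: Bytes=128 Zeit=205ms TTL=226",
    "Antwort von x.x.x.132: Bytes=128 Zeit=201ms TTL=226",
    "Zeitüberschreitung der Anforderung.",
    "Zeitüberschreitung der Anforderung.",
    "Antwort von x.x.x.250: Bytes=56 (gesendet 128) Zeit=3570ms TTL=228",
]
print(summarize_ping(sample))
```

Feeding it a full, unelided capture (for example the saved output of a long-running ping) would give a loss figure you can compare hour by hour.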

We got a network diagram and information on settings and everything looked kosher. There were no 404s, just time-outs and really slow response times. So bad was this packet loss, in fact, that they'd often get "Cannot find server or DNS Error" messages. Basic look-ups (many of which you'd expect to have been cached) were failing.

After two weeks of exhausting the possibility of hardware errors, we started inspecting the application servers. More people were brought in. We set logging to the highest possible level and painstakingly checked through megs and megs of information. Surprisingly enough, despite the heavy logging (normally not done during production hours), the system got no worse.

There was also no answer in sight.

We turned to the database. Even more people were brought in. The generated SQL and activities were examined and everything looked fine, at least for those requests which actually made it to the database. Many didn't.

At least, not until around 4:30 p.m. Eastern, when the system would slowly come back from the dead as people started going home.

After more than a month with no results, we tore into the data being sent through the network. Snort didn't detect any DoSing and Ethereal showed that the highest-volume hits were to music-related sites. After being informed of this, the customer installed WebTrends to better monitor what was going on. Still, even if everyone had been streaming music, that could have slowed the network down but couldn't have caused the problems we saw.

So we logged every single packet of data. The Ethereal logs spanned gigabytes.

Meanwhile, the network was getting a little better. Management sent out a note that WebTrends saw some things they didn't like and anyone caught downloading music would be fired. An awful lot of their employees went straight to Add/Remove Programs that afternoon. But the problems continued.

This company had been only barely able to work for the past six weeks and we were running out of options.

More traffic analysis showed that a full 5% of all traffic was going from one machine to www.deviantart.com. That was second only to their mail server (with 6.5%). It got stranger: 30% of all incoming connections were coming from adelphia.net. They'd stopped downloading and started streaming. Clever ell-users.

Our application accounted for only 3% of network traffic. Adelphia and a few other sites were firewalled and things picked up a bit. There were still constant delay problems but time-outs had mostly disappeared.
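
The per-host percentages above come down to summing bytes per host and dividing by the total. A rough Python sketch of that tally, assuming the capture has already been exported as (host, byte count) pairs; the function name `traffic_share`, the record layout, and the sample numbers are all hypothetical and not taken from the actual Ethereal logs.

```python
from collections import Counter


def traffic_share(records):
    """Return [(host, percent_of_total_bytes)], largest share first.

    `records` is an iterable of (host, byte_count) pairs. That layout is
    an assumption for this sketch, not a real capture export format.
    """
    totals = Counter()
    for host, nbytes in records:
        totals[host] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (host, 100.0 * nbytes / grand_total)
        for host, nbytes in totals.most_common()
    ]


# Made-up records purely for illustration.
sample_records = [
    ("mail.example.internal", 6_000),
    ("app.example.internal", 3_000),
    ("some-music-site.example", 9_000),
    ("mail.example.internal", 2_000),
]
for host, pct in traffic_share(sample_records):
    print(f"{pct:5.1f}%  {host}")
```

Sorting by share is what makes an oddity stand out: a single workstation or an outside ISP near the top of the list, ahead of the business application, is the cue to look closer.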

Work had been going on for a full two months, non-stop, and it didn't seem like we were ever going to figure this out. And then we got an E-Mail:

You can close this case. Thanks to $EtherealLogReaderGuy's help we discovered that they had some jokers who were on partypoker.com most of the day. They've been fired and US performance is OK.
Nine fucking weeks. Twenty-eight gigabytes of logs. More than a dozen of our people. All because two utter fucktards were playing poker on some central servers.

Root Cause: 17-Fuckwit

A Day in the Life | 14 comments (14 topical, 0 hidden) | Trackback
How does poker take down an international network? by JrSysAdmin (4.00 / 2) #1 Tue May 09, 2006 at 02:32:31 AM EST


1280x2048 video strip poker? by martingale (4.00 / 2) #3 Tue May 09, 2006 at 02:38:27 AM EST
nice thought.
--
$E(X_t|F_s) = X_s,\quad t > s$
[ Parent ]
LOL WHATTF?! by Cloaked User (4.00 / 3) #2 Tue May 09, 2006 at 02:37:23 AM EST
I too am curious as to how two guys playing poker caused so many problems - was it really sucking that much bandwidth?


--
This is not a psychotic episode. It is a cleansing moment of clarity.
I'm wondering if by ucblockhead (2.00 / 0) #11 Tue May 09, 2006 at 07:03:51 AM EST
"Poker" is just the story to either prevent embarrassment or avoid alerting the RIAA/MPAA to the 15 terabytes of copyrighted material.

One of the guys in our office was playing poker for a while (or maybe two) and it had no effect that I could see on our 1.1 Mbps line, shared by ten or so.
---
[ucblockhead is] useless and subhuman

[ Parent ]
Don't think so. by ReallyEvilCanine (2.00 / 0) #12 Tue May 09, 2006 at 08:45:03 AM EST
This customer doesn't have any ties to RIAA/MPAA that I know of, but now I'm pretty sure what I'm posting tomorrow.

the internet: amplifier of stupidity -- discordia

[ Parent ]
No by ucblockhead (2.00 / 0) #13 Tue May 09, 2006 at 08:47:15 AM EST
I mean, they don't want to get sued by the RIAA/MPAA for all the pirated content employees put on their servers.
---
[ucblockhead is] useless and subhuman
[ Parent ]
That sounds more likely by Cloaked User (2.00 / 0) #14 Tue May 09, 2006 at 11:58:35 AM EST
I can imagine a couple of users maxing out a connection like that running p2p software or similar, but online poker? I can't imagine that it would require much bandwidth at all, unless it used a live video feed or something.


--
This is not a psychotic episode. It is a cleansing moment of clarity.
[ Parent ]
What they all said by Rogerborg (2.00 / 0) #4 Tue May 09, 2006 at 02:55:22 AM EST
Is that the poker game with the client that lets you pick some implanted bimbo as an avatar?  What was taking up the bandwidth?  All the "ASL??? ASL??? WANNA SIBER????" chats?

-
Metus amatores matrum compescit, non clementia.
I think I solved it! by DesiredUsername (2.00 / 0) #5 Tue May 09, 2006 at 03:00:48 AM EST
Sure, guys playing poker at work is stupid--but the real issue sounds like the idiots who designed a 2400 baud network for a transnational.

---
Now accepting suggestions for a new sigline
Reply to all by ReallyEvilCanine (2.00 / 0) #6 Tue May 09, 2006 at 03:03:56 AM EST
I wish I knew. I have no idea how the hell it was eating all the resources other than that the servers they were using handled lots of traffic. The company didn't want to talk anymore about the subject. I'm guessing they were central proxies or something.

Likewise I couldn't find anything to show that partypoker was that much of a resource hog. Who knows? Maybe they were also running an eMule server and pulling tons of torrents as well. Your guess is as good as mine.

We forwarded the mail snippet from the customer to our worldwide support centers. We're laughing now...

the internet: amplifier of stupidity -- discordia

Goddamn. by blixco (4.00 / 3) #7 Tue May 09, 2006 at 03:13:32 AM EST
I had a client with a user who took out the network with Half Life.  He was using win98, and had one NIC.  Win98 allowed users to bind more than one IP to their interfaces.  He'd bound an IP for his production net and an IP for the rest of the network, and the resulting ARP storm took out the core switches.

When I disconnected his segment he was all like "dude, you pooched my game, man.  I was totally going to win."

The owner asked me what had happened.  I told him, and he calculated the cost of his network being down + my service call, called the guy into his office and said "You're fired, and your severance pay will be used to pay for this."  From what I understand, lawyers were used and the guy ended up paying out the nose for, like, totally winning the game.
---------------------------------
Taken out of context I must seem so strange - Ani DiFranco

Correction: by BadDoggie (4.00 / 2) #8 Tue May 09, 2006 at 03:19:55 AM EST
Almost totally winning the game. You pooched it, remember?

Didn't NT also allow multiple IP binding?

woof.

OMG WE'RE FUCKED! -- duxup ?

[ Parent ]
I think so. by blixco (4.00 / 1) #10 Tue May 09, 2006 at 03:55:51 AM EST
But win98 wasn't supposed to.  It was a bug.
---------------------------------
Taken out of context I must seem so strange - Ani DiFranco
[ Parent ]
That sure looks like a lot of work by lm (4.00 / 2) #9 Tue May 09, 2006 at 03:22:21 AM EST
Just to find out why HuSi pages are loading slowly.

There is no more degenerate kind of state than that in which the richest are supposed to be the best.
Cicero, The Republic