Troubleshooting
Diary
By jayhawk88 (Wed Mar 02, 2011 at 10:37:56 AM EST)
Help me think through this


We have an Electronic Medical Records (EMR) application. Part of the function of this EMR app is to allow users to send and receive information (largely prescriptions, I believe) via fax.

To accomplish this, a dedicated box we call the FaxPress (basically a custom XP box with a modem bank in it) interfaces with the EMR, or more to the point, the EMR's SQL backend databases (two of them, one for each of two separate clinics/practices). On the SQL box are four Scheduled Tasks that run two EXE's: FaxIn and FaxOut, who's purposes should be obvious. There is a task running each EXE for each of the two EMR databases, set to run every 5 minutes.

This setup works great, for about two weeks, give or take a day. Then, at a seemingly random time, the FaxOut Scheduled Tasks will get hung. Normally the task/process will take a max of 10 seconds to run, but for some reason they will suddenly hang, and the task will never complete. The next time the Scheduled Task is set to run, it sees an instance of the task already running, and begins to fail. When this happens, eventually the FaxIn tasks will begin to fail as well, though in a different way, simply returning an error code.

At this point, it's game over. Usually no one notices that faxing is not working until an hour or so after the tasks first hang. You can manually stop the hung tasks (and see them hung in Task Manager as well), end processes, etc, and they will begin to run properly again, but never for more than 3 or 4 times, and eventually they will either hang again, or just not run properly. You can go in and manually run the EXE's (it's just running either faxin.exe or faxout.exe with a simple parameter referencing the database in question), and this seems to work, but does not solve the problem of the Scheduled Tasks failing. The only sure-fire cure is to reboot the server.

We've talked to the vendor about this a couple of times, and outside of manually running the EXE's, their suggestion is always to reboot the server. Even better: they suggest that we should set up an automatic reboot of the server every week to avoid this problem. Yeah.

Now, I didn't just fall off the turnip truck here. I know Windows Servers need reboots once in a while, just because, and damned if anyone really knows why. But I'm not going to reboot a Windows Server 2008 R2 server every week because two fucking EXE's tend to hang after 14 days of operation. I'm just not; that's where I draw the fucking line. Maybe that "fix" works for Aunt Bea at the Mayberry Clinic, but not with me, dork. Your shit's retarded.

What am I looking at here? Memory leak? Just this last time is when I noticed the FaxOut.exe's were hanging in Task Manager; after a reboot last night, they are not. I can watch memory usage though and see if there's a slow climb upward (though 2008 uses like 75% of your memory by default, could be tough). It looks like I can't restart the Task Scheduler service without doing some registry hack voodoo. These damn things will literally run like clockwork every 5 minutes, around the clock, and then just bam, toast. No rhyme or reason that I can find. No other mysterious errors in the log around the same time. It's a brand new server (less than a couple months old), I don't think I have bad sectors or faulty memory or anything like that.
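For the "slow climb upward" check, here is a rough sketch of one way to put a number on it. It assumes you are already logging (seconds, value) samples from somewhere, e.g. memory in use or a handle count from a performance counter. The function name and the sample data are made up for illustration; nothing here comes from the vendor.

```python
from typing import Sequence, Tuple


def growth_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of value against time, in units per hour.

    samples: (seconds_since_start, value) pairs, e.g. MB of memory in use.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one point in time")
    return cov / var * 3600.0


# Hypothetical readings: 100 MB at start, 110 MB after an hour, 120 MB after two.
readings = [(0, 100.0), (3600, 110.0), (7200, 120.0)]
print(round(growth_per_hour(readings), 2), "MB/hour")
```

A slope that stays clearly positive over several days would point toward a leak; a flat line would suggest looking elsewhere (locks, handles, the scheduler itself).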

Thoughts?

Troubleshooting | 27 comments (27 topical, 0 hidden) | Trackback
Presumably it's out of handles for the faxes, by ambrosen (2.00 / 0) #1 Wed Mar 02, 2011 at 11:15:07 AM EST
I don't know how you'd find out how many TAPI resources are being used, but I suspect that that's what'd give you the answer.

If that were the case though by jayhawk88 (2.00 / 0) #3 Wed Mar 02, 2011 at 11:59:48 AM EST
...running the processes manually wouldn't produce anything meaningful, right?

[ Parent ]
Handles by Herring (2.00 / 0) #26 Fri Mar 04, 2011 at 07:03:42 AM EST
That was my thought - or something being locked and waiting. Anyway, Process Explorer will tell you what handles the locked app is using. Might not help if it's waiting for a DB lock but if it's something else then there could be a clue in there.

christ, we're all old now - StackyMcRacky
[ Parent ]
Jiggle the mouse? by duxup (2.00 / 0) #2 Wed Mar 02, 2011 at 11:25:28 AM EST
n/t

____
It's about that bad by jayhawk88 (4.00 / 1) #4 Wed Mar 02, 2011 at 12:07:24 PM EST
This is the same vendor that is consistently just aghast that we might not want an SQL server full of PHI to have full access to the internet.

[ Parent ]
Yeesh by duxup (2.00 / 0) #5 Wed Mar 02, 2011 at 12:13:30 PM EST
Honestly, if I were in a situation where the vendor is that crappy... I'd just go with the resets. It is dumb, but fix this issue and, considering the vendor, I'm guessing you'll hit something else.

____
[ Parent ]
The main problem by jayhawk88 (2.00 / 0) #11 Wed Mar 02, 2011 at 01:10:03 PM EST
...is we have users in this at all hours of the day. Not many off-hours, but there's always one or two crazies in it who just have to check a chart at 3am or whatever.

Which doesn't prevent a weekly reboot if we pick a time and inform everyone, granted, but it just irks me no end that this is their "solution". One of the keyboard monkeys actually told us at one point that rebooting weekly is common for Windows 2008 servers. And again, yes, I know it's fun to joke about Microsoft instability and it's not undeserved. But this is a core server OS we're talking about here; real businesses do real work on Server 2008. It wouldn't be a viable product if this was "common".

Plus, it's a new issue that only popped up when they did a major version upgrade a couple months ago. I know this is just a bug of some kind in their software that they just don't want to cop to or admit to themselves.

Really this is just me venting, obviously. The ticket I submitted with the vendor about this is still just sitting there, no response yet. I'm sure coming up next is the suggestion that the problem is hardware related in some way. Maybe they'll ask me if my server room has adequate ventilation.

[ Parent ]
Version by duxup (2.00 / 0) #12 Wed Mar 02, 2011 at 01:41:41 PM EST
Well there ya go, if you can at least nail it down to a recent change in code you should have a good in with their support folk.

I agree, server reboot should not be necessary, or at least not because of the underlying OS unless you've got a documented bug you can point to with the OS. 

____
[ Parent ]
DAMMIT!!!!! by clock (2.00 / 0) #6 Wed Mar 02, 2011 at 12:32:55 PM EST
OK!  Seriously!  I wrote an app years ago that sent faxes.  After n faxes or y hours (where n was not related to y and where n and y were nearly constants) it would stop sending without dropping anything to the logs or having any legit excuse for quitting.  I never, EVER got a handle on it and since we were a small shop that didn't have time to babysit that shit, I had the service in question bounced every 24 hours and the problem never came back.

This is some wicked stupid bullshit.  Good luck, my son...

Also, CAN'T THE FUCKING FAX JUST LAY DOWN AND DIE LIKE A RESPECTABLE DEAD TECH?!?!?


I agree with clock entirely --Kellnerin

When I was buying the condo by wiredog (2.00 / 0) #7 Wed Mar 02, 2011 at 12:56:03 PM EST
I had to fax all sorts of stuff. Get an e-mail, print it out, sign it, then fax it back. No, you can't scan it and email a pdf, it MUST BE FAXED! VEE HAFF VAYS OF MAKINGK YOU FAX!

There were a few occasions where it would've been easier to hand carry the docs to the office than to fax them...

Earth First!
(We can strip mine the rest later.)

[ Parent ]
s/who's/whose by ammoniacal (2.00 / 0) #8 Wed Mar 02, 2011 at 12:56:36 PM EST
Who's == "Who is," or "belonging to Who."

"To this day that was the most bullshit caesar salad I have every experienced..." - triggerfinger

Can you change the timing? by wiredog (2.00 / 0) #9 Wed Mar 02, 2011 at 12:58:34 PM EST
If it takes 2 weeks to hang @ every five minutes, will it take 4 weeks at 10 minutes? 3 days at every 1 minute?
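The arithmetic behind that test, as a quick sketch. It assumes the hang is triggered by a fixed number of runs per task rather than by elapsed time, which is exactly the thing the experiment would confirm or rule out. The function names are just for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def runs_before_hang(days_to_hang: float, interval_minutes: float) -> float:
    """How many times one task runs before the hang shows up."""
    return days_to_hang * MINUTES_PER_DAY / interval_minutes


def days_until_hang(runs: float, interval_minutes: float) -> float:
    """Days to reach the same run count at a different interval."""
    return runs * interval_minutes / MINUTES_PER_DAY


# Observed: roughly 14 days at one run every 5 minutes.
runs = runs_before_hang(14, 5)
for interval in (1, 5, 10):
    days = days_until_hang(runs, interval)
    print(f"every {interval:>2} min: {days:.1f} days to reach {runs:.0f} runs")
```

If the hang still shows up at about two weeks after changing the interval, the trigger is more likely elapsed time (uptime, a tick counter) than run count.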

Earth First!
(We can strip mine the rest later.)

That'd be an interesting test by jayhawk88 (2.00 / 0) #10 Wed Mar 02, 2011 at 01:01:45 PM EST
I could probably get away with bumping it to 10 without anyone noticing.

[ Parent ]
debugging it .. by sasquatchan (2.00 / 0) #13 Wed Mar 02, 2011 at 02:03:26 PM EST
can you remote log into the machine ?

The various Sysinternals tools are great at showing what the resource utilization of the programs is, as well as what the threads are doing, with in-place call stacks etc. The author, Mark Russinovich, has a great blog where he uses solely those tools (and his know-how about how programs, Win32 etc. work) to debug problems. Look for his "The Case of xxxx" entries.

http://blogs.technet.com/b/markrussinovich/


Debugging vendor products? by dmg (4.00 / 2) #14 Wed Mar 02, 2011 at 02:35:26 PM EST
Just don't. They sold it to you, you are paying for support, make those bastards fix their wack shit. And if the solution is a reboot, so be it. You lose all rights to complain about shoddy engineering the moment you use a Microsoft product to support a mission-critical application. Have we learned NOTHING over all these years? 
--
dmg - HuSi's most dimwitted overprivileged user.
[ Parent ]
it wasn't clear by sasquatchan (2.00 / 0) #21 Thu Mar 03, 2011 at 04:04:52 PM EST
what's the vendor's, and what's his code. Does he just own the DB?

[ Parent ]
I have no idea, sounds like it's third party. by dmg (4.00 / 1) #27 Fri Mar 04, 2011 at 10:54:29 PM EST
I have a major beef when people complain about apps running on Windows not being stable. You made the choice, you picked the platform, it's not like there's any shortage of publicity on the engineering flaws in Windows. What did you expect? And then when employers expect staff to debug issues with vendor-supplied solutions, I just start to see red. Oftentimes, the engineers supporting crappy Windows apps would be more than capable of coding up a reliable alternative, but the corporate policy is 'buy, don't build'.

Anyway nice to get that off my chest.
--
dmg - HuSi's most dimwitted overprivileged user.
[ Parent ]
you're boned by StackyMcRacky (2.00 / 0) #15 Wed Mar 02, 2011 at 02:54:51 PM EST
back at my former employer, I can't tell you the number of Windows servers that had weekly reboots, just to keep the services from crashing during the week.  A certain healthcare company that starts with Mc and ends with Kesson was the absolute worst with this problem.  That was their fix for EVERYTHING.  Complete shit software.


Can you edit the scheduled task for FaxOut by chuckles (4.00 / 1) #16 Wed Mar 02, 2011 at 03:08:52 PM EST
so the first thing it does is kill any FaxOut.exe processes?
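A rough sketch of what such a wrapper could look like, if the Scheduled Task pointed at a script instead of at the EXE directly. The 60-second limit and the helper names are assumptions, and the actual faxout.exe parameter isn't given in the diary, so the command line is left for the caller to supply. `taskkill /F /IM` is the stock Windows way to kill by image name. This only contains the symptom; it does nothing about whatever makes the process hang.

```python
import subprocess
import sys
from typing import List, Optional


def kill_stale(image_name: str) -> None:
    """Force-kill leftover processes by image name (Windows only)."""
    # taskkill exits non-zero when nothing matches; that is fine here.
    subprocess.run(["taskkill", "/F", "/IM", image_name], check=False)


def run_with_timeout(cmd: List[str], timeout_seconds: float) -> Optional[int]:
    """Run cmd and return its exit code, or None if it had to be killed.

    subprocess.run kills the child itself when the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


def run_fax_task(cmd: List[str], image_name: str, timeout_seconds: float = 60) -> Optional[int]:
    """Clear any hung instance first, then run with a hard time limit."""
    if sys.platform == "win32":
        kill_stale(image_name)
    return run_with_timeout(cmd, timeout_seconds)
```

Since a normal run takes at most about 10 seconds, anything still alive after a minute can reasonably be treated as hung. A None return is also a handy place to write a log line with the timestamp, which would help pin down when the first hang actually happens.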

"The one absolutely certain way of bringing this nation to ruin [...] would be to permit it to become a tangle of squabbling nationalities"
Perhaps by jayhawk88 (2.00 / 0) #17 Wed Mar 02, 2011 at 03:22:39 PM EST
But I'm not sure if that would fix the problem. Even if the hung processes are killed manually, eventually the problem creeps up again. Maybe worth a look though.

[ Parent ]
locking in the database? by lm (2.00 / 0) #18 Thu Mar 03, 2011 at 06:07:49 AM EST
Have you checked for deadlocked, blocked or suspended processes in the database when the binaries start failing?

Also, it may or may not be informative to set up a trace in the database to see what the binaries are doing to the data when they hang.
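A sketch of the blocked-process check, assuming the backend is Microsoft SQL Server (the diary only says "SQL", so that is a guess). The query reads the `sys.dm_exec_requests` view; running it needs whatever client or driver you already use, which is not shown here. The helper just takes the (session_id, blocking_session_id) pairs it returns and picks out the sessions at the head of each blocking chain.

```python
from typing import Iterable, List, Tuple

# Requests that are currently waiting on another session (SQL Server assumed).
BLOCKED_REQUESTS_SQL = """
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
"""


def root_blockers(rows: Iterable[Tuple[int, int]]) -> List[int]:
    """Sessions that block others but are not themselves blocked.

    rows: (session_id, blocking_session_id) pairs from the query above.
    A root blocker may be idle, so it can appear only as a blocking id.
    """
    pairs = list(rows)
    blocked = {session for session, _ in pairs}
    blockers = {blocker for _, blocker in pairs if blocker > 0}
    return sorted(blockers - blocked)


# Hypothetical snapshot: 55 and 61 wait on 52, and 60 waits on 55.
print(root_blockers([(55, 52), (60, 55), (61, 52)]))
```

If the hung FaxOut sessions show up as blocked, the root blocker is the one to look at. An empty result while rows are still blocked would mean the sessions are waiting on each other in a cycle.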


There is no more degenerate kind of state than that in which the richest are supposed to be the best.
Cicero, The Republic
and along the same lines by lm (2.00 / 0) #19 Thu Mar 03, 2011 at 06:13:26 AM EST
have you tried restarting the database after one of these incidents?

There is no more degenerate kind of state than that in which the richest are supposed to be the best.
Cicero, The Republic
[ Parent ]
No by jayhawk88 (2.00 / 0) #20 Thu Mar 03, 2011 at 09:30:30 AM EST
Mostly because, from the users' perspective, that's as bad as restarting the server. But it would at least tell me if it's a database problem or a problem with the OS. Might have to try that next time.

[ Parent ]
Look for blocked processes before you kill it by lm (2.00 / 0) #22 Thu Mar 03, 2011 at 06:44:34 PM EST
I didn't notice whether you mentioned which database you're using or not, but most modern databases have pretty good tools that let you see which processes are blocked by which other processes.

There is no more degenerate kind of state than that in which the richest are supposed to be the best.
Cicero, The Republic
[ Parent ]
Clock tick count by lb008d (2.00 / 0) #23 Fri Mar 04, 2011 at 12:32:41 AM EST
A shot in the dark:

http://ayende.com/Blog/archive/2009/10/28/an-epic-bug-story.aspx

Now that I've read more ... by lb008d (2.00 / 0) #24 Fri Mar 04, 2011 at 12:38:25 AM EST
I'd look into locks on the DB first.

And no, Windows servers shouldn't need reboots, ever - that is a hack fix like you said.

[ Parent ]
That link brought up another idea by lm (2.00 / 0) #25 Fri Mar 04, 2011 at 06:57:51 AM EST
Is it possible that the executable creates temporary files that get blown away after a reboot?

There is no more degenerate kind of state than that in which the richest are supposed to be the best.
Cicero, The Republic
[ Parent ]