Sinking feeling
By ucblockhead (Sun May 21, 2006 at 08:58:22 AM EST)

Over the course of the last few months, my always-on Linux server has turned off from time to time. Never when I'm there, so I've always blamed it on contractors turning the power off. (Even though I have a UPS.)

Yesterday, I tried to use the machine only to have it report errors on all disk read attempts. I rebooted and the BIOS didn't recognize any of my three SATA drives. I left it off for a few hours as we had a party to return to. I powered it on afterwards and it worked fine. I blamed it on heat.

The next morning, it was still on. I tried to use it and immediately got disk read fails. I again rebooted and again none of the drives were recognized. This is a bit odd as two drives are attached to the motherboard and another to a separate card. I fiddled with cables and got it to boot long enough for it to try an automatic fsck but not long enough to finish. Then I realized that all three drives were connected on the same power line. (I have a newer power supply with power cables that plug in at both ends.) Fiddling with cables seemed to get it to stop failing to recognize the drives. I booted on my Knoppix disk. Three partitions seemed fine. One, it couldn't recognize. My home directory.

This caused some consternation as just yesterday I'd noticed that one important directory wasn't being backed up: the directory that contained about 500 scanned family photos, including all the wedding pictures.

Booting without Knoppix eventually got me a single-user root prompt. Running fsck reported a couple nasty errors, but now the machine seems to boot correctly. I wonder what I should do...

(Besides making backups, that is.)

hmm by theantix (2.00 / 0) #1 Sun May 21, 2006 at 09:09:17 AM EST
fsck giving some nasty errors is pretty much normal if you've abused the disks a little bit, but probably isn't a big concern unless it keeps happening.

If moving the cables around seems to have helped, why mess with it?  But if it keeps on dying from power-related problems, getting a larger power supply is of course your best bet.

You sir, are worse than Hitler.

It's plenty big by ucblockhead (2.00 / 0) #2 Sun May 21, 2006 at 09:15:36 AM EST
The power supply is plenty big enough... I just wonder if it (or the cables) is being flaky.

I'm 95% sure the fsck errors were caused by random power loss, not disk problems. I wonder if it could be some motherboard issue, though one drive is on a separate SATA card. Basically, the drives all seem to power down on their own together.

Probably motherboard / PSU, but by ni (4.00 / 1) #3 Sun May 21, 2006 at 09:47:45 AM EST
don't forget the obvious stuff: The power cable and socket. Normally I wouldn't bother mentioning this, but I woke up 20 minutes ago to discover that the socket had spontaneously ejected the plug of the power cable for my server.

Stranger still, looking at the logs indicates that this happened at the exact moment I woke up. It was down for maybe 4 minutes. I suppose I must have sensed a disturbance in the force.

Yeah by ucblockhead (2.00 / 0) #4 Sun May 21, 2006 at 10:11:56 AM EST
I changed a couple connectors and jiggled everything just in case.

You should look at the drive status with smartctl by crux (2.00 / 0) #5 Sun May 21, 2006 at 10:56:01 AM EST
Also, make sure your cooling fans aren't dead -- especially the power supply fan, perhaps.

I've had somewhat chronic problems with the only cooling fan in my always-on server crapping out; the blessed fail-safe auto-shut-off-on-overheat has saved me from a meltdown two or three times now. (Not counting the times I hadn't figured out the problem!)

I don't think it's a fan by ucblockhead (2.00 / 0) #6 Sun May 21, 2006 at 11:58:14 AM EST
I'm not hearing any bad bearing noise and they seem to be running. I'm also not feeling particular heat from the box. We'll see what happens when it's on for a day.

Last time this happened to me by localroger pod person (2.00 / 0) #7 Sun May 21, 2006 at 04:18:23 PM EST
turned out the CPU heat sink was thoroughly clogged with dust.  Wasn't apparent until I removed the heatsink and fan to get the expansion cards out so I could put them in another machine.  The CPU fan still ran, but no air could get through because of the fin cloggage.

hmmmm by ucblockhead (2.00 / 0) #8 Sun May 21, 2006 at 04:32:34 PM EST
I'll give it a look if it dies overnight.

it's the hardware silly! by martingale (4.00 / 2) #9 Sun May 21, 2006 at 09:09:28 PM EST
Use the right kind of hardware for Linux and you'll never have these kinds of problems. I suggest a quartz wristwatch. When have you ever heard of a wristwatch failure? They last for years! Much longer than that flaky hardware you're talking about.

P.S. Wristwatches rule! I know where I can get you a Rolex for a great price, just say the word.
$E(X_t|F_s) = X_s,\quad t > s$

Thanks, but by ucblockhead (4.00 / 2) #10 Mon May 22, 2006 at 06:50:26 AM EST
A guy just emailed me about a great deal on a Relox. It comes with a free bottle of Veagra!