Diary of a geek

Andrew Pollock

Friday, 28 November 2008

Wising up about SMART

I've always installed the smartmontools package on my servers because I figured it'd maybe give me some early warning about an impending disk failure.

Getting proper SMART access was one of the motivating factors in moving my home-brew SAN from IDE-to-USB disks to native SATA.

Since then, I've had two drives fail on me: one of the original four that I bought, and the replacement that Seagate shipped to me. (Incidentally, I love how your only recourse when a replacement drive fails is that Seagate will ship you another reconditioned replacement drive. I can see the shipping expenses racking up very quickly.)

I decided to do the expedited return when the second drive failed. It meant I didn't have to run my RAID-10 a disk short for a week while I waited for Seagate to receive my drive and ship me a replacement, and it also meant I didn't have to mess around with packaging the drive myself.

(Incidentally, I wonder what happens in Seagate's reconditioning process? Will someone else get my failed drive (i.e. the same serial number) after they replace the platters or whatever it was that failed? Would it be interesting to set up a website where people could register the serial numbers of drives they returned to Seagate, and how they died, and at how many power-on hours? How many times does a drive get "reborn"?)

Whilst I was waiting for the replacement drive to arrive, I messed around with the failed one (I removed it from the RAID array), fiddling with the SMART self-tests that smartctl can initiate for you. The faulty drive failed the long self-test at the same LBA every time I ran it.
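For reference, that fiddling boils down to a couple of smartctl invocations. Here's a rough sketch in Python that only builds the command lines. The helper name is made up for illustration, and whether you need a `-d` device type (such as `sat`) depends on your controller, so treat that argument as an assumption.

```python
def smartctl_commands(device, device_type=None):
    """Build smartctl command lines for a long self-test and its log.

    Hypothetical helper: it only constructs argument lists and does not
    run anything. Pass a list to subprocess.run() as root to execute it.
    """
    base = ["smartctl"]
    if device_type:
        # e.g. "sat" for an ATA disk behind a SCSI-to-ATA translation layer
        base += ["-d", device_type]
    return {
        # start the extended (long) self-test
        "start_long_test": base + ["-t", "long", device],
        # show the self-test log, which lists the LBA of the first error
        "show_selftest_log": base + ["-l", "selftest", device],
    }
```

Running the "show_selftest_log" command after each test is where a repeated failure at the same LBA would show up.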

It was around this time that I realised that the default configuration that comes with the Debian smartmontools package isn't all that useful with SATA disks.

It ships with

DEVICESCAN -m root -M exec /usr/share/smartmontools/smartd-runner

which, as it turns out, doesn't detect SATA drives properly. I got

Nov 24 22:00:13 minotaur smartd[2990]: Device /dev/sda: ATA disk detected behind SAT layer 
Nov 24 22:00:13 minotaur smartd[2990]:   Try adding '-d sat' to the device line in the smartd.conf file. 
Nov 24 22:00:13 minotaur smartd[2990]:   For example: '/dev/sda -a -d sat' 

logged. It seems to me that if the software can detect this situation, it should be able to just try the "-d sat" behaviour automatically.
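Since smartd won't do it for you, here's a minimal sketch of scanning the log for that hint and turning it into suggested smartd.conf entries. The regular expression is based only on the wording of the lines logged above, and the function name is hypothetical, so adjust both if your smartd version phrases things differently.

```python
import re

# Matches the hint smartd logged above; the wording is taken from those lines.
SAT_HINT = re.compile(
    r"smartd\[\d+\]: Device (/dev/\S+): ATA disk detected behind SAT layer"
)

def suggest_sat_entries(log_lines):
    """Return one suggested smartd.conf line per device that needs '-d sat'."""
    devices = []
    for line in log_lines:
        match = SAT_HINT.search(line)
        # keep first-seen order and skip devices already recorded
        if match and match.group(1) not in devices:
            devices.append(match.group(1))
    # same form as the example smartd itself prints in the log
    return [f"{device} -a -d sat" for device in devices]
```

Feed it the lines of your syslog (for example from `open("/var/log/syslog")`, if that is where smartd logs on your system) and review the suggestions before pasting them into the config.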

So it seems that this whole time, I haven't been getting any SMART monitoring of my SATA drives. hddtemp has been happily logging the temperatures for me though.

So I took the advice in the comments of /etc/smartd.conf and removed the DEVICESCAN entry in favour of explicitly naming the disks I wanted to monitor.

I also discovered that I can have smartd kick off scheduled short and long self-tests on whatever schedule floats my boat.

So now I've got something like this:

/dev/sda -d sat \
         -M exec /usr/share/smartmontools/smartd-runner \
         -s (L/../.././01|S/../.././(00|12))

This runs a long self-test at 1am every day, and a short one at midnight and midday. Hopefully this will help predict the next drive failure a bit earlier.

I'm now debating whether to set up an active Nagios monitor for SMART, or to see if I can write a passive one that runs as part of the (pretty neat) smartd-runner infrastructure that the smartmontools package provides. I guess it's already set up to email me, so it doesn't really matter whether it's Nagios emailing me or smartd itself.
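A passive check could be little more than a script that formats a Nagios external command from the environment variables smartd sets when it runs a `-M exec` program (SMARTD_DEVICE, SMARTD_FAILTYPE, SMARTD_MESSAGE). This is only a sketch under assumptions: the function name and the "SMART" service description are hypothetical, it always reports CRITICAL because smartd only calls the script when something is wrong, and actually writing the line to the Nagios command file (whatever `command_file` points at in your nagios.cfg) is left out.

```python
import time

CRITICAL = 2  # standard Nagios plugin return code for CRITICAL

def passive_check_line(host, service, env, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    timestamp = int(time.time() if now is None else now)
    device = env.get("SMARTD_DEVICE", "unknown device")
    failtype = env.get("SMARTD_FAILTYPE", "unknown")
    # Collapse whitespace so the plugin output stays on a single line.
    message = " ".join(env.get("SMARTD_MESSAGE", "").split())
    output = f"{device} {failtype}: {message}".rstrip()
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{CRITICAL};{output}"
    )
```

In a real script you would call it as `passive_check_line("minotaur", "SMART", os.environ)` and append the result, plus a newline, to the command file.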

[23:05] [tech] [permalink]