08.03.2005 09:38

Is your hard drive failing? smartmontools for Linux and some BSDs


Some posts to the Charlotte [North Carolina, U.S.A.] Linux User Group 'discuss' mailing list raised the question of whether a listmember's hard drive was failing, and drew a helpful suggestion about determining and monitoring the drive's health. Briefly, one of the listmember's log files contained:
Aug 2 17:01:44 tc kernel: hda: irq timeout: status=0xd0 { Busy }
Aug 2 17:01:44 tc kernel:
Aug 2 17:01:44 tc kernel: ide0: reset timed-out, status=0xd0
Aug 2 17:01:44 tc kernel: hda: status timeout: status=0xd0 { Busy }
Aug 2 17:01:44 tc kernel:
Aug 2 17:01:44 tc kernel: hda: drive not ready for command
Aug 2 17:01:44 tc kernel: ide0: reset: success
While warning the listmember to copy data off /dev/hda, another listmember suggested installing smartmontools, which in Debian is a simple `apt-get install smartmontools` away. The smartmontools page includes installation instructions for .rpm-based Linux distros and other platforms.
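
For anyone who wants to watch for messages like the ones quoted above, here is a minimal Python sketch that picks them out of a log. The regular expression and the list of "warning sign" substrings are my own assumptions, based only on the excerpt; other kernels and drivers word these messages differently.

```python
import re

# Matches kernel lines shaped like the excerpt, e.g.
#   "Aug 2 17:01:44 tc kernel: hda: irq timeout: status=0xd0 { Busy }"
LINE_RE = re.compile(r"kernel: (?P<dev>(?:hd[a-z]|ide\d+)): (?P<msg>.+)$")

# Substrings treated as warning signs (an assumption drawn from the excerpt only).
WARNING_SIGNS = ("timeout", "timed-out", "not ready")


def suspicious_lines(log_text):
    """Return (device, message) pairs for kernel lines that mention a warning sign."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and any(sign in match.group("msg") for sign in WARNING_SIGNS):
            hits.append((match.group("dev"), match.group("msg")))
    return hits


# The messages from the listmember's log, one per line.
SAMPLE = """\
Aug 2 17:01:44 tc kernel: hda: irq timeout: status=0xd0 { Busy }
Aug 2 17:01:44 tc kernel: ide0: reset timed-out, status=0xd0
Aug 2 17:01:44 tc kernel: hda: status timeout: status=0xd0 { Busy }
Aug 2 17:01:44 tc kernel: hda: drive not ready for command
Aug 2 17:01:44 tc kernel: ide0: reset: success
"""

if __name__ == "__main__":
    for dev, msg in suspicious_lines(SAMPLE):
        print(dev, "->", msg)
```

A hit from a script like this is only a prompt to back up and look closer; it says nothing definite about the drive's health, which is where smartmontools comes in.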

I installed smartmontools and read the manpages for smartd, smartd.conf and smartctl.

I needed to edit /etc/smartd.conf and /etc/default/smartmontools. Specifically, I commented out DEVICESCAN in /etc/smartd.conf and uncommented the /dev/hda line in that file, and I also chose to append '-m [my.email.account@my.domain]' to the /dev/hda line. I also edited /etc/default/smartmontools so that the relevant lines now read 'enable_smart="/dev/hda"' and 'start_smartd=yes'.
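
The edits to /etc/smartd.conf described above can be sketched in Python as a function over the file's text. This is only an illustration: the sample file contents are hypothetical (your distribution's default smartd.conf will look different), the e-mail address is a placeholder, and the function name is mine. It works on a string and never touches the real file.

```python
def edit_smartd_conf(text, device="/dev/hda", email="my.email.account@my.domain"):
    """Comment out DEVICESCAN, enable the first commented-out line for
    `device`, and append a '-m' mail directive to that line."""
    out = []
    enabled = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("DEVICESCAN"):
            # Disable the catch-all scan so only the listed device is monitored.
            out.append("#" + line)
        elif (not enabled and stripped.startswith("#")
              and stripped.lstrip("#").split()[:1] == [device]):
            # Uncomment the device line and ask smartd to mail warnings.
            out.append(stripped.lstrip("#").strip() + " -m " + email)
            enabled = True
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a smartd.conf, for illustration only.
SAMPLE_CONF = """\
# Hypothetical excerpt
DEVICESCAN
#/dev/hda -a
#/dev/hdb -a
"""

if __name__ == "__main__":
    print(edit_smartd_conf(SAMPLE_CONF), end="")
```

Doing the edit by hand in a text editor is just as good; the sketch is only meant to make the three changes (disable DEVICESCAN, enable the /dev/hda line, add -m) explicit.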

After installing smartmontools, a drive's current status can be reported by running `smartctl -A /dev/hdx` as root, where 'x' is the letter assigned to the drive. The output includes VALUE, WORST, and THRESH columns.

When I first read the output on my desktop, I was concerned because VALUE and WORST were greater than THRESH. When I carefully reread the post to discuss@charlug.org, though, I saw that I had it backwards: VALUE and WORST sitting above THRESH is the healthy case, and failure is imminent only when they drop to or below THRESH.
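
To keep the direction of the comparison straight, here is a small Python sketch that reads the attribute table from `smartctl -A` and flags any attribute whose VALUE or WORST has reached THRESH. The sample table is invented for illustration and does not come from a real drive. The column positions assume the usual `ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...` layout, and as far as I can tell from the smartctl manpage a value equal to the threshold also counts as failed, so the sketch uses "less than or equal".

```python
def parse_attributes(output):
    """Parse the attribute table printed by `smartctl -A` into a list of dicts."""
    rows = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True  # header found; attribute rows follow
            continue
        if in_table and len(fields) >= 6 and fields[0].isdigit():
            rows.append({
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
            })
    return rows


def flagged(rows):
    """Return (name, when) pairs: 'now' if VALUE <= THRESH,
    'past' if only WORST <= THRESH."""
    result = []
    for row in rows:
        if row["value"] <= row["thresh"]:
            result.append((row["name"], "now"))
        elif row["worst"] <= row["thresh"]:
            result.append((row["name"], "past"))
    return result


# Invented sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   009   009   010    Pre-fail  Always   FAILING_NOW 1200
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9000
"""

if __name__ == "__main__":
    for name, when in flagged(parse_attributes(SAMPLE)):
        print(name, "at or below threshold:", when)
```

On a healthy drive the flagged list should come back empty, which is exactly the "VALUE and WORST greater than THRESH" situation that worried me at first.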