A nice graphical interface for smartmontools

In my article Out SMART your hard drive, I discussed smartmontools and the smartctl command line utility in detail. The article shows how to view SMART data on a hard drive, run self-tests, and configure smartmontools to generate alerts when a drive is about to fail or has already failed. Recently I learned about GSmartControl, which is a graphical front-end to smartmontools. While I’ve only played with it a bit, it looks like a pretty solid piece of software! The project website has a number of screenshots, and you can download the source from there as well. Nice!

SMART utilities for your favorite operating system

While perusing the web a few weeks back, I came across SMARTReporter. SMARTReporter is a wicked cool software package that can be used to monitor hard drive SMART data under OS X, and it is 100% free (you should probably send a small donation to the author if you decide to use it). Now that I have SMARTReporter in my software arsenal, I have a tool to monitor SMART data on each of the operating systems I support:

– For Solaris, OpenBSD, FreeBSD and Linux, I use Smartmontools

– For OS X, I use SMARTReporter

– For Windows, I use Active SMART

All three packages rock, and they have saved my bacon on more than one occasion!

Solaris needs SMART support! Please help!

While attempting to run the smartctl utility a few weeks back on an x86 Solaris 10 host with IDE disk drives, I received the following error:

$ smartctl -a /dev/dsk/c1d0s0

smartctl version 5.36 [i386-pc-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

#######################################################################
ATA command routine ata_command_interface() NOT IMPLEMENTED under Solaris.
Please contact smartmontools-support@lists.sourceforge.net if
you want to help in porting smartmontools to Solaris.
#######################################################################

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

It turns out this error is caused by several missing ioctls in the x86 Solaris cmdk IDE device driver. Since I have always wanted to develop (or modify) a kernel device driver, I decided to start reviewing the source code in cmdk.c to see what would be needed to make smartmontools happy. After reading through the cmdkioctl routine in cmdk.c, it dawned on me that the SPARC IDE driver, dad, contained the ioctls that are used by smartmontools. If the dad device driver used the underlying DDI framework, it would be relatively trivial to port the missing pieces from the dad driver (the SPARC IDE driver) to the cmdk driver (the x86 Solaris IDE driver). Well — it turns out the dad driver is closed source (so much for “open” solaris, ey?), so that dashed my hopes of porting code. :( Since there were two open bugs related to SMART support:

Bug ID: 4665068 SMART support in IDE driver
Bug ID: 6280687 Collect SMART data from disks and deliver info to FMA

I decided to call the Sun support center to get the current status of these bugs (I had already pinged my sales team to get the status of these bugs, but my inquiries got silently routed to /dev/null). The support folks weren’t able to tell me if and when these bugs would be fixed, but they did inform me that I could add my employer’s name to the bugs. They also mentioned that bug fixes are driven by customer demand, so hopefully folks who read my blog can help me out here. Since disk drives are the single component most likely to fail in a system, and SMART can help to proactively detect when a disk drive will fail, it seems only logical that Sun would devote resources to adding SMART support (or at least fixing the existing bugs so people can use third-party packages). If you happen to have a support contract with Sun, I would greatly appreciate it if you could open a case (either online or by phone), request the status of these bugs, and ask that your company’s name be attached to them. Hopefully the customers’ voice will be heard. :)

Getting around smartmontools linker errors

I posted a blog entry a while back that showed how to set up smartd on a Solaris system. Two individuals left comments indicating that smartmontools wouldn’t build, and one individual (Peter) received the following errors during the build process:

gcc  -g -O2 -Wall -W -c `test -f 'os_solaris_ata.s' || echo './'`os_solaris_ata.s
gcc  -g -O2 -Wall -W   -o smartd  smartd.o atacmdnames.o  atacmds.o ataprint.o knowndrives.o scsicmds.o scsiprint.o utility.o os_solaris.o os_solaris_ata.o -lnsl
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea5feef is non-aligned
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea5fef5 is non-aligned
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea5fef9 is non-aligned
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea5fefd is non-aligned
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea72806 is non-aligned
ld: fatal: relocation error: R_SPARC_32: file os_solaris_ata.o: symbol : offset 0xfea77436 is non-aligned
collect2: ld returned 1 exit status

I was able to recreate the problem on my Sun Ultra 10 running Solaris 10. To get smartmontools to compile, I ran “configure,” “make,” and then waited for the build process to fail. Once the build errored out, I removed the “-g” option from the gcc options and executed the following statements by hand:

$ gcc -O2 -Wall -W -c `test -f 'os_solaris_ata.s' || echo './'`os_solaris_ata.s

$ make

Since os_solaris_ata.s is a hand-coded SPARC assembly file, I am not 100% certain why the smartmontools authors are trying to add debugging symbols to the object file. I will need to do some additional digging to find the answer.
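
If you hit the same failure and don't want to retype the compile line, the "drop -g and re-run" step can be scripted. This is only a rough sketch: strip_debug_flag is a made-up helper name (it is not part of smartmontools or its build system), and it simply removes any standalone -g token from a command line you pass in, collapsing runs of whitespace along the way.

```shell
#!/bin/sh
# Hypothetical helper: print the command line given as $1 with every
# standalone "-g" token removed. Flags such as "-ggdb" are left alone.
# Assumption: no argument contains embedded whitespace, since the
# command is re-joined with single spaces.
strip_debug_flag() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if ($i != "-g") {
                out = out (out == "" ? "" : " ") $i
            }
        }
        print out
    }'
}
```

For example, strip_debug_flag "gcc  -g -O2 -Wall -W -c ./os_solaris_ata.s" prints the same command without -g, which you can then paste back into the shell before re-running make.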

Smartmontools saves the day!

While booting up my x86 laptop this week, I noticed the following errors on the console:

Feb 26 18:16:54 zebox smartd[492]: Device: /dev/ad0, 1 Currently unreadable (pending) sectors
Feb 26 18:16:54 zebox smartd[492]: Device: /dev/ad0, 1 Offline uncorrectable sectors
Feb 26 18:46:55 zebox smartd[492]: Device: /dev/ad0, 1 Currently unreadable (pending) sectors
Feb 26 18:46:55 zebox smartd[492]: Device: /dev/ad0, 1 Offline uncorrectable sectors

Eeeeep — it looks like the disk drive is going bad. To verify this, I decided to run smartctl against the device:

$ smartctl -a /dev/ad0

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       92
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   062   030    Pre-fail  Always       -       6508058
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       329
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       96
187 Unknown_Attribute       0x0032   094   094   000    Old_age   Always       -       6
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   065   055   045    Old_age   Always       -       622067747
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       56
193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       21143
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   078   054   000    Old_age   Always       -       177096143
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Hmmm — the value of Seek_Error_Rate looks extremely high, so I decided to run smartctl a second time to see if the value of Seek_Error_Rate was climbing:

$ smartctl -a /dev/ad0

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       92
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   062   030    Pre-fail  Always       -       6508123
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       329
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       96
187 Unknown_Attribute       0x0032   094   094   000    Old_age   Always       -       6
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   065   055   045    Old_age   Always       -       622067747
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       56
193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       21146
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   078   054   000    Old_age   Always       -       177096157
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Sure enough, the value was increasing at a staggering rate! Since I had just purchased the drive from NewEgg, I gave them a call, and they are going to send me a replacement. Viva la smartmontools!
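
Eyeballing two full attribute tables works, but the comparison is easy to script. The following is a minimal sketch, not something that ships with smartmontools: smart_raw_value and smart_raw_delta are hypothetical helper names, and the column positions assume the attribute table layout shown above (ATTRIBUTE_NAME in column 2, RAW_VALUE in column 10).

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on stdin and print
# the RAW_VALUE (column 10) of the attribute named by $1.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Hypothetical helper: print how much the raw value of attribute $1 grew
# between two saved snapshots, $2 (earlier) and $3 (later).
# Returns non-zero if the attribute is missing from either snapshot.
smart_raw_delta() {
    before=$(smart_raw_value "$1" < "$2")
    after=$(smart_raw_value "$1" < "$3")
    [ -n "$before" ] && [ -n "$after" ] || return 1
    echo $((after - before))
}
```

With the two tables above saved as before.txt and after.txt (for instance via smartctl -A /dev/ad0 > before.txt), running smart_raw_delta Seek_Error_Rate before.txt after.txt would print 65, the difference between 6508123 and 6508058.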

Adding SMART support to FreeBSD

I first ran across smartmontools about a year ago, and like to install it on all of the systems I manage. This helps me understand the current health of the hard drives in the servers and desktops I manage, and allows me to predict when disk drives are about to fail (this of course assumes that declining attributes are good indicators). Since I am using FreeBSD 6.0 on my personal laptop, I wanted to get smartmontools and smartd working. This was easily accomplished by first installing the smartmontools package with the pkg_add utility:

$ pkg_add -r smartmontools

And secondly by adding a smartd_enable line to /etc/rc.conf to start the SMART daemon (smartd) at system boot time:

$ grep smartd /etc/rc.conf
smartd_enable="YES"

I really dig the FreeBSD ports tree.
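
If you manage more than one FreeBSD box, the grep check above is easy to wrap in a function. This is just a sketch under a couple of assumptions: smartd_enabled is a made-up name, and it only recognizes the value YES (in any letter case, with or without quotes), even though rc.conf accepts other spellings of "true".

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the rc.conf-style file
# named by $1 contains an uncommented smartd_enable line set to YES.
# Simplification: other values rc.conf treats as true are not handled.
smartd_enabled() {
    grep -Eq '^[[:space:]]*smartd_enable="?[Yy][Ee][Ss]"?' "$1"
}
```

A quick way to use it would be: smartd_enabled /etc/rc.conf && echo "smartd will start at boot".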