If you have used PCs over the course of your career, I am sure you are well aware of the dreaded “click of death” that occurs when a disk drive fails. These failures can be devastating, and can cost companies and individuals thousands of dollars and a good deal of stress. I was curious to see if any solutions were available to address this problem, and came across Self-Monitoring, Analysis, and Reporting Technology (also referred to as SMART or S.M.A.R.T.) while researching it.
SMART is a method through which devices monitor, store and analyze information on their operational state. This state information is exported through a set of attributes (e.g., temperature, number of reallocated sectors, seek errors), which software solutions can use to measure the health of a device, predict when a device may fail, and provide notifications when attributes are approaching unsafe values.
One software solution that allows you to monitor and manage SMART devices is smartmontools. smartmontools supports a wide variety of hardware and operating systems (e.g., FreeBSD, Linux, OpenBSD, OS X, OS/2, Solaris, Windows), comes with a complete set of documentation, and includes a command line utility (smartctl(1m)) and UNIX daemon (smartd(1m)) that can be used to view SMART attributes, run device self-tests, and notify support personnel when problems are detected.
To get started with smartmontools, the source code can be downloaded from SourceForge, and the typical “configure,” “make,” and “make install” process can be used to compile and install the software in the default location:
gtar xfvz smartmontools-5.36.tar.gz
cd smartmontools-5.36
./configure
make
sudo make install
If you are using OpenBSD or FreeBSD, you can install smartmontools from the ports collection by executing “make install” in the smartmontools ports directory.
Once the binaries are compiled and installed, the smartctl(1m) utility can be invoked with the “-h” (help) option to print the available options:
smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====
  -h, --help, --usage
         Display this help and exit
  -V, --version, --copyright, --license
         Print license, copyright, and version information and exit
  -i, --info
         Show identity information for device
  -a, --all
         Show all SMART information for device

================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
  -q TYPE, --quietmode=TYPE                                           (ATA)
         Set smartctl quiet mode to one of: errorsonly, silent
  -d TYPE, --device=TYPE
         Specify device type to one of: ata, scsi, marvell, 3ware,N
  -T TYPE, --tolerance=TYPE                                           (ATA)
         Tolerance: normal, conservative, permissive, verypermissive
  -b TYPE, --badsum=TYPE                                              (ATA)
         Set action on bad checksum to one of: warn, exit, ignore
  -r TYPE, --report=TYPE
         Report transactions (see man page)

============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
  -s VALUE, --smart=VALUE
         Enable/disable SMART on device (on/off)
  -o VALUE, --offlineauto=VALUE                                       (ATA)
         Enable/disable automatic offline testing on device (on/off)
  -S VALUE, --saveauto=VALUE                                          (ATA)
         Enable/disable Attribute autosave on device (on/off)

======================================= READ AND DISPLAY DATA OPTIONS =====
  -H, --health
         Show device SMART health status
  -c, --capabilities                                                  (ATA)
         Show device SMART capabilities
  -A, --attributes
         Show device SMART vendor-specific Attributes and values
  -l TYPE, --log=TYPE
         Show device log. TYPE: error, selftest, selective, directory
  -v N,OPTION , --vendorattribute=N,OPTION                            (ATA)
         Set display OPTION for vendor Attribute N (see man page)
  -F TYPE, --firmwarebug=TYPE                                         (ATA)
         Use firmware bug workaround: none, samsung, samsung2
  -P TYPE, --presets=TYPE                                             (ATA)
         Drive-specific presets: use, ignore, show, showall

============================================ DEVICE SELF-TEST OPTIONS =====
  -t TEST, --test=TEST
         Run test. TEST is: offline short long conveyance select,M-N
                            pending,N afterselect,on afterselect,off
  -C, --captive
         Do test in captive mode (along with -t)
  -X, --abort
         Abort any non-captive test on device

=================================================== SMARTCTL EXAMPLES =====
  smartctl -a /dev/rdsk/c0t0d0s0
         (Prints all SMART information)
  smartctl --smart=on --offlineauto=on --saveauto=on /dev/rdsk/c0t0d0s0
         (Enables SMART on first disk)
  smartctl -t long /dev/rdsk/c0t0d0s0
         (Executes extended disk self-test)
  smartctl --attributes --log=selftest --quietmode=errorsonly /dev/rdsk/c0t0d0s0
         (Prints Self-Test & Attribute errors)
Once you review the available options, smartctl(1m) can be invoked with the “-i” (show device identity) option to see if a device supports SMART:
smartctl -i /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3120023A
Serial Number:    3KA192MF
Firmware Version: 3.33
User Capacity:    120,034,123,776 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri May 27 10:34:53 2005 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

If SMART is supported by the device, it will be indicated as shown above. If smartctl(1m) indicates that SMART is not enabled, the smartctl(1m) “-s on” (enable/disable SMART on device) option can be used to enable SMART on the device passed as an argument:
smartctl -s on /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

SMART compliant devices support a set of capabilities. These capabilities indicate which SMART features are supported by the device, and include items such as offline surface scanning support, error logging support, and the ability to perform offline self-tests. To see which capabilities are supported on a device, smartctl(1m) can be executed with the “-c” (show capabilities) option:
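If you want a script to decide whether “-s on” is needed, the “SMART support is:” lines in the “-i” output can be parsed. The following is only a minimal sketch: the function name smart_state is my own invention, and it assumes the wording shown in the sample output above (the “Disabled” wording is assumed by analogy with “Enabled”).

```shell
#!/bin/sh
# smart_state (hypothetical helper): read `smartctl -i` output on stdin
# and print "enabled", "disabled", or "unknown".
# Assumption: the status lines are worded like the sample output in this
# article; "Disabled" is assumed by analogy with "Enabled".
smart_state() {
    awk '
        /^SMART support is: *Enabled/  { state = "enabled" }
        /^SMART support is: *Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Example use (device path taken from the article):
#   if [ "$(smartctl -i /dev/rdsk/c0t0d0s0 | smart_state)" = "disabled" ]; then
#       smartctl -s on /dev/rdsk/c0t0d0s0
#   fi
```

Printing “unknown” instead of guessing keeps the script from enabling SMART on a device whose output it did not understand.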
smartctl -c /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 426) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  84) minutes.

Once SMART is enabled and the capabilities have been reviewed, the smartctl(1m) utility can be executed with the “-H” (health status) option to retrieve a device’s overall SMART health status:
smartctl -H /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
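The health check is easy to script by looking for the PASSED line shown above. This is a rough sketch rather than a recommended method: health_passed is a name I made up, and the match relies on the exact wording in the sample output.

```shell
#!/bin/sh
# health_passed (hypothetical helper): succeed (exit status 0) only if the
# text on stdin contains the PASSED result line shown in this article.
health_passed() {
    grep -q 'SMART overall-health self-assessment test result: PASSED'
}

# Example use (device path taken from the article):
#   smartctl -H /dev/rdsk/c0t0d0s0 | health_passed ||
#       echo "SMART health check did not report PASSED" >&2
```

Because anything other than the PASSED line is treated as a problem, a device that cannot be opened at all is also reported, which is usually what you want from a monitoring script.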
In addition to viewing overall drive health, smartctl(1m) allows you to view SMART attributes. These attributes contain information such as operating temperature, reallocated sectors, seek errors, CRC errors, etc. SMART attributes are invaluable for locating environment problems and faulty devices. To view SMART attributes, smartctl(1m) can be invoked with the “-A” (Show device SMART vendor-specific Attributes and values) option, and the device to retrieve the attribute values from:
smartctl -A /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   056   051   006    Pre-fail  Always       -       207936224
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   062   030    Pre-fail  Always       -       59492653
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       19215
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       73
194 Temperature_Celsius     0x0022   043   054   000    Old_age   Always       -       43
195 Hardware_ECC_Recovered  0x001a   056   051   000    Old_age   Always       -       207936224
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

If an attribute’s VALUE begins to approach the value defined in the THRESH column, this may be an indication that the device is reaching the end of its useful life. If you suspect that a device may be close to failure (e.g., you hear clicking), but the attributes are still above the values defined in the THRESH column, a SMART self-test can be performed on the drive. This will cause the device to update the drive’s SMART attributes, and log any errors it finds to the device’s self-test log. To run a self-test, smartctl(1m) can be invoked with the “-t” (test) option, a test to run, and a device to test:
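The VALUE and THRESH columns above can be compared mechanically. The sketch below prints the headroom (VALUE minus THRESH) for every attribute that has a nonzero threshold. The function name attr_headroom and the AT_OR_BELOW_THRESH label are my own, and the parsing assumes the ten-column layout shown in the sample output.

```shell
#!/bin/sh
# attr_headroom (hypothetical helper): read `smartctl -A` output on stdin
# and print "NAME VALUE THRESH HEADROOM STATUS" for each attribute row
# that has a nonzero THRESH.
# Assumption: rows use the ten-column layout shown in this article
# (ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE).
attr_headroom() {
    awk '
        # Attribute rows start with a numeric ID; the header row does not.
        $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 {
            status = ($4 + 0 <= $6 + 0) ? "AT_OR_BELOW_THRESH" : "ok"
            print $2, $4 + 0, $6 + 0, $4 - $6, status
        }
    '
}

# Example use (device path taken from the article):
#   smartctl -A /dev/rdsk/c0t0d0s0 | attr_headroom
```

On the sample data, Spin_Retry_Count shows a headroom of only 3 on a healthy drive, because its threshold (097) sits just under the maximum value (100). For that reason I would treat a shrinking headroom over time as the warning sign, not a small headroom by itself.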
smartctl -t offline /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 426 seconds for test to complete.
Test will complete after Wed Sep 28 20:07:26 2005

Use smartctl -X to abort test.

To retrieve the results of the offline self-test, smartctl(1m) can be invoked with the “-l” (Show device log) option, and the log type (e.g., SMART error log, SMART selective self-test log, SMART self-test log, or the log directory) to view:
smartctl -l selftest /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       929         -
This test indicates that the self-test completed without error. To get additional detail on the work that occurs behind the scenes with smartctl(1m), the “-r ioctl” option can be appended to the smartctl(1m) command line:
smartctl -r ioctl -i /dev/rdsk/c0t0d0s0
smartctl version 5.36 [sparc-sun-solaris2.9] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE
REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE returned 0
=== START OF INFORMATION SECTION ===
Device Model:     ST320414A
Serial Number:    3EC1CNGY
Firmware Version: 3.28
User Capacity:    20,404,101,120 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Mar 3 14:07:58 2005 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS returned 0

The smartctl(1m) examples up to this point have provided invaluable status information, but the examples required input from the keyboard. To set up automated alerts when devices fail or a SMART attribute changes, a cron job can be developed to check smartctl(1m) self-test logs, or the smartd(1m) daemon can be configured to analyze devices and report anomalies when they are detected.
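As a rough idea of what the cron approach could look like, the sketch below filters the self-test log for entries that did not complete cleanly. Everything here is an assumption on my part: the function name selftest_problems, the script path in the crontab comment, and the use of mailx for delivery are placeholders, and the filter also reports tests that were aborted or are still in progress.

```shell
#!/bin/sh
# selftest_problems (hypothetical helper): read `smartctl -l selftest`
# output on stdin and print every log entry whose status is anything
# other than "Completed without error". Exits 0 only if it printed
# at least one entry.
# Assumption: log entries start with "# <number>" as in the sample
# output in this article.
selftest_problems() {
    grep '^# *[0-9]' | grep -v 'Completed without error'
}

# Example use from a cron-driven script (path, schedule, recipient, and
# the mailx command are placeholders, not part of smartmontools):
#   0 6 * * * /usr/local/bin/check-selftest.sh /dev/rdsk/c0t0d0s0
#
#   problems=$(smartctl -l selftest "$1" | selftest_problems) &&
#       printf '%s\n' "$problems" | mailx -s "SMART self-test problem on $1" root
```

For anything beyond a quick check, smartd(1m) is the better fit, since it is designed to monitor devices and send notifications without extra scripting.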
This article has only scratched the surface of what smartmontools can do, and I will refer you to the manual pages and documentation for further details. I use smartmontools on my servers and laptop to notify me when disk drives are about to fail. This ensures that I have time to back up my data before a disk drive fails, and saves me from having to purchase large quantities of Advil to deal with unexpected drive failure! As with all software, you should read the FAQ and documentation prior to using the software, and perform all testing on a test system. If you have questions or comments on the article, please feel free to E-mail the author.
Ryan would like to thank the smartmontools developers for their awesome work!