Checking devices for bad sectors

A friend recently contacted me because he was seeing an error similar to the following in his Red Hat Linux system log (I didn’t save the error while debugging the problem, so I grabbed this one from the web):

kernel: disk I/O error: dev 08:01, sector 25590410
kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002

At first glance, I thought the disk drive had failed, and told him to back up all of his data to safe media. Once the data was backed up, I decided to run a full SMART self-test on the disk drive to check the drive’s health:

$ smartctl -t long /dev/hda

smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 84 minutes for test to complete.
Test will complete after Sun Aug 27 19:41:01 2006

Use smartctl -X to abort test.
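
Once the test finishes, the self-test log and the reallocated sector count can be read back with smartctl. The following is a minimal sketch, not output from my friend’s machine: it assumes the drive reports the common attribute named Reallocated_Sector_Ct and that smartctl prints its usual ten-column attribute table. The helper name reallocated_count is made up for illustration.

```shell
#!/bin/sh
# Print the raw Reallocated_Sector_Ct value from `smartctl -A` output on stdin.
# Assumes the usual attribute table layout, where the raw value is column 10.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage on a live system (needs root; /dev/hda is the device used in this post):
#   smartctl -l selftest /dev/hda            # self-test log, including the long test
#   smartctl -A /dev/hda | reallocated_count # raw count of reallocated sectors
```

Running the second command before and after an event such as a reboot makes it easy to spot whether the count changed.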

The SMART long test completed successfully, but dd was failing when attempting to read sector 25590410 (we weren’t using dd’s continue-on-error option, conv=noerror). Since all modern disk drive controllers contain logic to remap faulty sectors when they are detected, and the number of reallocated sectors reported by smartctl was well below the manufacturer’s failure threshold, I wondered if the sector was “stuck.” To test my theory, I booted from a Linux CD and ran the Linux badblocks utility on the disk partition (I didn’t save the badblocks output from his drive, so the following is a sample from another machine):

$ badblocks -sv /dev/hda

Checking blocks 0 to 8192016
Checking for bad blocks (read-only test):    222400/  8192016

Badblocks completed successfully, and an fsck of the file system reported that the file system was clean. (We also used the ext3 file system debugger, debugfs, to see if a file was using the block. It wasn’t, so my theory is that the errors occurred when a new file was being created.) Next we rebooted the system, and the number of reallocated sectors reported by smartmontools had increased by one. This completely surprised me, and I am still confused about why the disk controller didn’t remap the sector while we were booted from the disk drive. I had fun debugging this problem and learning how IDE disk drives work.
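
The check for whether a file owns the failing block can be sketched as follows. This is a rough sketch and not necessarily the exact commands used here: it assumes the sector number in the kernel message is counted from the start of the disk, and the partition start (63) and file system block size (4096) are placeholder values that would need to be read from fdisk -lu and tune2fs -l on the real system. The helper name sector_to_fs_block and the partition /dev/hda1 are likewise assumptions.

```shell
#!/bin/sh
# Convert a 512-byte disk sector number into a file system block number.
#   $1 = failing sector (assumed to be counted from the start of the disk)
#   $2 = first sector of the partition holding the file system
#   $3 = file system block size in bytes
sector_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Placeholder values for illustration only.
block=$(sector_to_fs_block 25590410 63 4096)
echo "$block"

# On a live system (as root, with /dev/hda1 assumed to be the ext3 partition):
#   debugfs -R "icheck $block" /dev/hda1   # which inode, if any, owns the block
#   debugfs -R "ncheck INODE" /dev/hda1    # map that inode number to a path name
```

If icheck reports that no inode owns the block, the block is currently free, which matches what we saw.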

1 thought on “Checking devices for bad sectors”

  1. The remapping only happens if the disk manages to read the sector after all (perhaps after a few retries), or when you
    overwrite the sector. This way the disk firmware ensures that no data is lost: either the old contents are recovered, or
    they are replaced with new contents, which are written to the new sector while the faulty one is taken out of service.
    This is done at a low level. The firmware doesn’t know whether a sector is in use or not, so it guards the sector even
    though the contents might be irrelevant to you.

    Since your bad sector wasn’t in use by the ext3 file system, I am guessing that at some point it was used and
    overwritten, triggering the replacement. Alternatively, maybe the firmware managed to error-correct while reading; after all, you switched the machine off and on in the meantime.

    Note that sector reallocation has been shown to be a predictor of more failures: the Google disk failure paper
    http://labs.google.com/papers/disk_failures.pdf claims a 30% chance that a disk with at least one reallocated sector will fail within a year. Have backups ready.
