Checking devices for bad sectors


I recently had a friend contact me because he was getting an error similar to the following in his Redhat Linux system log (I didn’t save the error while debugging the problem, so I grabbed this one from the web):

kernel: disk I/O error: dev 08:01, sector 25590410
kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002

At first glance, I thought the disk drive had failed, and told him to back up all of his data to safe media. Once the data was backed up, I decided to run a full SMART self test on the disk drive to check the drives health:

$ smartctl -t long /dev/hda

smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 84 minutes for test to complete.
Test will complete after Sun Aug 27 19:41:01 2006

Use smartctl -X to abort test.

The SMART long test completed successfully, but dd was failing when attempting to read sector 25590410 (we weren’t using the continue on error option). Since all modern disk drive controllers contain logic to remap faulty sectors when they are detected, and the number of reallocated sectors as reported by smartctl was well below the manufacturers failure threshold, I wondered if the sector was “stuck.” To test my theory, I booted from a Linux CD, and ran the Linux badblocks utility on the disk partition (I didn’t save the badblocks output from his drive, so the following is a sample from another machine):

$ badblocks -sv /dev/hda

Checking blocks 0 to 8192016
Checking for bad blocks (read-only test): 222400/ 8192016

Badblocks completed successfully, and an fsck of the file system reported that the file system was clean (We also used the ext3 file system debugger to see if a file was using the block. It wasn’t, so my theory is that the errors occurred when a new file was being created). Next we rebooted the system, and the number of reallocated sectors reported by smartmontools had increased by one. This completely surprised me, and I am still confused why the disk controller didn’t remap the sector when we were booted from the disk drive. I had fun debugging this problem, and learning about how IDE disk drives work.

This article was posted by Matty on 2006-08-27 18:32:00 -0400 -0400