Using the ZFS scrub feature to verify the integrity of your storage

A number of articles over the past few years have described how silent data corruption can occur due to faulty hardware, solar flares, and software defects. I’ve seen some oddities in the past that would probably fall into these categories, but without sufficient time to dig deep it’s impossible to know for sure.

With ZFS this is no longer the case. ZFS checksums every block of data that is written to disk, and verifies that checksum when the data is read back into memory. If the checksums don’t match we know the data was changed by something other than ZFS (assuming a ZFS bug isn’t the culprit), and if the pool has ZFS-level redundancy (a mirror or RAID-Z) the bad block will be automatically repaired from a good copy.
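To make the checksum-on-write, verify-on-read idea concrete, here is a rough sketch at the file level. It is only an illustration: ZFS does this per block inside the file system, not with cksum or sidecar files, and the helper names store_checksum and verify_checksum are made up for this example.

```shell
#!/bin/sh
# Conceptual sketch only: ZFS checksums each block internally; it does not
# use cksum or sidecar ".sum" files. The helper names here are hypothetical.

# "Write" path: record a checksum alongside the data.
store_checksum() {
    cksum < "$1" > "$1.sum"
}

# "Read" path: recompute the checksum and compare it with the stored one.
# Returns 0 when they match and non-zero when the data has changed.
verify_checksum() {
    [ "$(cksum < "$1")" = "$(cat "$1.sum")" ]
}
```

If something alters the file after store_checksum runs, verify_checksum returns non-zero. That is the same signal ZFS uses to decide that a block needs to be repaired from a redundant copy.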

But what if you have a lot of data on disk that isn’t read often? Well, there is a solution. ZFS provides a scrub option that reads back all of the data in the pool and validates that it still matches the stored checksums. This feature can be accessed by running the zpool utility with the “scrub” option and the name of the pool to scrub:

$ zpool scrub rpool

To view the status of the scrub you can run the zpool utility with the “status” option:

$ zpool status

  pool: rpool
 state: ONLINE
 scrub: scrub in progress for 0h0m, 3.81% done, 0h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors
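If you want to watch progress from a script rather than by eye, something like the following sketch can pull the scrub line out of the status output. The function names are hypothetical, and it assumes the line is labeled “scrub:” as shown above (some newer ZFS releases label it “scan:” instead, so the sketch accepts both).

```shell
#!/bin/sh
# Sketch: extract the scrub status from `zpool status` text read on stdin.
# Assumes the line starts with "scrub:" (as above) or "scan:" (newer releases).

# Print the scrub status text without its label.
scrub_line() {
    sed -n -e 's/^ *scrub: *//p' -e 's/^ *scan: *//p'
}

# Succeed only while the status text says a scrub is still running.
scrub_in_progress() {
    scrub_line | grep -q 'in progress'
}

# Example use against a real pool (commented out so the sketch has no side effects):
# while zpool status rpool | scrub_in_progress; do sleep 60; done
```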

The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I’ve yet to see them work effectively), so you definitely want to run it when your system isn’t busy servicing your customers. If you kick off a scrub and determine that it needs to be halted, you can add a “-s” option (stop scrubbing) to the zpool scrub command line:

$ zpool scrub -s rpool

You can confirm the scrub stopped by running zpool status again:

$ zpool status

  pool: rpool
 state: ONLINE
 scrub: scrub stopped after 0h0m with 0 errors on Sat Oct 15 08:28:36 2011
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors
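For unattended checks, the two interesting bits of that output are the error count on the scrub line and the final “errors:” line. Here is a small sketch that reads both from status text on stdin. The function names are hypothetical and the patterns assume the wording shown above, which can differ between ZFS releases.

```shell
#!/bin/sh
# Sketch: read `zpool status` text on stdin and report the scrub result.
# The patterns assume the output wording shown above.

# Print the error count from a line such as
# "scrub: scrub stopped after 0h0m with 0 errors on ...".
scrub_error_count() {
    sed -n 's/^ *sc[a-z]*: .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Succeed only if the pool reports no known data errors.
no_known_errors() {
    grep -q '^errors: No known data errors'
}

# Example use against a real pool (commented out):
# zpool status rpool | no_known_errors || echo "rpool needs attention"
```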

This is pretty darn useful, and something I wish every file system had. fsck sucks, and being able to periodically check the consistency of your file system while it’s online is rad (for some reason I always want to watch Point Break after saying rad).
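Since scrubs are best run periodically and off-hours, one way to automate them is a small wrapper kicked off from cron. The sketch below is just that, a sketch: the function name, the script path, and the Sunday 02:00 schedule are all assumptions, and the ZPOOL variable exists only so the zpool binary can be swapped out for testing.

```shell
#!/bin/sh
# Sketch: start a scrub unless one is already running on the pool.
# The pool name, script path, and schedule below are assumptions.
ZPOOL=${ZPOOL:-zpool}

start_scrub_if_idle() {
    pool=$1
    # Skip the scrub if the status output says one is already in progress.
    if "$ZPOOL" status "$pool" | grep -q 'scrub in progress'; then
        echo "scrub already running on $pool"
        return 0
    fi
    "$ZPOOL" scrub "$pool"
}

# Example crontab entry (02:00 every Sunday), assuming this file is saved as
# /usr/local/bin/scrub-if-idle.sh and ends with the line: start_scrub_if_idle rpool
# 0 2 * * 0 /usr/local/bin/scrub-if-idle.sh
```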

4 thoughts on “Using the ZFS scrub feature to verify the integrity of your storage”

  1. With regards to a scrub consuming all available I/O, there was a thread on the zfs-discuss mailing list earlier this year (http://mail.opensolaris.org/pipermail/zfs-discuss/2011-May/thread.html#48332) which talked about some of the Solaris ZFS tuneables you can fiddle with to tweak the zpool scrub performance.

    In particular, the “zfs:zfs_scrub_delay” tuneable (I believe) defines the number of clock ticks to delay scrub I/O when there’s other I/O in the pool. If you set this to 0, then the scrub will go to town regardless of activity on the pool. For example, I have the following line in my /etc/system file to set this tuneable:
    set zfs:zfs_scrub_delay = 0x1

    There’s a bunch of great ZFS tuning tips in that thread (linked to above)…if you’re running ZFS on some Solaris-variant.
