A number of articles written over the past few years describe how silent data corruption can occur due to faulty hardware, solar flares, and software defects. I’ve seen some oddities in the past that would probably fall into these categories, but without sufficient time to dig deep it’s impossible to know for sure.
With ZFS this is no longer the case. ZFS checksums every block of data that is written to disk, and compares this checksum when the data is read back into memory. If the checksums don’t match, we know the data was changed by something other than ZFS (assuming a ZFS bug isn’t the culprit). And if we are using ZFS to RAID protect the storage (a mirror or RAID-Z), the issue will be automatically fixed for us from a good copy of the block.
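To make the idea concrete, here is a rough sketch of "checksum on write, verify on read, repair from the other copy". It is only an illustration: plain files stand in for blocks on two sides of a mirror, sha256sum (GNU coreutils, assumed to be installed) stands in for the ZFS block checksum, and the function names write_checksum, verify_block and read_with_repair are made up for this example. ZFS does all of this internally and nothing below touches a real pool.

```shell
#!/bin/sh
# Conceptual sketch only. Files play the role of blocks, and sha256sum
# plays the role of the ZFS checksum. All function names are hypothetical.

# Record the checksum of a "block" ($1) at write time in "$1.sum".
write_checksum() {
    sha256sum < "$1" | cut -d' ' -f1 > "$1.sum"
}

# Return 0 if the block ($1) still matches the checksum stored in file $2.
verify_block() {
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(cat "$2")" ]
}

# "Read" a block ($1). If it fails verification, try to repair it from
# the mirror copy ($2). $3 is the file holding the stored checksum.
read_with_repair() {
    if verify_block "$1" "$3"; then
        echo "ok"
    elif verify_block "$2" "$3"; then
        cp "$2" "$1"          # overwrite the bad copy with the good one
        echo "repaired"
    else
        echo "unrecoverable"  # both copies fail the checksum
        return 1
    fi
}
```

Without the second copy the best the sketch can do is report the damage, which mirrors what happens on a pool with no redundancy: the corruption is detected but cannot be repaired.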
But what if you have a lot of data on disk that isn’t read often? Corruption in those blocks would go unnoticed until the next read. Well, there is a solution. ZFS provides a scrub operation that reads back all of the data in the pool and validates that each block still matches its stored checksum. This feature can be accessed by running the zpool utility with the “scrub” subcommand and the name of the pool to scrub:
$ zpool scrub rpool
To view the status of the scrub you can run the zpool utility with the “status” subcommand:
$ zpool status
  pool: rpool
 state: ONLINE
 scrub: scrub in progress for 0h0m, 3.81% done, 0h18m to go
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0
errors: No known data errors
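If you want to keep an eye on progress from a script, the percentage can be pulled out of that “scrub:” line with sed. This is a minimal sketch that assumes the exact status format shown above; other ZFS releases word and lay out this line differently, so the pattern would need adjusting there. The function name scrub_percent is made up for this example.

```shell
#!/bin/sh
# Print the "% done" figure from zpool status output read on stdin.
# Prints nothing if no scrub is in progress.
# Assumes the "scrub: scrub in progress for ..., N% done, ..." format.
scrub_percent() {
    sed -n 's/.*scrub in progress.* \([0-9.][0-9.]*\)% done.*/\1/p'
}
```

It would be used as `zpool status rpool | scrub_percent`, which for the output above prints 3.81.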
The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I’ve yet to see them work effectively), so you definitely want to run it when your system isn’t busy servicing your customers. If you kick off a scrub and determine that it needs to be halted, you can add the “-s” option (stop scrubbing) to the zpool scrub command line:
$ zpool scrub -s rpool
You can confirm the scrub stopped by running zpool status again:
$ zpool status
  pool: rpool
 state: ONLINE
 scrub: scrub stopped after 0h0m with 0 errors on Sat Oct 15 08:28:36 2011
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0
errors: No known data errors
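The same check can be scripted, which is handy if something else needs to know whether a scrub is still hammering the disks. The sketch below only recognizes the two “scrub:” lines shown in this post (in progress and stopped) and reports anything else as unknown; the function name scrub_state is made up for this example.

```shell
#!/bin/sh
# Classify the scrub state from zpool status output read on stdin.
# Only the two message formats shown in this post are recognized.
scrub_state() {
    status=$(cat)
    case $status in
        *"scrub in progress"*) echo "running" ;;
        *"scrub stopped"*)     echo "stopped" ;;
        *)                     echo "unknown" ;;
    esac
}
```

For example, `zpool status rpool | scrub_state` would print stopped for the output above.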
This is pretty darn useful, and something I wish every file system had. fsck sucks, and being able to periodically check the consistency of your file system while it’s online is rad (for some reason I always want to watch Point Break after saying rad).