Speeding up md device rebuilds


I manage a large RAID5 array at home, and one of my disks crapped out over the weekend. Once I physically replaced the drive and told mdadm to reconstruct the array, I noticed that the rebuild was going to take days to complete. After a bit of digging, I found that the md recovery process throttles itself so that the rebuild doesn't consume all of the available I/O. I was most concerned about getting the RAID array back into a consistent state, so I decided to play around with the speed_limit_max setting to speed up the recovery. The speed_limit_max setting caps the rebuild rate for each device in the RAID array. The value is in kibibytes per second, not bytes, and bumping it up to a large value definitely lowered the reconstruction time:

$ echo 400000 > /proc/sys/dev/raid/speed_limit_max
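Since the change is a plain write to a file under /proc, it is easy to wrap so that the old value isn't lost. Here is a minimal sketch of that idea. The function name raise_md_limit and the LIMIT_FILE override are my own inventions for illustration, not part of mdadm or the kernel, and the write itself needs root:

```shell
#!/bin/sh
# Sketch: raise the md rebuild ceiling and remember the previous value.
# LIMIT_FILE is a hypothetical override so the function can be tried
# against an ordinary file instead of the real /proc entry.
LIMIT_FILE="${LIMIT_FILE:-/proc/sys/dev/raid/speed_limit_max}"

# raise_md_limit NEW_LIMIT
# NEW_LIMIT is in KiB/sec per device. Prints the previous limit on
# stdout so the caller can put it back once the rebuild is done.
raise_md_limit() {
    new_limit=$1
    old_limit=$(cat "$LIMIT_FILE") || return 1
    echo "$new_limit" > "$LIMIT_FILE" || return 1
    echo "$old_limit"
}
```

Used as old=$(raise_md_limit 400000), the previous ceiling ends up in $old and can be echoed back into the same file after the array is healthy again.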

The rebuild went from days down to hours:

$ watch cat /proc/mdstat

Every 2.0s: cat mdstat Fri Mar 20 21:14:45 2009

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdf1[5] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      976751616 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [=========>...........] recovery = 49.6% (121176712/244187904) finish=68.6min speed=29873K/sec

unused devices: <none>
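The finish estimate in that output can be rechecked from the other numbers on the recovery line: blocks remaining divided by the current speed. The sketch below does that with awk. It assumes the block counts are 1 KiB units and the speed is in KiB/sec, which is consistent with the sample above, and that the line layout matches what my kernel printed (other kernel versions may format it differently). The function name recovery_eta is made up for this example:

```shell
#!/bin/sh
# recovery_eta: read mdstat-style text on stdin and, for every line that
# reports a recovery, print the percentage done and the minutes left.
# Assumes fields shaped like "(done/total)" and "speed=NNNNK/sec".
recovery_eta() {
    awk '
    /recovery *=/ {
        speed = 0
        for (i = 1; i <= NF; i++) {
            # "(121176712/244187904)" -> blocks[1] = done, blocks[2] = total
            if ($i ~ /^\([0-9]+\/[0-9]+\)$/) {
                gsub(/[()]/, "", $i)
                split($i, blocks, "/")
            }
            # "speed=29873K/sec" -> 29873
            if ($i ~ /^speed=/) {
                speed = $i
                sub(/^speed=/, "", speed)
                sub(/K\/sec$/, "", speed)
                speed = speed + 0
            }
        }
        if (speed > 0 && blocks[2] > 0) {
            remaining = blocks[2] - blocks[1]
            printf "%.1f%% done, %.1f min left\n",
                100 * blocks[1] / blocks[2], remaining / speed / 60
        }
    }'
}
```

Feeding it the recovery line above (for example with recovery_eta < /proc/mdstat while a rebuild is running) gives "49.6% done, 68.6 min left", which agrees with the finish=68.6min figure that md reported.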

I really dig MD, and the fact that you can now expand RAID devices on the fly (previously you had to layer LVM on top of MD to do this) is extremely cool! To ensure that my host automatically recovers in the future, I am in the process of adding a hot spare. This, combined with a cron job to increase speed_limit_max during off hours, seems like a great fit!
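For the cron piece, something along these lines is what I have in mind. Treat it as a rough sketch under assumptions: the script path, the 23:00 to 06:00 window, and the daytime value of 200000 are all placeholders I picked for illustration, and the device name in the hot spare comment is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper, e.g. saved as /usr/local/sbin/md-speed-limit and
# run hourly from root's crontab:
#   0 * * * * /usr/local/sbin/md-speed-limit --apply
#
# (The hot spare itself would be added separately with something like
#  "mdadm /dev/md0 --add /dev/sdg1", where /dev/sdg1 is a made-up name.)
LIMIT_FILE="${LIMIT_FILE:-/proc/sys/dev/raid/speed_limit_max}"
OFF_HOURS_LIMIT=400000   # KiB/sec per device, the value used in this post
DAYTIME_LIMIT=200000     # assumed daytime ceiling, adjust to taste

# pick_md_limit HOUR: print the limit to use for HOUR (00-23).
pick_md_limit() {
    hour=${1#0}          # drop a leading zero, so "08" becomes "8"
    hour=${hour:-0}      # "0" loses its only digit above, so restore it
    if [ "$hour" -ge 23 ] || [ "$hour" -lt 6 ]; then
        echo "$OFF_HOURS_LIMIT"
    else
        echo "$DAYTIME_LIMIT"
    fi
}

# apply_md_limit [HOUR]: write the limit for HOUR (default: current hour).
apply_md_limit() {
    pick_md_limit "${1:-$(date +%H)}" > "$LIMIT_FILE"
}

# Only touch the real setting when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
    apply_md_limit
fi
```

Running it every hour rather than only at the window edges means the setting gets corrected after a reboot as well, since speed_limit_max does not persist across restarts on its own.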

This article was posted by Matty on 2009-03-22 13:17:00 -0400