Systems administrators managing a data center face numerous challenges in achieving the required availability and uptime. Two of the main challenges are shrinking budgets (for hardware, software, and staffing) and short deadlines in which to deliver solutions. The Linux community has developed kernel support for software RAID (Redundant Array of Inexpensive Disks) to help meet those challenges. Software RAID, properly implemented, can prevent system downtime caused by disk drive failures. The source code to the Linux kernel, the RAID modules, and the raidtools package are available at minimal cost under the GNU General Public License. The interface is well documented and comprehensible to a moderately experienced Linux systems administrator.
In this article, I’ll provide an overview of the software RAID implementation in the Linux 2.4.X kernel. I will describe the creation and activation of software RAID devices as well as the management of active RAID devices. Finally, I will discuss some procedures for recovering from a failed disk unit.
RAID is a set of algorithms for writing data blocks to disk devices. Each RAID mode, or level, specifies the layout of data blocks on multiple disks. Each RAID mode provides an enhancement in one aspect of data management: redundancy or reliability, read or write performance, or logical unit capacity. Simple RAID modes are named with a single number: RAID 0, RAID 1, or RAID 5. Complex RAID modes combine multiple simple modes and are named accordingly: RAID 0+1, RAID 1+0.
RAID 0, or striping, writes consecutive data blocks across multiple disk devices. It is used to enhance the read/write performance of large data sets and to increase logical unit capacity beyond the limits of a single disk device. RAID 0 provides no data recovery capability.
RAID 1, or mirroring, is used to provide high reliability and fault tolerance. In RAID 1, each data block is written to multiple disk devices simultaneously. If one disk device were to fail, all data could be recovered from one of the mirror disks. The cost of RAID 1 data sets increases with the number of mirror sets used.
RAID 5, also referred to as striping with parity, distributes data blocks and parity blocks across all devices in a RAID device. Parity calculations are performed for each write operation and are used to regenerate data if a disk failure is detected. Because data is striped across all of the devices, read performance is improved. RAID 5 also makes efficient use of disk storage: only the equivalent of one disk's capacity is consumed by parity, so a RAID device gains additional capacity without losing redundancy.
Complex RAID modes combine the benefits of single RAID levels into the same logical unit. RAID 0+1 is the striping or concatenation of multiple disk devices into a larger logical unit, with additional disk devices allocated to mirror the striped devices. A 45-GB RAID 0+1 logical unit would require ten 9-GB disk devices. Data blocks would be striped or concatenated on five of the disk devices. Simultaneously, each data block would be written to one of the other five disk devices to provide a mirror for the entire logical unit. RAID 1+0 is the use of mirrored disk devices to form larger striped or concatenated logical units. There is no difference between the logical unit presented to the operating system from RAID 0+1 and RAID 1+0 – it is 45 GB either way. The same number of disk devices is required to implement either.
Current Linux kernels support RAID – both hardware and software – for many disk devices and controllers. The distinction between hardware RAID and software RAID is the location of the RAID mode implementation. Hardware RAID solutions require specialized hardware (disk controllers, disk enclosures, and/or drives). Software RAID is implemented by the operating system of the server to which the devices are attached. The trade-off is between price and performance. Most hardware RAID controllers have dedicated processing units and non-volatile cache memory. The controller can acknowledge a write as completed to the operating system immediately and perform the physical writes to disk later, increasing performance and removing processing load from the server. Some RAID hardware devices support replacing physical disk units without taking the server offline.
Software RAID is an emulation of what hardware RAID devices do. There are some disadvantages of software RAID compared to hardware RAID: some disk device write performance is lost; there is additional processing burden on the server; and the hot-swappability of disk units is not available. However, the cost of standard disk controllers and devices is much less than those that support RAID modes in hardware. Often a combination of hardware RAID devices and software RAID will provide a flexible and maintainable solution that fits within the availability and budget constraints of the application.
Hardware RAID controllers allocate storage from a pool of available disks into a logical unit of disk storage, which is presented to the operating system. Most RAID controllers support RAID 0, RAID 1, RAID 5, RAID 0+1, and RAID 1+0. When RAID layouts are implemented in software, the kernel is responsible for managing individual disk units. The RAID drivers keep track of which disk units are assigned to each logical unit and where to read or write the raw data. The RAID logical unit is presented to the operating system as an abstract disk, upon which any type of Linux disk filesystem (e.g., ext2, ext3, reiserfs) can be installed. The filesystem interface is unaware both that the “disk unit” is actually an array of multiple disks and of how the data is laid out among them.
The remainder of this article will deal specifically with the Linux RAID implementation in software. In Linux documentation, the software RAID implementation is also referred to as MD (multiple disk). Many of the commands demonstrated are from the raidtools package that must be installed to manage RAID devices. The mdadm package is also available to create, manage, and monitor MD devices. This tool provides a variety of advanced features, but will not be covered in this article. For additional information on mdadm, please see the references.
A word of caution – the examples in this article are from my test RAID systems. Please study all relevant documentation and plan carefully before adding or changing system parameters. I highly recommend that everything be implemented and tested on a non-production system before making changes to any live systems.
The Linux kernel supports RAID 0, RAID 1, RAID 4, and RAID 5, as well as a linear (append) mode. RAID devices can also be combined to implement RAID 1+0 or RAID 0+1 layouts for additional availability or performance. The kernel also supports the allocation of one or more hot spare disk units per RAID device. A hot spare disk is one that is not used to store data or parity blocks – it is available to the RAID device for recovery if one of the other disks comprising the device fails. When a disk fault is detected, the operating system begins to resynchronize data from the failed disk onto the hot spare disk. The faulty disk drive can be replaced later.
A recent version of the Linux kernel should be used to implement software RAID. All examples in this article are from kernel version 2.4.20. Some Linux distributions include kernels with RAID support precompiled into the kernel and contain the raidtools package, which is required to manage software RAID devices. Recent versions of the Red Hat and SuSE distributions include kernels with built-in RAID support, which can be configured at installation time or once the server is operational.
If you compile a new kernel, you must enable RAID support in your kernel configuration. Include support for all RAID layouts (modes) that will be used:
Multiple devices driver support (RAID and LVM) (CONFIG_MD) [Y/n/?] y
RAID support (CONFIG_BLK_DEV_MD) [M/n/y/?] y
Linear (append) mode (CONFIG_MD_LINEAR) [M/n/y/?] y
RAID-0 (striping) mode (CONFIG_MD_RAID0) [M/n/y/?] y
RAID-1 (mirroring) mode (CONFIG_MD_RAID1) [M/n/y/?] y
RAID-4/RAID-5 mode (CONFIG_MD_RAID5) [M/n/y/?] y
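If the RAID drivers are built as modules ([M]) rather than compiled into the kernel, the corresponding modules must be loaded before any RAID devices can be started. A minimal sketch, assuming a modular 2.4 build in which the module names follow the configuration options shown above:
$ modprobe raid1
$ modprobe raid5
$ lsmod | grep raid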
Once the kernel has been configured with support for software RAID, it must be compiled and installed, and the system should be booted with the new kernel. The dmesg command can verify that software RAID is enabled in the running kernel:
$ dmesg | grep ^md | head -3
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
The presence of md messages indicates that the kernel has loaded the software RAID drivers and is attempting to detect RAID devices attached to the server. During the autodetect process, the kernel reads the partition identification tags of each disk partition available to the system. Any partition that has a partition identification tag of 0xFD is a RAID partition. The operating system will attempt to start each RAID device built from such partitions at autodetect time. The fdisk utility can be used to view the partition tags of any disk device:
$ fdisk -l /dev/hda
Disk /dev/hda: 100.0 GB, 100030242816 bytes
255 heads, 63 sectors/track, 12161 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1       12095    97153056   fd  Linux raid autodetect
/dev/hda2           12096       12160      522112+  fd  Linux raid autodetect
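If a partition is not yet tagged for RAID autodetection, its type can be changed to 0xFD with fdisk's interactive t command. A brief sketch, assuming the partition to tag is /dev/hdc1 (the exact prompts vary slightly between fdisk versions, and only the relevant prompts are shown):
$ fdisk /dev/hdc
Command (m for help): t
Partition number (1-4): 1
Hex code (type L to list codes): fd
Command (m for help): w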
A software RAID device is built from one or more disk partitions (or other disk devices) that are combined in one of the supported basic RAID modes: RAID 0, RAID 1, RAID 4, or RAID 5. The configuration file that assigns partitions and records other details of each RAID device is /etc/raidtab. Each section of /etc/raidtab that defines a software RAID device begins with the keyword raiddev:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
persistent-superblock 1
nr-spare-disks 0
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
The device name is identified in the raiddev directive. The RAID mode is defined in the raid-level directive. The nr-raid-disks directive specifies how many disks (partitions or other disk devices) will comprise the RAID device. The persistent-superblock directive instructs the kernel to write the RAID configuration data at the end of each partition, which allows the kernel to identify the RAID configuration on the system. The nr-spare-disks directive is used to define hot spare disks, described later. Following the general device definitions, each disk device that will comprise the RAID device is specified with a device directive. The device specified can be a disk partition, a disk device, or another RAID device. Each device directive is followed by other directives that describe that device's role. The raid-disk directive indicates the relative position of each disk device within the RAID layout (from 0 to nr-raid-disks - 1). For striping layouts, such as RAID 5, this indicates the column where that disk device will lie. For RAID 1 and RAID 5, it is important to allocate disk units to each raid-disk that have the same physical disk geometry (cylinders, sectors, tracks) and are the same size.
RAID 5 devices are defined with the raid-level directive, and a number 5 to specify the RAID mode. An example three-column RAID device would be:
raiddev /dev/md1
raid-level 5
nr-raid-disks 3
nr-spare-disks 0
persistent-superblock 1
parity-algorithm left-symmetric
chunk-size 32
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
device /dev/hde1
raid-disk 2
The chunk-size directive specifies how many kilobytes will be written as a data block to each column of the RAID device. This is also the size of the parity block that will be calculated for each data block. The optimal value for chunk-size depends on the application and the disk devices. The parity-algorithm directive specifies how parity blocks will be calculated based on the data blocks, and how to organize the location of the parity blocks across columns.
Each software RAID device that is defined in /etc/raidtab must be initialized. The mkraid utility creates the device node, initializes all the devices that comprise the RAID device, and starts the RAID device. Be aware that initialization with mkraid will destroy data that currently exists on any of the devices that are specified in /etc/raidtab for that RAID device:
$ mkraid /dev/md0
The status of RAID device initialization can be monitored via /proc/mdstat. When the devices have been initialized and started, filesystems can be created on the RAID device:
$ mke2fs -j /dev/md0
$ mkdir /data/snowball
$ mount /dev/md0 /data/snowball
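For striped layouts such as the RAID 5 device defined earlier, filesystem performance can often be improved by telling mke2fs about the chunk size. The following is only a sketch, assuming that /dev/md1 has already been initialized with mkraid, that a 4-KB filesystem block size is used, and that the chunk size is the 32 KB defined above (32 KB / 4 KB gives a stride of 8); older e2fsprogs releases accept -R stride=, while newer releases use -E stride=:
$ mke2fs -j -b 4096 -R stride=8 /dev/md1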
After a filesystem has been created on the RAID device, the RAID device can be managed like any other Linux filesystem. To have a filesystem mount upon system boot, add an entry in /etc/fstab:
/dev/md0 /data/snowball ext3 defaults 1 2
Once mounted as a filesystem, the RAID device is used like any other filesystem on the server.
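Recent raidtools releases also include the lsraid utility, which queries a running RAID device and reports its member disks. A quick sketch (the exact output depends on the raidtools version installed):
$ lsraid -a /dev/md0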
Under normal circumstances, software RAID devices are started during the autodetect process. Some events may require that the RAID devices be stopped while the server is still running. Before stopping a software RAID device, the kernel forces any buffered data to be written to the RAID device. After all buffered data has been written, the RAID device is stopped. The data on a stopped RAID device is not accessible to the operating system. The underlying disk devices that comprise the RAID device remain accessible, although they should not be modified or the RAID device will be corrupted. The raidstop utility is used to stop a RAID device. Mounted filesystems on the RAID device must be unmounted before stopping the underlying device:
$ umount /data/snowball
$ raidstop /dev/md0
A stopped software RAID device must be explicitly started before it can be used by the server. The raidstart utility is used to start a RAID device. Once started, the filesystem can be mounted:
$ raidstart /dev/md0
$ mount /dev/md0 /data/snowball
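Both raidstart and raidstop also accept the -a (--all) option, which applies the operation to every RAID device defined in /etc/raidtab. This is convenient in startup and shutdown scripts, assuming that all configured devices should be running:
$ raidstart -a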
The Linux software RAID implementation allows one or more hot spare devices to be assigned to a RAID device. A hot spare device is a disk device that is available to a RAID device to replace one of the component disk devices in case of a disk fault or failure. Spare disks enable the RAID device to continue to operate, and to begin recovery procedures, in real time. When the kernel receives notice of a failed disk device that is part of a RAID device, the RAID device is checked for an available hot spare device. If a spare disk is available, the kernel will logically replace the faulted disk with the spare in the RAID device.
Redundant RAID modes (RAID 1, RAID 4, and RAID 5) support continued operation after the loss of one disk unit in the device because each data block can be resynchronized from the other disks, either from a mirror copy or from a parity calculation. The nr-spare-disks directive in /etc/raidtab indicates how many spare disks are available to the RAID device. A device directive is specified for each spare disk. A spare-disk directive follows the device directive instead of the raid-disk directive. The following excerpt from /etc/raidtab defines the previous example RAID device with one hot spare disk available:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
persistent-superblock 1
nr-spare-disks 1
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
device /dev/hde1
spare-disk 0
The raidhotadd utility will add a hot spare disk to a RAID device that is running. raidhotadd will not modify /etc/raidtab. If a spare disk is added using raidhotadd, the systems administrator must add it to the appropriate RAID device specification in /etc/raidtab before the system reboots:
$ raidhotadd /dev/md0 /dev/hde1
The resynchronization of a hot spare device can be monitored via /proc/mdstat.
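A simple way to follow the resynchronization interactively is to poll /proc/mdstat with the watch utility (the ten-second interval shown here is arbitrary):
$ watch -n 10 cat /proc/mdstat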
The Linux software RAID implementation reports the status of all RAID devices in /proc/mdstat. /proc/mdstat is a pseudo-file that can be read by any Linux utility that manipulates text files (e.g., more, grep). When all RAID devices are started and operating correctly, the following output would be seen:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
97152960 blocks [2/2] [UU]
For each RAID device, the output includes the device status (active), the RAID mode (raid1), the partitions that comprise the device and their order, the total device size, and a status code letter for each active partition. A status code of U indicates that the partition is operational. A status code of _ (underscore) indicates that a partition has had a disk fault. When a partition has faulted, it is removed from the list of active partitions.
The following output would be seen for the RAID device when /dev/hdc1 is operational:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
97152960 blocks [2/2] [UU]
The following output would be seen for the same RAID device when /dev/hdc1 faulted:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hde1[2]
97152960 blocks [2/1] [U_]
The following output would be seen for the same RAID device while /dev/hde1 is being used to resynchronize the data from /dev/hda1:
$ cat /proc/mdstat
md0 : active raid1 hda1[0] hde1[2]
97152960 blocks [2/1] [U_]
[========>............] recovery = 43.9% (42713908/97152960) \
finish=82.4min speed=11002K/sec
The following example script monitors /proc/mdstat for indication of a RAID device failure. If a RAID device fails, a message is logged via syslog and an email message is sent to the on-call pager. The script can be run periodically via cron or modified to run continuously as a daemon on system startup:
#!/bin/bash
# Check /proc/mdstat for a failed software RAID device; a failed
# partition appears as an underscore in the status field (e.g., [U_]).
ADMIN="page-oncall@mydomain.com"
HOSTNAME=$(/bin/hostname)

if egrep "\[.*_.*\]" /proc/mdstat > /dev/null
then
    logger -p daemon.error \
        "mdcheck: Failure of one or more software RAID devices"
    echo "Failure of one or more software RAID devices on ${HOSTNAME}" | \
        /bin/mail -s "$0: Software RAID device failure on ${HOSTNAME}" ${ADMIN}
fi
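As an example of running the script periodically, the following crontab entry runs it every 15 minutes. The path /usr/local/sbin/mdcheck is only a placeholder for wherever the script is actually installed:
*/15 * * * * /usr/local/sbin/mdcheck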
Eventually, a disk drive comprising a software RAID device will fail and require replacement. The failed disk can be determined from the output of /proc/mdstat, and the failed unit must be removed and replaced with a working disk unit. It is important to replace the disk with one that is physically identical to the failed disk. The replacement disk also must be partitioned identically to the failed disk. The raidhotadd command can then be issued to tell the RAID device drivers to activate the replaced disk and begin the resynchronization of data. The progress of resynchronization can be monitored via /proc/mdstat:
$ raidhotadd /dev/md0 /dev/hdc1
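Depending on the raidtools version, the faulted partition can also be detached from the running array with raidhotremove before the physical drive is replaced. A sketch, assuming the faulted partition is /dev/hdc1 (raidhotremove only removes disks that the array already considers failed or spare):
$ raidhotremove /dev/md0 /dev/hdc1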
Linux software RAID provides systems administrators with the means to implement the reliability and performance of RAID without the cost of hardware RAID devices. The kernel supports all basic RAID modes, and complex RAID layouts can be created by using RAID devices themselves as components of other RAID devices. The examples presented were simple RAID 1 and RAID 5 configurations. The judicious allocation of hot spare devices to RAID units will reduce the risk of performance degradation when a disk device fails. The best results will be achieved by planning the desired configuration in a diagram before modifying system configurations. Any changes should be thoroughly tested before implementing them on a production system.
The following references were used while writing this article:
Ryan would like to thank Rob Jenson from Spotch Consulting for editing the article. He would also like to thank the kernel developers, and the architects and programmers responsible for the MD device driver code.
Originally published in the April ‘04 issue of SysAdmin Magazine