Another interesting finding about gluster replicas

In a previous post I talked about my problems getting gluster to expand the number of replicas in a volume. While experimenting with the gluster utility's “add-brick” option, I wanted to see if adding two more bricks would replicate the existing data across four bricks (two old, two new), or if the two new bricks would become a replica pair of their own while the two previous bricks remained a pair. To find out, I added two more bricks:

$ gluster volume add-brick glustervol01 \
centos-cluster01.homefetch.net:/gluster/vol01 \
centos-cluster02.homefetch.net:/gluster/vol01

Add Brick successful

And then checked out the status of the volume:

$ gluster volume info glustervol01

Volume Name: glustervol01
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: fedora-cluster01.homefetch.net:/gluster/vol01
Brick2: fedora-cluster02.homefetch.net:/gluster/vol01
Brick3: centos-cluster01.homefetch.net:/gluster/vol01
Brick4: centos-cluster02.homefetch.net:/gluster/vol01

Interesting. The volume is now a distributed-replicate volume with a two by two configuration, giving four bricks in total. This layout is similar in spirit to RAID 10, except gluster distributes whole files across mirrored pairs rather than striping blocks. The two previous bricks form one mirror, and the two new bricks form the second mirror. I confirmed this by copying files to my gluster file system and then checking the bricks to see where the files landed:

$ cd /gluster

$ cp /etc/services file1

$ cp /etc/services file2

$ cp /etc/services file3

$ cp /etc/services file4

$ ls -la

total 2648
drwxr-xr-x   4 root root   8192 Nov 27  2011 .
dr-xr-xr-x. 23 root root   4096 Nov 12 15:44 ..
drwxr-xr-x   2 root root  16384 Nov 27  2011 etc1
-rw-r--r--   1 root root 656517 Nov 27  2011 file1
-rw-r--r--   1 root root 656517 Nov 27  2011 file2
-rw-r--r--   1 root root 656517 Nov 27  2011 file3
-rw-r--r--   1 root root 656517 Nov 27  2011 file4
drwx------   2 root root  20480 Nov 26 21:11 lost+found

Four files were copied to the gluster file system, and it looks like two landed on each replicated pair of bricks. Here is the ls listing from the first pair (I pulled this from one of the two nodes):

$ ls -la

total 1328
drwxr-xr-x. 4 root root   4096 Nov 27 10:00 .
drwxr-xr-x. 3 root root   4096 Nov 26 17:53 ..
drwxr-xr-x. 2 root root   4096 Nov 27 10:00 etc1
-rw-r--r--. 1 root root 656517 Nov 27 10:00 file1
-rw-r--r--. 1 root root 656517 Nov 27 10:01 file2
drwx------. 2 root root  16384 Nov 26 21:11 lost+found

And here is the listing from the second replicated pair of bricks:

$ ls -la

total 1324
drwxr-xr-x   4 root root   4096 Nov 27 10:00 .
drwxr-xr-x   3 root root   4096 Nov 12 20:05 ..
drwxr-xr-x 126 root root  12288 Nov 27 10:00 etc1
-rw-r--r--   1 root root 656517 Nov 27 10:00 file3
-rw-r--r--   1 root root 656517 Nov 27 10:00 file4
drwx------   2 root root   4096 Nov 26 21:11 lost+found

So there you have it. Adding two more bricks with “add-brick” adds a new pair of replicated bricks; it doesn't mirror the data between the old bricks and the new ones. Given the description of a distributed replicated volume in the official documentation, this makes total sense. Now to play around with some of the other redundancy types.
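As an aside, newer gluster releases expose a virtual extended attribute that reports exactly which bricks hold a given file. If your version supports it, you can query the attribute on the client mount instead of logging into each node (the mount point and file name below are from my test setup):

$ getfattr -n trusted.glusterfs.pathinfo /gluster/file1

The pathinfo attribute prints the backend brick paths that store the file, which makes verifying the replica layout much quicker.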

Removing a gluster volume doesn’t remove the volume’s contents

I made another interesting discovery this weekend while playing around with the gluster volume deletion option. Prior to creating a volume with a new layout, I went through the documented process to remove my volume:

$ gluster volume stop glustervol01
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume glustervol01 has been successful

$ gluster volume delete glustervol01
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
Deleting volume glustervol01 has been successful

I then re-created it using the documented process:

$ gluster volume create glustervol01 replica 2 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

Creation of volume glustervol01 has been successful. Please start the volume to access data.

$ gluster volume start glustervol01
Starting volume glustervol01 has been successful

Once the new volume was created and started, I mounted it on my clients. When I went to access the volume, I was quite intrigued to find that the data written to the previous gluster volume was still present:

$ ls -l

total 24
drwxr-xr-x 126 root root 12288 Nov 26  2011 etc
drwxr-xr-x 126 root root  4096 Nov 26 13:07 etc2
drwxr-xr-x 126 root root  4096 Nov 26 13:07 etc3
drwx------   2 root root  4096 Nov 26  2011 lost+found

Ey? Since ‘gluster volume delete’ spit out the message “Deleting volume will erase all information about the volume”, I figured the contents of the volume would be nuked (never assume, always confirm!). That doesn't appear to be the case. When you delete a volume, the metadata that describes the volume is the only thing that is removed; the files on the bricks are left untouched. It would be helpful if the developers noted this in the output above, since I can see it causing headaches for folks down the road.
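If you actually want the data gone, you have to clean up each brick yourself after deleting the volume. Here is a minimal sketch, assuming your bricks live at /gluster/vol01 (run it on every node that hosts a brick). Depending on your gluster version, you may also need to strip the extended attributes gluster stamps on the brick root before the directory can be reused in a new volume:

$ rm -rf /gluster/vol01/*

$ setfattr -x trusted.glusterfs.volume-id /gluster/vol01

$ setfattr -x trusted.gfid /gluster/vol01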

Some interesting insights on the gluster replicated volume replica value

While playing around with gluster, I made an interesting discovery about the way it handles replicated volumes. The gluster volume I am using for testing is a replicated volume with a replica factor of 2 (the replica factor determines how many copies of your data will be made). I wanted to add a third replica to my volume, and thought it would be as simple as using the “add-brick” option:

$ gluster volume add-brick glustervol01 centos-cluster03.prefetch.net:/gluster/vol01
Incorrect number of bricks supplied 1 for type REPLICATE with count 2

Hmmmm — no go. At first I thought this was no big deal; I figured there was an option or setting I needed to change to increase my replica count. I couldn't find one in the official documentation, and after reading through a number of mailing list postings I came across a horrific finding. From what I have been able to gather so far, you cannot add a third replica to a volume that was created with a replica count of 2. Erf!

Being somewhat curious, I was wondering if I could work around this limitation by creating a volume with a replica value higher than the number of bricks that were specified on the command line. This would allow me to grow the number of replicated bricks as needed, giving me some buffer room down the road. Well — this doesn’t work either:

$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

number of bricks is not a multiple of replica count
Usage: volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma>] <NEW-BRICK> ...

So this leaves me with the following options to change my volume layout:

1. Replace a brick with the “replace-brick” option (see the sketch below).

2. Remove a brick with the “remove-brick” option and then add a new brick with the “add-brick” option.

3. Destroy my volume and re-create it with a replica factor of 3.

4. Add replica count bricks with the “add-brick” option.

So you may be asking: why not just do #4 and add two more bricks to the volume to make gluster happy? There are two reasons:

1. I only have one node to add at this time (hardware doesn’t grow on trees).

2. Adding two more bricks with “add-brick” would create a distributed replicated volume. This doesn't increase the replica factor for my data; it adds another replicated pair of bricks to the volume (see the add-brick experiment above for additional detail).
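For completeness, here is roughly what option 1 looks like. This is a sketch rather than something I ran against this volume, and the third hostname is made up; replace-brick migrates the data to the new brick, and the commit step finalizes the change:

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.homefetch.net:/gluster/vol01 start

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.homefetch.net:/gluster/vol01 commit

You can check on the data migration with the “status” keyword in place of “start” prior to committing.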

As with ALL storage-related solutions, you need to do your homework prior to deploying gluster. Make sure you take into account how things will need to look down the road, and design your gluster solution around this vision (and make sure you have a good backup and recovery solution in place in case you need to make drastic changes). Also make sure to test out your vision to ensure it works as you expect it to. I’m a huge fan of beating the heck out of technology in a lab and learning as much as I can in a non-production environment. I don’t like getting bitten once something goes live and my users depend on it.

If someone is aware of a way to add a third replica to an existing volume, please leave me a comment (along with a link to documentation that talks about it) and I'll update the blog entry. I've searched and searched and searched, and have yet to come up with anything. If there truly is no way to expand the number of replicas in a volume, I would consider this a serious limitation of gluster. With disk sizes growing like mad, I could definitely see it being useful to expand the replica factor of an existing volume. It won't be too long before you can install a 10TB disk drive in your PC, and when those are $80 a pop a replica value of three doesn't seem that unrealistic (just ask Adam Leventhal).

CentOS 6 Linux VMs running inside vSphere 4.1 appear to dynamically discover new LUNs

I made an interesting discovery yesterday while working on a CentOS 6 gluster node. The node was virtualized inside vSphere 4.1 and needed some additional storage. I went into the VI client and added a new disk while the server was running, expecting to have to reboot or rescan the storage devices for the server to see it. Well, I was pleasantly surprised when the following messages popped up on the console:

[Console screenshot: kernel messages showing the new disk being attached]

Nice, it looks like the device was added to the system dynamically! I ran dmesg to confirm:

$ dmesg | tail -14

mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x5000c295575f0957
scsi 2:0:1:0: Direct-Access     VMware   Virtual disk     1.0  PQ: 0 ANSI: 2
sd 2:0:1:0: [sdb] 75497472 512-byte logical blocks: (38.6 GB/36.0 GiB)
sd 2:0:1:0: [sdb] Write Protect is off
sd 2:0:1:0: [sdb] Mode Sense: 03 00 00 00
sd 2:0:1:0: [sdb] Cache data unavailable
sd 2:0:1:0: [sdb] Assuming drive cache: write through
sd 2:0:1:0: Attached scsi generic sg2 type 0
sd 2:0:1:0: [sdb] Cache data unavailable
sd 2:0:1:0: [sdb] Assuming drive cache: write through
 sdb: unknown partition table
sd 2:0:1:0: [sdb] Cache data unavailable
sd 2:0:1:0: [sdb] Assuming drive cache: write through
sd 2:0:1:0: [sdb] Attached SCSI disk

Rock on! In the past I’ve had to reboot virtual machines or rescan the storage devices to find new LUNs. This VM was configured with a LSI Logic SAS controller and is running CentOS 6. I’m not sure if something changed in the storage stack in CentOS 6, or if the SAS controller is the one to thank for this nicety. Either way I’m a happy camper, and I love it when things just work! :)
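For what it's worth, if your distribution doesn't pick up new devices on its own, you can usually trigger a rescan by hand through sysfs. A quick sketch (host0 is an assumption; check /sys/class/scsi_host to see which adapters your system has):

$ echo "- - -" > /sys/class/scsi_host/host0/scan

$ dmesg | tail

The three dashes are wildcards for the channel, target and LUN, so this asks the adapter to scan for anything new.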

Defragmenting EXT4 file systems with e4defrag (coming soon to a distribution near you)

If you have been around the systems engineering field, you have probably read about file system fragmentation at some point. Fragmentation typically occurs when files are randomly updated over time, and the blocks that comprise each file get scattered across different areas of the disk. This forces the drive to do more work, since the heads have to perform additional seeks instead of sequentially reading data off of a given platter. Any time you can do sequential I/O instead of random I/O, you are better off.
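If you are curious how fragmented a given file is right now, the filefrag utility that ships with e2fsprogs will report the number of extents the file occupies (the path below is just an example):

$ filefrag -v /mnt/file

The “-v” option prints each extent, which is handy for spotting files that are scattered all over the disk.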

Several file systems provide defragmentation tools, and it looks like the EXT4 engineers are planning to ship one as well. The tool is called e4defrag, and a test version is available in the e2fsprogs git repository. This utility will be super handy once it's stable, and should assist with getting every last bit of performance out of your EXT4 file systems.
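If you want to kick the tires, e4defrag lives in the e2fsprogs source tree. At the time of this writing you can grab it straight from the kernel.org git repository and build it yourself:

$ git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git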

Before I provide an overview of e4defrag, I need to state that it is currently released for testing only, and there is the possibility of data corruption if you use it!! Do not use this utility on anything you need to keep around! This isn't my opinion; it's the opinion of the developers themselves:

$ e4defrag -c /mnt

**************************************************
This is a release only for testing, and bugs may
exist which could corrupt your data.  Please invoke
with "-test" if you wish to use it at this time.
**************************************************
Usage	: e4defrag [-v] file...| directory...| device...
	: e4defrag  -c  file...| directory...| device...

With that said, e4defrag can be used to defragment a file, a directory, or an entire file system residing on a device. To see how fragmented your file system is, you can run e4defrag with the “-c” option:

$ e4defrag -c -test /mnt

                             now/best          ratio
1. /mnt/mail/domaintable.db                      2/1            50.00%
2. /mnt/mail/virtusertable.db                    2/1            50.00%
3. /mnt/mail/mailertable.db                      2/1            50.00%
4. /mnt/file                                 72148/1             0.77%
5. /mnt/elinks.conf                              1/1             0.00%

 Total/best extents				74010/1860
 Fragmentation ratio				3.84%
 Fragmentation score				30.72
 [0-30 no problem: 31-55 a little bit fragmented: 55- needs defrag]
 This directory(/mnt) does not need defragmentation.

Nifty! The utility prints the files that are fragmented, the number of extents associated with each file, and the fragmentation ratio. Here is the full description of the fields from the source code:

struct frag_statistic_ino {
        int now_count;          /* the file's extents count of before defrag */
        int best_count;         /* the best file's extents count */
        __u64 size_per_ext;     /* size(KB) per extent */
        float ratio;            /* the ratio of fragmentation */
        char msg_buffer[PATH_MAX + 1];  /* pathname of the file */
};

To defragment a fragmented file, directory, or device, you can run e4defrag without the “-c” option:

$ e4defrag -test /mnt

This will take some time, and the result will hopefully be files that aren't scattered all throughout the hard drive. While this tool isn't ready for primetime, it's nice to know it will be available down the road. You are on your own if you decide to use e4defrag. I provide zero warranties or assurances on the information provided, and you are seriously putting your data at risk if you choose to ignore the various warnings here.

Using rpcdebug to debug Linux NFS client and server issues

Debugging NFS issues can sometimes be a chore, especially when you are dealing with busy NFS servers. Tools like nfswatch and nfsstat can help you understand what operations your server is servicing, but sometimes you need to get into the protocol guts to find out more. There are a couple of ways to do this. First, you can capture NFS traffic with tcpdump and use the NFS protocol decoding features built into wireshark. This lets you see the NFS traffic going over the network, which in most cases is enough to solve a given caper.
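Capturing the traffic for wireshark is straightforward. Something along these lines will grab full packets (the interface name and the standard NFS port 2049 are assumptions; adjust for your environment):

$ tcpdump -i eth0 -s 0 -w /tmp/nfs.pcap port 2049

The “-s 0” option captures entire packets rather than just the headers, which wireshark needs to fully decode the NFS payloads.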

Alternatively, you can use the rpcdebug utility to log additional debugging data on a Linux NFS client or server. The debugging data includes information on the RPC module, the lock manager module, and the NFS client and server modules. The debugging flags you can set inside these modules can be displayed with rpcdebug's “-vh” option:

$ rpcdebug -vh

usage: rpcdebug [-v] [-h] [-m module] [-s flags...|-c flags...]
       set or cancel debug flags.

Module     Valid flags
rpc        xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache all
nfs        vfs dircache lookupcache pagecache proc xdr file root callback client mount all
nfsd       sock fh export svc proc fileop auth repcache xdr lockd all
nlm        svc client clntlock svclock monitor clntsubs svcsubs hostcache xdr all

To enable a specific set of flags, you can pass rpcdebug the name of a module and the flags to enable:

$ rpcdebug -m nfsd -s proc
nfsd proc

In the example above I told the nfsd module to log debugging data about the RPC procedures it receives. If I hop on an NFS client and write 512 bytes to a file, I see the NFS procedures that were issued in the server's messages file:

Nov  1 09:53:18 zippy kernel: nfsd: LOOKUP(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 foo1
Nov  1 09:53:18 zippy kernel: nfsd: CREATE(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 foo1
Nov  1 09:53:18 zippy kernel: nfsd: WRITE(3)    36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 512 bytes at 0
Nov  1 09:53:18 zippy kernel: nfsd: COMMIT(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 0@0

To log ALL of the nfsd debugging data possible, you can set the flags argument to all (i.e., run rpcdebug -m nfsd -s all). This enables every flag in the module, which will cause an excessive amount of data to be logged (don't do this on production servers unless you have a specific need to!). Here is the output you would see from a single 512-byte write when all of the flags are enabled:

Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 4
Nov  1 09:56:18 zippy kernel: nfsd: ACCESS(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 0x1f
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)
Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 1
Nov  1 09:56:18 zippy kernel: nfsd: GETATTR(3)  36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)
Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 4
Nov  1 09:56:18 zippy kernel: nfsd: ACCESS(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 0x2d
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)
Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 2
Nov  1 09:56:18 zippy kernel: nfsd: SETATTR(3)  36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)
Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 7
Nov  1 09:56:18 zippy kernel: nfsd: WRITE(3)    36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 512 bytes at 0
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)
Nov  1 09:56:18 zippy kernel: nfsd: write complete host_err=512
Nov  1 09:56:18 zippy kernel: nfsd_dispatch: vers 3 proc 21
Nov  1 09:56:18 zippy kernel: nfsd: COMMIT(3)   36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3 0@0
Nov  1 09:56:18 zippy kernel: nfsd: fh_verify(36: 01070001 0003fe95 00000000 ec426dec 9f402ff3 d2aac8b3)

Now you may be asking yourself: what is that stuff following the procedure name? The data logged along with the procedure name contains the contents of the data structures sent as part of a given procedure. You can view the data structures and their members by opening up the correct NFS RFC and checking the argument list. Here is what the NFSv3 RFC (RFC 1813) says about the argument passed to the WRITE procedure:

struct WRITE3args {
         nfs_fh3     file;
         offset3     offset;
         count3      count;
         stable_how  stable;
         opaque      data<>;
};

The structure contains the file handle to perform the operation on, the offset at which to perform the write, and the number of bytes to write. It also contains a stable flag that tells the server whether it should COMMIT the data prior to returning a success code to the client. The last field contains the data itself.

Once you have a better understanding of the fields passed to a procedure, you can cross-reference the lines in the messages file with the dprintk() statements in the kernel. Here is the dprintk() statement from the nfsd WRITE code:

dprintk("nfsd: WRITE    %s %d bytes at %d\n",
         SVCFH_fmt(&argp->fh),
         argp->len, argp->offset);

This matches up exactly with the contents of the log file, and the dprintk() code shows the order in which everything is logged. This is useful stuff, though it's probably not something you will need daily. ;) I'm documenting it here so I can recall the procedure down the road.
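One last housekeeping note: the debugging flags stay set until you clear them, so once you have captured the data you need, turn them back off with the “-c” option:

$ rpcdebug -m nfsd -c all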