Slides from my introduction to gluster presentation

I gave a talk on Gluster this evening at our local UNIX users group. I had a good time presenting, and enjoyed interacting with everyone who came out. The introduction to Gluster slides are available in the presentation section of my website, and I’ll make sure they get linked from the users group website as well. Thanks to everyone who attended; I’m truly blessed to have so many awesome friends, and to be able to share information with other admins in the area. Now go forth and Gluster!!

Which file system should I use with Gluster?

I was reading through the Gluster 3.2.5 release notes today and came across the following blurb:

Red Hat recommends XFS when formatting the disk sub-system. XFS supports metadata journaling, which facilitates quicker crash recovery. The XFS file system can also be de-fragmented and enlarged while mounted and active. Any other POSIX compliant disk file system, such as Ext3, Ext4, ReiserFS may also work, but has not been tested widely.

This is pretty cool, and it’s the first time I recall seeing Red Hat recommend XFS over EXT3 or EXT4. Good to know!
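If you want to follow that recommendation, getting a brick onto XFS only takes a couple of commands. Here is a minimal sketch, assuming the brick lives on /dev/sdb1 and is mounted at /gluster/vol01; the device name, mount point, and the larger inode size (often suggested to leave room for gluster’s extended attributes) are my assumptions, not something the release notes mandate:

# build the XFS file system with 512-byte inodes
$ mkfs.xfs -i size=512 /dev/sdb1

# mount it where the brick will live
$ mkdir -p /gluster/vol01

$ mount -t xfs /dev/sdb1 /gluster/vol01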

Another interesting finding about gluster replicas

In a previous post I talked about my problems getting gluster to expand the number of replicas in a volume. While experimenting with the gluster utility’s “add-brick” option I wanted to see if adding two more bricks would replicate the existing data across four bricks (two old, two new), or if the two new bricks would become a replica pair and the two previous bricks would remain a replica pair. To see what would happen I added two more bricks:

$ gluster volume add-brick glustervol01 \
centos-cluster01.homefetch.net:/gluster/vol01 \
centos-cluster02.homefetch.net:/gluster/vol01

Add Brick successful

And then checked out the status of the volume:

$ gluster volume info glustervol01

Volume Name: glustervol01
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: fedora-cluster01.homefetch.net:/gluster/vol01
Brick2: fedora-cluster02.homefetch.net:/gluster/vol01
Brick3: centos-cluster01.homefetch.net:/gluster/vol01
Brick4: centos-cluster02.homefetch.net:/gluster/vol01

Interesting. The volume is now a distributed-replicated volume, with a two by two configuration giving four bricks in total. This layout is similar to RAID 10, where you stripe across mirrors: the two original bricks form one replica pair, and the two new bricks form the second. I confirmed this by copying files to my gluster file system and then checking the bricks to see where the files landed:

$ cd /gluster

$ cp /etc/services file1

$ cp /etc/services file2

$ cp /etc/services file3

$ cp /etc/services file4

$ ls -la

total 2648
drwxr-xr-x   4 root root   8192 Nov 27  2011 .
dr-xr-xr-x. 23 root root   4096 Nov 12 15:44 ..
drwxr-xr-x   2 root root  16384 Nov 27  2011 etc1
-rw-r--r--   1 root root 656517 Nov 27  2011 file1
-rw-r--r--   1 root root 656517 Nov 27  2011 file2
-rw-r--r--   1 root root 656517 Nov 27  2011 file3
-rw-r--r--   1 root root 656517 Nov 27  2011 file4
drwx------   2 root root  20480 Nov 26 21:11 lost+found

Four files were copied to the gluster file system, and it looks like two landed on each replicated pair of bricks. Here is the ls listing from the first pair (I pulled this from one of the two nodes):

$ ls -la

total 1328
drwxr-xr-x. 4 root root   4096 Nov 27 10:00 .
drwxr-xr-x. 3 root root   4096 Nov 26 17:53 ..
drwxr-xr-x. 2 root root   4096 Nov 27 10:00 etc1
-rw-r--r--. 1 root root 656517 Nov 27 10:00 file1
-rw-r--r--. 1 root root 656517 Nov 27 10:01 file2
drwx------. 2 root root  16384 Nov 26 21:11 lost+found

And here is the listing from the second replicated pair of bricks:

$ ls -la

total 1324
drwxr-xr-x   4 root root   4096 Nov 27 10:00 .
drwxr-xr-x   3 root root   4096 Nov 12 20:05 ..
drwxr-xr-x 126 root root  12288 Nov 27 10:00 etc1
-rw-r--r--   1 root root 656517 Nov 27 10:00 file3
-rw-r--r--   1 root root 656517 Nov 27 10:00 file4
drwx------   2 root root   4096 Nov 26 21:11 lost+found

So there you have it. Adding two more bricks with “add-brick” creates a new pair of replicated bricks; it doesn’t mirror the data between the old bricks and the new ones. Given the description of a distributed replicated volume in the official documentation this makes total sense. Now to play around with some of the other redundancy types.
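For anyone curious what a couple of the other layouts look like at create time, here is a rough sketch using the same lab hosts. The volume names and brick directories below are made up for the example, and the syntax is from the 3.2 CLI help, so verify against the documentation before relying on it:

# plain distributed volume: files are spread across the bricks, no redundancy
$ gluster volume create distvol01 transport tcp \
fedora-cluster01.homefetch.net:/gluster/dist01 \
fedora-cluster02.homefetch.net:/gluster/dist01

# striped volume: large files are chunked across the bricks, again no redundancy
$ gluster volume create stripevol01 stripe 2 transport tcp \
fedora-cluster01.homefetch.net:/gluster/stripe01 \
fedora-cluster02.homefetch.net:/gluster/stripe01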

Removing a gluster volume doesn’t remove the volume’s contents

I made another interesting discovery this weekend while playing around with the gluster volume deletion option. Prior to creating a volume with a new layout, I went through the documented process to remove my volume:

$ gluster volume stop glustervol01
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume glustervol01 has been successful

$ gluster volume delete glustervol01
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
Deleting volume glustervol01 has been successful

I then re-created it using the documented process:

$ gluster volume create glustervol01 replica 2 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

Creation of volume glustervol01 has been successful. Please start the volume to access data.

$ gluster volume start glustervol01
Starting volume glustervol01 has been successful

Once the new volume was created and started I mounted it on my clients. When I went to access the volume I was quite intrigued to find that the data written to the previous gluster volume was still present:

$ ls -l

total 24
drwxr-xr-x 126 root root 12288 Nov 26  2011 etc
drwxr-xr-x 126 root root  4096 Nov 26 13:07 etc2
drwxr-xr-x 126 root root  4096 Nov 26 13:07 etc3
drwx------   2 root root  4096 Nov 26  2011 lost+found

Ey? Since ‘gluster volume delete’ spit out the message “Deleting volume will erase all information about the volume”, I figured the contents of the volume would be nuked (never assume, always confirm!). That doesn’t appear to be the case. Deleting a volume only removes the metadata that describes the volume; the data sitting on the bricks is left untouched. It would be helpful if the developers noted this in the output above, since I can see it causing headaches for folks down the road.
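If you actually want the old data gone before the bricks are reused, you need to clean them up yourself on each server. Here is a minimal sketch using my lab’s brick path (triple-check the path before handing anything to rm -rf); the brick directory also carries gluster extended attributes, which getfattr will show you:

# run on every server that hosted a brick for the deleted volume
$ rm -rf /gluster/vol01/*

# the brick directory itself still has gluster extended attributes attached
$ getfattr -d -m . -e hex /gluster/vol01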

Some interesting insights on the gluster replicated volume replica value

While playing around with gluster, I ran into something interesting about the way it handles replicated volumes. The volume I am using for testing is a replicated volume with a replica factor of 2 (the replica factor determines how many copies of your data will be kept). I wanted to add a third replica to the volume, and thought it would be as simple as using the “add-brick” option:

$ gluster volume add-brick glustervol01 centos-cluster03.prefetch.net:/gluster/vol01
Incorrect number of bricks supplied 1 for type REPLICATE with count 2

Hmmmm, no go. At first I thought this was no big deal; I figured there was an option or setting I needed to change to increase my replica count. I couldn’t find one in the official documentation, and after reading through a number of mailing list postings I came across a horrific finding: from what I have been able to gather so far, you cannot add a third replica to a volume that was created with a replica count of 2. Erf!

Being somewhat curious, I was wondering if I could work around this limitation by creating a volume with a replica value higher than the number of bricks that were specified on the command line. This would allow me to grow the number of replicated bricks as needed, giving me some buffer room down the road. Well — this doesn’t work either:

$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

number of bricks is not a multiple of replica count
Usage: volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma>] <NEW-BRICK> ...

So this leaves me with the following options to change my volume lay out:

1. Replace a brick with the “replace-brick” option.

2. Remove a brick with the “remove-brick” option and then add a new brick with the “add-brick” option.

3. Destroy my volume and re-create it with a replica factor of 3.

4. Add replica count bricks with the “add-brick” option.

So you may be asking: why not just do #4 and add two more bricks to the volume to make gluster happy? There are two reasons:

1. I only have one node to add at this time (hardware doesn’t grow on trees).

2. Adding two more bricks with “add-brick” would create a distributed replicated volume. This doesn’t increase the replica factor for my data; it adds another pair of replicated bricks to the volume (see the previous post for additional detail).
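For reference, here is roughly what option #1 would look like using the “replace-brick” option to swap one of my existing bricks for the new CentOS node. The hostnames are from my lab and the syntax is from the CLI help, so double-check the documentation before running this against data you care about:

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 start

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 status

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 commit

The start step kicks off the data migration, status lets you watch its progress, and commit finalizes the swap once the migration has completed.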

As with ALL storage-related solutions, you need to do your homework prior to deploying gluster. Make sure you take into account how things will need to look down the road, and design your gluster solution around this vision (and make sure you have a good backup and recovery solution in place in case you need to make drastic changes). Also make sure to test out your vision to ensure it works as you expect it to. I’m a huge fan of beating the heck out of technology in a lab and learning as much as I can in a non-production environment. I don’t like getting bitten once something goes live and my users depend on it.

If someone is aware of a way to add a third replica to a volume please leave me a comment (as well as a link to documentation that talks about it) and I’ll update the blog entry. I’ve searched and searched and searched and have yet to come up with anything. If there truly is no way to expand the number of replicas in a volume I would consider this a serious limitation of gluster. With disk sizes growing like mad, I could definitely see it being useful to expand the replica factor for an existing volume. It won’t be too long before you can install a 10TB disk drive in your PC, and when those are $80 a pop a replica value of three doesn’t seem that unrealistic (just ask Adam Leventhal).

Installing gluster on a CentOS machine via rpmbuild

I talked previously about my experience getting gluster up and running on Fedora and CentOS Linux servers. The installation process as it currently stands is different between the two. The Fedora package maintainers have built RPMs for gluster, so you can use yum to install everything needed to run it:

$ yum install glusterfs glusterfs-fuse glusterfs-server glusterfs-vim glusterfs-devel

Gluster packages aren’t currently available for CentOS 6 (at least they aren’t in extras or centosplus as of this morning), so you are required to build from source if you want to use CentOS as your base operating system. The build process is pretty straightforward, and I’ll share my notes and gotchas with you.

Before you compile a single piece of source code you will need to make sure the development tools group, as well as the fuse, rpcbind, readline, libibverbs and rpm development packages, are installed. If they aren’t, you can run yum to add them to your system:

$ yum -y groupinstall "Development tools"

$ yum -y install fuse fuse-devel rpcbind readline-devel libibverbs-devel rpm-devel

Once the prerequisites are installed you can download and build gluster:

$ wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/glusterfs-3.2.5.tar.gz

$ rpmbuild -ta glusterfs-3.2.5.tar.gz

The rpmbuild utility’s “-ta” option (build source and binary packages from a tar archive) builds RPMs straight from the tarball, and the resulting packages are placed in the rpmbuild directory in your home directory once rpmbuild does its magic:

$ cd /home/matty/rpmbuild/RPMS/x86_64

$ ls -la

total 5896
drwxr-xr-x. 2 root root    4096 Nov 26 17:32 .
drwxr-xr-x. 3 root root    4096 Nov 26 17:32 ..
-rw-r--r--. 1 root root 1895624 Nov 26 17:32 glusterfs-core-3.2.5-1.el6.x86_64.rpm
-rw-r--r--. 1 root root 3988860 Nov 26 17:32 glusterfs-debuginfo-3.2.5-1.el6.x86_64.rpm
-rw-r--r--. 1 root root   49260 Nov 26 17:32 glusterfs-fuse-3.2.5-1.el6.x86_64.rpm
-rw-r--r--. 1 root root   50732 Nov 26 17:32 glusterfs-geo-replication-3.2.5-1.el6.x86_64.rpm
-rw-r--r--. 1 root root   35032 Nov 26 17:32 glusterfs-rdma-3.2.5-1.el6.x86_64.rpm

The glusterfs.spec file in the tar archive you downloaded includes the RPM specification, so you can extract the archive and review this file if you are curious what rpmbuild is being instructed to do. To install the packages above we can use our good buddy rpm:

$ rpm -ivh *

Preparing...                ########################################### [100%]
   1:glusterfs-core         ########################################### [ 20%]
   2:glusterfs-fuse         ########################################### [ 40%]
   3:glusterfs-rdma         ########################################### [ 60%]
   4:glusterfs-geo-replicati########################################### [ 80%]
   5:glusterfs-debuginfo    ########################################### [100%]
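A quick query against the rpm database is an easy way to confirm everything registered (the exact package list will vary with the version you built):

$ rpm -qa 'glusterfs*'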

If the packages installed successfully you can configure your gluster node and start the glusterd service:

$ chkconfig glusterd on

$ service glusterd start
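With glusterd running on each server, forming the trusted storage pool is just a matter of probing the other nodes from one of them and checking the result (hostnames are from my lab setup, so substitute your own):

$ gluster peer probe centos-cluster02.homefetch.net

$ gluster peer status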

While not quite as easy as ‘yum install gluster*’, it’s still pretty darn simple to get gluster installed and operational on a CentOS Linux server.