Creating clustered file systems with glusterfs on CentOS and Fedora Linux servers

I’ve been using gluster for the past few months, and so far I am really impressed with what I’m seeing. For those that haven’t used gluster, it is an open source clustered file system that can provide scalable storage on commodity hardware. As with all file systems and applications, gluster comes with its own vernacular. Here are some terms you will need to know if you are going to gluster it up:

brick – a unit of storage which consists of a server and directory path (i.e., server:/export)
translator – modules that are chained together to move data from point a to point b
volume – a collection of bricks

At the simplest level, a brick contains the name of a server and a directory on that server where stuff will be stored. You combine bricks into volumes based on performance and reliability requirements, and these volumes are then shared with gluster clients through CIFS, NFS or via the glusterfs file system. This is a crazy simple overview, and you should definitely read the official documentation if you are planning to use gluster.

Getting gluster up and working on a Linux server is crazy easy. To start, you will need to install a few packages on your gluster server nodes (these will become your bricks). If you are using CentOS you will need to build the packages from source. Fedora 16 users can install the packages with yum:

$ yum install glusterfs glusterfs-fuse glusterfs-server glusterfs-vim glusterfs-devel

Once the packages are installed you can use the glusterfs utility’s “-V” option to verify everything is working:

$ /usr/sbin/glusterfs -V
glusterfs 3.2.4 built on Sep 30 2011 18:02:31
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc.
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
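
If you end up checking versions across many nodes, the version number can be pulled out of that output with a one-line awk filter (a small sketch; it just parses text in the format shown above):

```shell
#!/bin/sh
# Extract the version number from 'glusterfs -V' style output on stdin.
gluster_version() {
    awk '/^glusterfs / { print $2; exit }'
}

# Feed it the first line of the output shown above:
echo "glusterfs 3.2.4 built on Sep 30 2011 18:02:31" | gluster_version
```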

The key thing to note is the version and the fact that the command completed without error. Next up, you will need to allocate some storage to gluster. This storage will be the end destination for the data your clients write, so you should give some thought to how you are going to organize it. Do you want to use cheap storage and let gluster replicate the data for you? Do you want to use RAID protected storage and gluster striping? Do you want to combine these two for the best performance and availability you can get? That is a decision only you can make.

For my needs, I added a dedicated disk (/dev/sdb) to each of my gluster nodes. I then created a logical volume and EXT4 file system on that device, and mounted it up to /gluster/vol01:

$ pvcreate /dev/sdb

$ vgcreate GlusterVG /dev/sdb

$ lvcreate -l +100%FREE GlusterVG -n glustervol01

$ mkfs.ext4 /dev/mapper/GlusterVG-glustervol01

$ mkdir -p /gluster/vol01

$ mount /dev/mapper/GlusterVG-glustervol01 /gluster/vol01
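
If you are setting up several nodes, the steps above can be wrapped in a small helper. This sketch just prints the commands it would run (a dry run, so it is safe to experiment with; the device, volume group, and mount point are the ones used here and may differ on your systems):

```shell
#!/bin/sh
# Print (dry run) the commands needed to turn a raw disk into a
# mounted EXT4 file system for gluster to use as backing storage.
prepare_gluster_storage() {
    dev="$1"    # e.g. /dev/sdb
    vg="$2"     # e.g. GlusterVG
    lv="$3"     # e.g. glustervol01
    mnt="$4"    # e.g. /gluster/vol01

    echo "pvcreate $dev"
    echo "vgcreate $vg $dev"
    echo "lvcreate -l +100%FREE $vg -n $lv"
    echo "mkfs.ext4 /dev/mapper/$vg-$lv"
    echo "mkdir -p $mnt"
    echo "mount /dev/mapper/$vg-$lv $mnt"
}

prepare_gluster_storage /dev/sdb GlusterVG glustervol01 /gluster/vol01
```

Swap the echoes for the real commands (run as root) once the output looks right for your hardware.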

Once you have storage you will need to create a gluster server volume definition file. This file defines the translators you want to use, as well as the location of the storage gluster will use. Here is the volume definition file I created on each of my server nodes:

$ cat /etc/glusterfs/glusterfsd.vol

volume posix
  type storage/posix
  option directory /gluster/vol01
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow 192.168.1.*
  subvolumes brick
end-volume
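
Since the file is identical on every server node, a small script can stamp it out for you (a sketch; the storage directory and allowed-client pattern default to the values used above, and it writes to a path you pass in so you can try it somewhere harmless first):

```shell
#!/bin/sh
# Write the glusterfsd server volume definition shown above to $1,
# parameterized on the storage directory ($2) and allowed clients ($3).
gen_volfile() {
    out="$1"; dir="${2:-/gluster/vol01}"; allow="${3:-192.168.1.*}"
    cat > "$out" <<EOF
volume posix
  type storage/posix
  option directory $dir
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow $allow
  subvolumes brick
end-volume
EOF
}

# Generate a copy in /tmp to inspect before installing it:
gen_volfile /tmp/glusterfsd.vol
```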

The configuration file above defines four translators. First is the server translator, which handles client communications. It is linked to the io-threads translator, which creates a pool of threads (eight here) to handle operations; that is linked to the locks translator, which handles locking; and that in turn is linked to the posix translator, which writes the actual data to a backing store (/gluster/vol01 in this case). This works as a pipeline, so you can add translators to the flow to gain additional functionality. Kinda nifty!

Now depending on how you plan to use gluster, you may need to add additional translators to your configuration file. We’ll keep it simple for now and use the basic configuration listed above. To start the gluster infrastructure on a CentOS or Fedora server, you should chkconfig on (or systemctl enable) the glusterd and glusterfsd services so they will start at boot:

$ chkconfig glusterd on

$ chkconfig glusterfsd on

and then service start (or systemctl start) the two services:

$ service glusterd restart

Restarting glusterd (via systemctl):                       [  OK  ]

$ service glusterfsd restart

Restarting glusterfsd (via systemctl):                     [  OK  ]
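
Since the same two services need to be enabled and started on every storage node, a loop over the node list saves some typing. This is a dry-run sketch that just prints the commands per node; in practice you would run them over ssh (and use systemctl on Fedora):

```shell
#!/bin/sh
# Print the service setup commands for each storage node.
# Swap 'echo "$node:"' for 'ssh "$node"' to actually run them remotely.
setup_node_services() {
    for node in "$@"; do
        for svc in glusterd glusterfsd; do
            echo "$node: chkconfig $svc on && service $svc start"
        done
    done
}

setup_node_services fedora-cluster01 fedora-cluster02
```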

You will need to perform the tasks above on each server that will act as a gluster cluster node (gluster can scale to hundreds and hundreds of nodes). I don’t have hundreds of machines, only the three measly machines listed below:

fedora-cluster01 – storage brick
fedora-cluster02 – storage brick
fedora-client01 – gluster client

Starting the services will bring the gluster infrastructure up, but the nodes will have no idea what cluster they are in or which volumes they are a part of. To add nodes to a cluster of storage nodes you can log in to one of your gluster nodes and probe the other servers you want to add to the cluster. Probing is done with the gluster utility’s “peer probe” option:

$ gluster peer probe fedora-cluster02
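
If you have more than a couple of nodes to probe, a loop saves typing. This is a dry-run sketch that just prints the commands (fedora-cluster03 is a hypothetical extra node, not one of my machines):

```shell
#!/bin/sh
# Probe a list of peers from one existing cluster node.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=1
probe_peers() {
    for peer in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "gluster peer probe $peer"
        else
            gluster peer probe "$peer"
        fi
    done
}

probe_peers fedora-cluster02 fedora-cluster03
```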

Probing a server should merge the node into your cluster. To see which nodes are active cluster members you can use the gluster utility’s “peer status” option:

$ gluster peer status

Number of Peers: 1

Hostname: fedora-cluster02
Uuid: ca62e586-8edf-42ea-9fd1-5e11dff29db1
State: Peer in Cluster (Connected)

Sweet, in addition to the local machine we have one cluster partner for a total of two nodes. With a working cluster we can now move on to creating volumes. I mentioned above that a volume contains one or more storage bricks that are organized by reliability and availability requirements. There are several volume types available:

Distributed – Files are distributed to a brick in the cluster. This provides no data redundancy.
Replicated – Files are replicated to one or more bricks based on the replica value. This provides data redundancy if replica is greater than one.
Striped – Stripes data across bricks. This provides no data redundancy.
Distributed striped – Stripes data across two or more nodes in the cluster. This provides no data redundancy.
Distributed replicated – Distributes files across replicated bricks in the cluster. This provides data redundancy if replica is greater than one.

Since I only have two nodes in my cluster, I am going to create a replicated volume across two bricks:

$ gluster volume create glustervol01 replica 2 transport tcp fedora-cluster01:/gluster/vol01 fedora-cluster02:/gluster/vol01
Creation of volume glustervol01 has been successful. Please start the volume to access data.
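
One thing to watch: the number of bricks has to be a multiple of the replica value, or the create command will refuse to run. A tiny helper that builds the command string and checks the counts first can save a confusing error (a sketch of my own, not part of the gluster CLI):

```shell
#!/bin/sh
# Build a 'gluster volume create' command from a volume name, a
# replica count and a list of bricks, refusing mismatched counts.
build_create_cmd() {
    vol="$1"; replica="$2"; shift 2
    nbricks=$#
    if [ $((nbricks % replica)) -ne 0 ]; then
        echo "error: $nbricks bricks is not a multiple of replica $replica" >&2
        return 1
    fi
    echo "gluster volume create $vol replica $replica transport tcp $*"
}

build_create_cmd glustervol01 2 \
    fedora-cluster01:/gluster/vol01 fedora-cluster02:/gluster/vol01
```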

My volume (glustervol01) has a replication factor of two (two copies of my data will be made) and the data will be distributed to both of the bricks listed on the command line. To start the volume so clients can use it you can use the gluster utility’s “volume start” option:

$ gluster volume start glustervol01
Starting volume glustervol01 has been successful

To verify the volume is operational we can use the gluster utility’s “volume info” option:

$ gluster volume info glustervol01

Volume Name: glustervol01
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: fedora-cluster01:/gluster/vol01
Brick2: fedora-cluster02:/gluster/vol01

Well hot diggity, we have a working gluster volume. Excellent! To configure a CentOS or Fedora client to use this volume we will need to install the glusterfs and glusterfs-fuse packages:

$ rpm -q -a | grep gluster
glusterfs-3.2.4-1.fc16.x86_64
glusterfs-fuse-3.2.4-1.fc16.x86_64

If the packages aren’t installed you can build from source if you are using CentOS, or install via yum if you are on a Fedora 16 machine:

$ yum install glusterfs glusterfs-fuse

After the packages are installed we can use the mount command to make the gluster volume available to the client. The mount command takes as arguments the file system type (glusterfs), the name of one of the servers, the name of the volume and the location to mount the volume:

$ mount -t glusterfs fedora-cluster01:/glustervol01 /gluster

We can verify it mounted with the df command:

$ df -h /gluster

Filesystem                        Size  Used Avail Use% Mounted on
fedora-cluster01:/glustervol01   36G  176M   34G   1% /gluster

To bring this mount up each time the client boots we can add an entry similar to the following to /etc/fstab:

fedora-cluster01:/glustervol01 /gluster glusterfs defaults,_netdev 0 0
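
If you are rolling out many clients, a small idempotent helper can add that entry without duplicating it on repeat runs (a sketch; it takes the fstab path as an argument so you can try it against a scratch copy instead of the real /etc/fstab):

```shell
#!/bin/sh
# Append a glusterfs mount entry to an fstab-style file, but only
# if an entry for that mount point is not already present.
add_fstab_entry() {
    fstab="$1"; src="$2"; mnt="$3"
    if grep -q "[[:space:]]$mnt[[:space:]]" "$fstab" 2>/dev/null; then
        echo "entry for $mnt already present"
        return 0
    fi
    printf '%s %s glusterfs defaults,_netdev 0 0\n' "$src" "$mnt" >> "$fstab"
}

# Try it against a scratch file first:
add_fstab_entry /tmp/fstab.test fedora-cluster01:/glustervol01 /gluster
```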

Now you may be asking yourself: how does gluster replicate data if only one server is specified to the mount command? The initial mount is used to gather information about the volume, and from then on the client will communicate with all of the nodes that make up the volume. Now that we’ve gone through all this work to mount a volume, we can poke it with dd to make sure it works:

$ cd /gluster

$ dd if=/dev/zero of=foo bs=1M count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 359.066 s, 23.9 MB/s
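
dd already prints throughput, but if you are logging lots of runs it can be handy to compute MB/s yourself from the byte count and elapsed time (a quick awk sketch; like dd, it uses 1 MB = 10^6 bytes):

```shell
#!/bin/sh
# Compute MB/s from a byte count and elapsed seconds, matching the
# final figure dd reports (dd uses SI units, 1 MB = 1000000 bytes).
throughput_mbs() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f MB/s\n", b / s / 1000000 }'
}

# The numbers from the dd run above:
throughput_mbs 8589934592 359.066
```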

Bingo! Our volume is operational, and any data that is written to it will be replicated to two storage bricks in the volume. While I only have a couple months of gluster experience under my belt, there are a few issues that will stop me from deploying it into production:

1. Debugging gluster issues in production is difficult. Rolling out debug binaries or stopping gluster volumes to debug a problem isn’t an option for most shops. I should note that the article that I referenced was written last year, and there are various options now available (–debug flag, detailed logs, profiling, etc.) to assist with debugging problems. I’ve seen references to debug builds and unmounts on the mailing list, so this leads me to believe this is still an issue. I’ll know if the debug options are up to the task when I start trying to break things. :)

2. There is no easy way to find out the mapping of files to bricks. In most cases this shouldn’t matter, but for recoverability I would like to see a tool added.

3. Security is based on IP addresses and subnets. Stronger authentication and encryption of the control and data paths are a necessity at most institutions that have to comply with federal and state laws.

4. If a node fails and is brought back up at a later time, the files that were changed aren’t replicated to the faulted server until they are accessed. The documentation talks about running a find across the volume from a client, but this seems kinda kludgy.

I really see a bright future for gluster. It fills a niche (a scalable clustered file system) that has been largely untouched, and I can truly see it taking off. It’s also open source, so you can tinker around with it to your heart’s content. I’ll be adding more posts related to gluster performance, debugging and maintenance down the road.

8 thoughts on “Creating clustered file systems with glusterfs on CentOS and Fedora Linux servers”

  1. Nice article. I’m not sure if we’ve met, but in case we haven’t (or for others reading this) I’m the project lead for HekaFS which is based on GlusterFS. Since the acquisition of Gluster by Red Hat, I’m also a member of the GlusterFS architecture team. So yes, I’m a bit biased. ;)

    “2. There is no easy way to find out the mapping of files to bricks”

    Actually there is a semi-secret way. It’s a magic extended attribute – i.e. it doesn’t really exist on disk, but when you ask for it GlusterFS will construct a value. In this case the command you’d want is something like “getfattr -n trusted.glusterfs.pathinfo /some/glusterfs/file” which will show information about where a file was actually placed through distribution, replication, etc. This has actually been there, but it was enhanced recently (and the format of the returned data changed) to support Hadoop queries for the same information.

    “3. Security is based on IP addresses and subnets.”

    Yeah, lame huh? This is part of the reason HekaFS exists, and allows you to use full SSL certificates (with or without a PKI) to perform both authentication and network encryption. This change is making its way through the GlusterFS patch-review system, with another to follow implementing authorization as well as authentication based on the SSL certificates. That’s just I/O path; separate work is in progress to secure the management interfaces as well.

    “4. If a node fails and is brought back up at a later time, the files that were changed aren’t replicated to the faulted server until they are accessed.”

    This is also a common pain point. In fact I submitted a patch and instructions several months ago. That code probably still works, but is not being actively promoted because I’m working on some new replication code that incorporates that idea plus several others to improve performance as well as repair time/transparency.

    I’m glad you like GlusterFS so far. I hope you’ll like it even more when some of these sub-projects come to fruition. :)

  2. Hello again. I have a question for you, the speed of your dd test was: 23.9 MB/s. Is the speed of your net card 1000Mbps and it is connected to a standard PCI bus? If not, what is the speed of your net card?

    Thanks in advance.
