Some interesting insights on the gluster replicated volume replica value

While playing around with gluster, I had an interesting finding about the way gluster handles replicated volumes. The gluster volume I am using for testing is a replicated volume with a replica factor of 2 (the replica factor determines how many copies of your data will be made). I wanted to add a third replica to my volume, and thought it would be as simple as using the “add-brick” option:

$ gluster volume add-brick glustervol01 centos-cluster03.prefetch.net:/gluster/vol01
Incorrect number of bricks supplied 1 for type REPLICATE with count 2

Hmmmm, no go. At first I thought this was no big deal; I figured there was an option or setting I needed to change to increase my replica count. I couldn’t find an option in the official documentation, and after reading through a number of mailing list postings I came across a horrific finding. From what I have been able to gather so far, you cannot add a third replica to a volume that was created with a replica count of 2. Erf!
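
For reference, a two-brick replica 2 volume like the one I am testing with gets created along these lines (the volume name, hostnames and brick paths here are the same ones used throughout this post):

$ gluster volume create glustervol01 replica 2 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01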

Being somewhat curious, I wondered if I could work around this limitation by creating a volume with a replica value higher than the number of bricks specified on the command line. That would let me add the extra replica bricks as needed, giving me some buffer room down the road. Well, this doesn’t work either:

$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

number of bricks is not a multiple of replica count
Usage: volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma>] <NEW-BRICK> ...
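
The check is simple arithmetic: the number of bricks has to be a multiple of the replica count, so a replica 3 volume needs three (or six, nine, etc.) bricks up front. Something like the following would get past the brick-count check (using the third node I was hoping to fold in), which is essentially option #3 from the list below:

$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01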

So this leaves me with the following options to change my volume layout:

1. Replace a brick with the “replace-brick” option (a rough example follows this list).

2. Remove a brick with the “remove-brick” option and then add a new brick with the “add-brick” option.

3. Destroy my volume and re-create it with a replica factor of 3.

4. Add bricks in multiples of the replica count with the “add-brick” option.
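
To put option #1 in concrete terms, the replace-brick workflow on the release I am running migrates the data to the new brick in the background and then commits the swap once the migration finishes. Swapping one of my fedora bricks for the centos node would go roughly like this:

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 start

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 status

$ gluster volume replace-brick glustervol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01 commit

Of course that just swaps one brick for another, so it doesn’t get me any closer to three copies of my data.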

So you may be asking: why not just do #4 and add two more bricks to the volume to make gluster happy? There are two reasons:

1. I only have one node to add at this time (hardware doesn’t grow on trees).

2. Adding two more bricks with “add-brick” would create a distributed replicated volume. This doesn’t increase the replica factor for my data; it just adds another pair of replicated bricks to the volume (see this post for additional detail, and the sketch just below).
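
To make reason #2 concrete, here is the command gluster would happily accept (the second new hostname is hypothetical, since as noted above I only have one node to add). The result is a 2 x 2 distributed replicated volume: files get hashed across two replica pairs, and each individual file still only lives in two places:

$ gluster volume add-brick glustervol01 \
centos-cluster03.prefetch.net:/gluster/vol01 \
centos-cluster04.prefetch.net:/gluster/vol01   # hypothetical fourth node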

As with ALL storage-related solutions, you need to do your homework prior to deploying gluster. Make sure you take into account how things will need to look down the road, and design your gluster solution around this vision (and make sure you have a good backup and recovery solution in place in case you need to make drastic changes). Also make sure to test out your vision to ensure it works as you expect it to. I’m a huge fan of beating the heck out of technology in a lab and learning as much as I can in a non-production environment. I don’t like getting bitten once something goes live and my users depend on it.

If someone is aware of a way to add a third replica to a volume please leave me a comment (as well as a link to documentation that talks about it) and I’ll update the blog entry. I’ve searched and searched and searched and have yet to come up with anything. If there truly is no way to expand the number of replicas in a volume I would consider this a serious limitation of gluster. With disk sizes growing like mad, I could definitely see it being useful to expand the replica factor for an existing volume. It won’t be too long before you can install a 10TB disk drive in your PC, and when those are $80 a pop a replica value of three doesn’t seem that unrealistic (just ask Adam Leventhal).

7 Comments

adrian K  on November 27th, 2011

I’m of the opinion that Gluster isn’t quite ready yet. We had an awful time with it taking an entire box down on large transfers.

The workaround of turning caching off helped some, but only reduced it from a daily occurrence to weekly.

Hoping RedHat’s acquisition does some very good things for what’s a very promising product.

matty  on November 27th, 2011

@adrian — I’m of the same opinion. Great product, but it needs some help. Specifically, the documentation needs to be revamped to explain things in more detail. I’ve been figuring things out via trial and error since the documentation doesn’t cover a number of things I personally think it should. It will be interesting to see what Redhat does with the project.

Another interesting discovery about the gluster replica value  on November 30th, 2011

[…] a previous post I talked about my problems getting gluster to expand the number of replicas in a volume. While experimenting with the gluster utilities “add-brick” option I wanted to see if […]

John Mark  on November 30th, 2011

Hi there,

I would love to recruit you to help with the continued development of GlusterFS. We are now seeking input on roadmap and project development. As such, we’d love to have knowledgeable people step up to be a guiding force on the project. Feel free to stop by the IRC channel sometime: #gluster on irc.freenode.net

Thanks,
John Mark
Gluster Community Guy

django  on December 6th, 2011

Well, I think this is what you want (generated by hand – local read for each node):
>>>>
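# client-side volfile: one protocol/client per server, mirrored three ways by cluster/replicate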
volume server1
type protocol/client
option transport-type tcp
option remote-host a.b.c.d
option transport.socket.nodelay on
option remote-port 6996
option remote-subvolume brick1
end-volume

volume server2
type protocol/client
option transport-type tcp
option remote-host a.b.c.e
option transport.socket.nodelay on
option remote-port 6996
option remote-subvolume brick1
end-volume

volume server3
type protocol/client
option transport-type tcp
option remote-host a.b.c.f
option transport.socket.nodelay on
option remote-port 6996
option remote-subvolume brick1
end-volume

volume mirror-0
type cluster/replicate
option read-subvolume server1
subvolumes server1 server2 server3
end-volume

volume iocache
type performance/io-cache
option cache-size `echo $(( $(grep 'MemTotal' /proc/meminfo | sed 's/[^0-9]//g') / 5120 ))`MB
option cache-timeout 1
subvolumes mirror-0
end-volume

volume quickread
type performance/quick-read
option cache-timeout 1
option max-file-size 64kB
subvolumes iocache
end-volume

volume writebehind
type performance/write-behind
option cache-size 4MB
subvolumes quickread
end-volume

volume statprefetch
type performance/stat-prefetch
subvolumes writebehind
end-volume

<<<<

>>>>
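# server-side volfile (one per node): export the local /data directory as brick1 over TCP port 6996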
volume posix1
type storage/posix
option directory /data
end-volume

volume locks1
type features/locks
subvolumes posix1
end-volume

volume brick1
type performance/io-threads
option thread-count 32
subvolumes locks1
end-volume

volume server-tcp
type protocol/server
option transport-type tcp
option auth.addr.brick1.allow *
option transport.socket.listen-port 6996
option transport.socket.nodelay on
subvolumes brick1
end-volume
<<<<

pfaff  on December 29th, 2011

I have a similar situation today which led me to this discussion.

In my case I want to convert from a distributed volume to a distributed replicated volume.

As you indicated in your option #3, my thought at this time is that the way to go about this is to delete the existing volume (the data will still exist as-is on the member bricks), then create a new d-r volume, and finally trigger a self heal. I’ve done this and it appears to be working exactly as expected.

If this is “The Way” then in your case you would do something similar; delete the d-r-2 volume and create a d-r-3 volume and heal.

I don’t know if this is the only way or the best way to achieve the end result but I don’t know of any alternative yet.

I hope this helps.
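
Spelled out with the gluster command line, the sequence for your case would look roughly like this (the mount point and the find/stat pass to kick off the self heal are just examples):

$ gluster volume stop glustervol01
$ gluster volume delete glustervol01
$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01 \
centos-cluster03.prefetch.net:/gluster/vol01
$ gluster volume start glustervol01
$ mount -t glusterfs fedora-cluster01.homefetch.net:/glustervol01 /mnt/glustervol01
$ find /mnt/glustervol01 -noleaf -print0 | xargs -0 stat > /dev/null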

suwalski  on April 22nd, 2013

Thank you Matty for putting this together. It was a great starting point to help find a solution to this problem.

Looks like this issue has gotten resolved in gluster 3.3. As of 3.3, the gluster add-brick command is as follows:

volume add-brick <VOLNAME> [<stripe|replica> <COUNT>] <NEW-BRICK> ... - add brick to volume

The replica count can be increased or decreased as bricks are added and removed. It does appear to work.
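
In other words, bumping the volume from this post up to three copies should now be a one-liner along these lines (I haven’t tried it against the exact setup described above):

$ gluster volume add-brick glustervol01 replica 3 \
centos-cluster03.prefetch.net:/gluster/vol01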
