Some interesting insights on the gluster replicated volume replica value

While playing around with gluster, I had an interesting finding about the way gluster handles replicated volumes. The gluster volume I am using for testing is a replicated volume with a replica factor of 2 (the replica factor determines how many copies of your data will be made). I wanted to add a third replica to my volume, and thought it would be as simple as using the “add-brick” option:

$ gluster volume add-brick glustervol01 centos-cluster03.prefetch.net:/gluster/vol01
Incorrect number of bricks supplied 1 for type REPLICATE with count 2

Hmmmm — no go. At first I thought this was no big deal, I figured there was an option or setting I needed to change to increase my replica count. I couldn’t find an option in the official documentation, and after reading through a number of mailing list postings I came across a horrific finding. From what I have been able to gather so far you cannot add a third replica to a volume that was created with a replica count of 2. Erf!
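
For reference, a replica 2 volume like the one I am testing with would have been created with something along these lines (the hostnames are the same ones used in the examples in this post):

$ gluster volume create glustervol01 replica 2 transport tcp \
    fedora-cluster01.homefetch.net:/gluster/vol01 \
    fedora-cluster02.homefetch.net:/gluster/vol01
$ gluster volume start glustervol01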

Being somewhat curious, I was wondering if I could work around this limitation by creating a volume with a replica value higher than the number of bricks that were specified on the command line. This would allow me to grow the number of replicated bricks as needed, giving me some buffer room down the road. Well — this doesn’t work either:

$ gluster volume create glustervol01 replica 3 transport tcp \
fedora-cluster01.homefetch.net:/gluster/vol01 \
fedora-cluster02.homefetch.net:/gluster/vol01

number of bricks is not a multiple of replica count
Usage: volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma>] <NEW-BRICK> ...

So this leaves me with the following options to change my volume layout (a rough command sketch for option 1 follows the list):

1. Replace a brick with the “replace-brick” option.

2. Remove a brick with the “remove-brick” option and then add a new brick with the “add-brick” option.

3. Destroy my volume and re-create it with a replica factor of 3.

4. Add bricks in a multiple of the replica count (two at a time, in my case) with the “add-brick” option.
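
For reference, swapping a brick out with option 1 looks roughly like this on the 3.x command line (the replacement hostname is the one from the failed add-brick attempt above); “start” begins migrating data, “status” reports progress, and “commit” finalizes the swap once the migration is done:

$ gluster volume replace-brick glustervol01 \
    fedora-cluster02.homefetch.net:/gluster/vol01 \
    centos-cluster03.prefetch.net:/gluster/vol01 start
$ gluster volume replace-brick glustervol01 \
    fedora-cluster02.homefetch.net:/gluster/vol01 \
    centos-cluster03.prefetch.net:/gluster/vol01 status
$ gluster volume replace-brick glustervol01 \
    fedora-cluster02.homefetch.net:/gluster/vol01 \
    centos-cluster03.prefetch.net:/gluster/vol01 commit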

So you may be asking: why not just go with option #4 and add two more bricks to the volume to make gluster happy? There are two reasons:

1. I only have one node to add at this time (hardware doesn’t grow on trees).

2. Adding two more bricks with “add-brick” would create a distributed replicated volume. This doesn’t increase the replica factor for my data; it adds a second replicated brick pair and distributes files across the two pairs (see this post for additional detail, and the sketch below).
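
To illustrate the second point, here is roughly what that would look like (the fourth hostname is made up for illustration). The result is a 2 x 2 distributed replicated volume: files are spread across two replica pairs, and each file still only has two copies:

$ gluster volume add-brick glustervol01 \
    centos-cluster03.prefetch.net:/gluster/vol01 \
    centos-cluster04.prefetch.net:/gluster/vol01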

As with ALL storage-related solutions, you need to do your homework prior to deploying gluster. Make sure you take into account how things will need to look down the road, and design your gluster solution around this vision (and make sure you have a good backup and recovery solution in place in case you need to make drastic changes). Also make sure to test out your vision to ensure it works as you expect it to. I’m a huge fan of beating the heck out of technology in a lab and learning as much as I can in a non-production environment. I don’t like getting bitten once something goes live and my users depend on it.

If someone is aware of a way to add a third replica to a volume please leave me a comment (as well as a link to documentation that talks about it) and I’ll update the blog entry. I’ve searched and searched and searched and have yet to come up with anything. If there truly is no way to expand the number of replicas in a volume I would consider this a serious limitation of gluster. With disk sizes growing like mad, I could definitely see it being useful to expand the replica factor for an existing volume. It won’t be too long before you can install a 10TB disk drive in your PC, and when those are $80 a pop a replica value of three doesn’t seem that unrealistic (just ask Adam Leventhal).

7 thoughts on “Some interesting insights on the gluster replicated volume replica value”

  1. I’m of the opinion that Gluster isn’t quite ready yet. We had an awful time with it taking an entire box down on large transfers.

    The workaround of turning caching off helped some, but only reduced it from a daily occurrence to weekly.

    Hoping Red Hat’s acquisition does some very good things for what’s a very promising product.

  2. @adrian — I’m of the same opinion. Great product, but it needs some help. Specifically, the documentation needs to be revamped to explain things in more detail. I’ve been figuring things out via trial and error since the documentation doesn’t cover a number of things I personally think it should. It will be interesting to see what Red Hat does with the project.

  3. Hi there,

    I would love to recruit you to help with the continued development of GlusterFS. We are now seeking input on roadmap and project development. As such, we’d love to have knowledgeable people step up to be a guiding force on the project. Feel free to stop by the IRC channel sometime: #gluster on irc.freenode.net

    Thanks,
    John Mark
    Gluster Community Guy

  4. Well, I think this is what you want (generated by hand – local read for each node):
    >>>>
    volume server1
    type protocol/client
    option transport-type tcp
    option remote-host a.b.c.d
    option transport.socket.nodelay on
    option remote-port 6996
    option remote-subvolume brick1
    end-volume

    volume server2
    type protocol/client
    option transport-type tcp
    option remote-host a.b.c.e
    option transport.socket.nodelay on
    option remote-port 6996
    option remote-subvolume brick1
    end-volume

    volume server3
    type protocol/client
    option transport-type tcp
    option remote-host a.b.c.f
    option transport.socket.nodelay on
    option remote-port 6996
    option remote-subvolume brick1
    end-volume

    volume mirror-0
    type cluster/replicate
    option read-subvolume server1
    subvolumes server1 server2 server3
    end-volume

    volume iocache
    type performance/io-cache
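    # note: the cache-size expression below works out to roughly one fifth of system RAM, in MB (MemTotal is reported in kB, and kB / 5120 = MB / 5)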
    option cache-size `echo $(( $(grep 'MemTotal' /proc/meminfo | sed 's/[^0-9]//g') / 5120 ))`MB
    option cache-timeout 1
    subvolumes mirror-0
    end-volume

    volume quickread
    type performance/quick-read
    option cache-timeout 1
    option max-file-size 64kB
    subvolumes iocache
    end-volume

    volume writebehind
    type performance/write-behind
    option cache-size 4MB
    subvolumes quickread
    end-volume

    volume statprefetch
    type performance/stat-prefetch
    subvolumes writebehind
    end-volume

    <<<<

    >>>>
    volume posix1
    type storage/posix
    option directory /data
    end-volume

    volume locks1
    type features/locks
    subvolumes posix1
    end-volume

    volume brick1
    type performance/io-threads
    option thread-count 32
    subvolumes locks1
    end-volume

    volume server-tcp
    type protocol/server
    option transport-type tcp
    option auth.addr.brick1.allow *
    option transport.socket.listen-port 6996
    option transport.socket.nodelay on
    subvolumes brick1
    end-volume
    <<<<
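
    For anyone less familiar with the legacy volfile approach shown above: the first file is a client-side volfile that replicates data across the three servers (with reads preferred from server1), and the second is the server-side volfile that exports /data as “brick1” on port 6996. Assuming the files were saved as /etc/glusterfs/server.vol and /etc/glusterfs/client.vol, they would be loaded along these lines:

    $ glusterfsd -f /etc/glusterfs/server.vol                # on each server
    $ glusterfs -f /etc/glusterfs/client.vol /mnt/gluster    # on the client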

  5. I have a similar situation today which led me to this discussion.

    In my case I want to convert from a distributed volume to a distributed replicated volume.

    As you indicated in your option #3, my thought at this time is that the way to go about this is to delete the existing volume (the data will remain on the member bricks), then create a new d-r volume, and finally trigger a self heal. I’ve done this and it appears to be working exactly as expected.
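
    With a 3.2-era release that workflow would look roughly like the following (the volume name, bricks, and mount point are made up for illustration; pre-3.3 releases have no “heal” command, so the self heal is triggered by walking the client mount point):

    $ gluster volume stop distvol01
    $ gluster volume delete distvol01
    $ gluster volume create distvol01 replica 2 transport tcp \
        server1:/export/brick1 server2:/export/brick1 \
        server3:/export/brick1 server4:/export/brick1
    $ gluster volume start distvol01
    $ find /mnt/distvol01 -noleaf -print0 | xargs -0 stat > /dev/null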

    If this is “The Way” then in your case you would do something similar; delete the d-r-2 volume and create a d-r-3 volume and heal.

    I don’t know if this is the only way or the best way to achieve the end result but I don’t know of any alternative yet.

    I hope this helps.

  6. Thank you Matty for putting this together. It was a great starting point to help find a solution to this problem.

    Looks like this issue has gotten resolved in gluster 3.3. As of 3.3, the gluster add-brick command is as follows:

    volume add-brick <VOLNAME> [<stripe|replica> <COUNT>] <NEW-BRICK> ... - add brick to volume

    The replica count can be increased or decreased as bricks are added and removed. It does appear to work.
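
    For example, adding a third replica to the volume from the original post would look something like this on 3.3 or later, followed by a full self heal to populate the new brick:

    $ gluster volume add-brick glustervol01 replica 3 \
        centos-cluster03.prefetch.net:/gluster/vol01
    $ gluster volume heal glustervol01 full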
