Deploying highly available zones with Solaris Cluster 3.2

I discussed my first impressions of Solaris Cluster 3.2 a while back, and have been using it in various capacities ever since. One thing that I really like about Solaris Cluster is its ability to manage resources running in zones, and fail these resources over to other zones running on the same host, or to a zone running on a secondary host. Additionally, Solaris Cluster allows you to migrate zones between nodes, which can be quite handy when resources are tied to a zone and can’t be managed as a scalable service. Configuring zone failover is a piece of cake, and I will describe how to do it in this blog post.

To get failover working, you will first need to download and install Solaris Cluster 3.2. You can grab the binaries from sun.com, and you can install them using the installer script that comes in the zip file:

$ unzip suncluster_3_2u2-ga-solaris-x86.zip


$ cd Solaris_x86


$ ./installer

Once you run through the installer, the binaries should be placed in /usr/cluster, and you should be ready to configure the cluster. Prior to doing so, you should add something similar to the following to /etc/profile to make life easier for cluster administrators:

PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sfw/bin:/usr/cluster/bin
export PATH

TERM=vt100
export TERM

Also, if you selected the option to disable remote services during the Solaris install, you will need to reconfigure the rpc/bind service to allow connections from other cluster members (Solaris Cluster uses RPC extensively for cluster communications). This can be accomplished with the svccfg utility:

$ svccfg
svc:> select rpc/bind
svc:/network/rpc/bind> setprop config/local_only=false


Once the property is adjusted, you can refresh the rpc/bind service so the change takes effect:

$ svcadm refresh rpc/bind
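To confirm the change took effect, you can read the property back with svcprop (a quick check, not part of the original procedure):

$ svcprop -p config/local_only rpc/bind
false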


Now that the environment is set up, you can run scinstall on each node to configure the cluster. I personally like to configure the first node and then add the second node to the cluster (this requires you to run scinstall on node one, then again on node two), but you can configure everything in one scinstall run if you prefer. A rough sketch of the two-pass approach is shown below.
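The menu wording in this sketch is from memory and may differ slightly between releases, so treat the comments as a guide rather than exact prompts:

# On the first node, run scinstall interactively and walk through the
# "Create a new cluster" menu items to establish the cluster.
$ scinstall

# Once the first node reboots into the cluster, run scinstall on the
# second node and choose "Add this machine as a node in an existing
# cluster", naming the first node as the sponsoring node.
$ scinstall

Once scinstall completes and the nodes reboot, you should be able to run the cluster command to see if the cluster is operational: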

$ cluster status

=== Cluster Nodes ===

--- Node Status ---

Node Name                                       Status
---------                                       ------
snode1                                          Online
snode2                                          Online


=== Cluster Transport Paths ===

Endpoint1               Endpoint2               Status
---------               ---------               ------
snode1:e1000g2          snode2:e1000g2          Path online
snode1:e1000g1          snode2:e1000g1          Path online


=== Cluster Quorum ===

--- Quorum Votes Summary ---

            Needed   Present   Possible
            ------   -------   --------
            1        1         1


--- Quorum Votes by Node ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
snode1          1             1              Online
snode2          0             0              Online


=== Cluster Device Groups ===

--- Device Group Status ---

Device Group Name     Primary     Secondary     Status
-----------------     -------     ---------     ------


--- Spare, Inactive, and In Transition Nodes ---

Device Group Name   Spare Nodes   Inactive Nodes   In Transistion Nodes
-----------------   -----------   --------------   --------------------


--- Multi-owner Device Group Status ---

Device Group Name           Node Name           Status
-----------------           ---------           ------

=== Cluster Resource Groups ===

Group Name       Node Name       Suspended      State
----------       ---------       ---------      -----

=== Cluster Resources ===

Resource Name       Node Name       State       Status Message
-------------       ---------       -----       --------------

=== Cluster DID Devices ===

Device Instance              Node               Status
---------------              ----               ------
/dev/did/rdsk/d1             snode1             Ok
                             snode2             Ok

/dev/did/rdsk/d3             snode1             Ok

/dev/did/rdsk/d5             snode2             Ok


=== Zone Clusters ===

--- Zone Cluster Status ---

Name    Node Name    Zone HostName    Status    Zone Status
----    ---------    -------------    ------    -----------



In the cluster status output above, we can see that we have a 2-node cluster that contains the hosts named snode1 and snode2. If there are no errors in the status output, you can register the HAStoragePlus resource type (this manages disk storage, and allows volumes and pools to fail over between nodes) with the cluster. This can be accomplished with the clresourcetype command:

$ clresourcetype register SUNW.HAStoragePlus

Next you will need to create a resource group, which will contain all of the zone resources:

$ clresourcegroup create hazone-rg

Once the resource group is created, you will need to add a HAStoragePlus resource to manage the UFS file system(s) or ZFS pool your zone lives on. In the example below, a HAStoragePlus resource named hazone-zpool is created to manage the ZFS pool named hazonepool:

$ clresource create -t SUNW.HAStoragePlus -g hazone-rg -p Zpools=hazonepool -p AffinityOn=True hazone-zpool

After the storage is configured, you will need to update DNS or /etc/hosts with the name and IP address that you plan to assign to the highly available zone (this is the hostname that hosts will use to access services in the highly available zone). For simplicity, I added an entry to /etc/hosts on each node:

# Add a hazone-lh entry to DNS or /etc/hosts
192.168.1.23 hazone-lh

Next you will need to create a logical hostname resource. This resource type is used to manage interface failover, which allows one or more IP addresses to float between cluster nodes. To create a logical hostname resource, you can use the clreslogicalhostname utility:

$ clreslogicalhostname create -g hazone-rg -h hazone-lh hazone-lh

Now that the storage and logical hostname resources are configured, you can bring the resource group that contains these resources online:

$ clresourcegroup online -M hazone-rg
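Before moving on, it doesn’t hurt to confirm that the group and its resources actually came online (the output will look similar to the clresourcegroup status example shown later in this post):

$ clresourcegroup status hazone-rg

$ clresource status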

If the cluster, clresourcegroup and clresource status commands list everything in an online state, we can create a zone with the zonecfg and zoneadm commands. The zone needs to be installed on each node so that it is placed in the installed state on both, and the Sun documentation recommends removing the installed zone from the first node prior to installing it on the second node. This works, though I think you can play with the index file to simplify the process (this is unsupported though). Once the zones are installed, you should fail the shared storage over to each node in the cluster and boot the zone on each node to be extra certain.
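Here is a minimal zonecfg / zoneadm sketch of what the zone creation might look like. The zonepath matches the hazonepool pool used above, but autoboot and everything else are assumptions to adapt to your environment (with SC_NETWORK=true the logical hostname resource manages the zone’s IP address, so no net resource is configured here):

# Configure the zone on the node that currently owns hazonepool
$ zonecfg -z hazone
zonecfg:hazone> create
zonecfg:hazone> set zonepath=/hazonepool/zones/hazone
zonecfg:hazone> set autoboot=false
zonecfg:hazone> verify
zonecfg:hazone> commit
zonecfg:hazone> exit

# Install the zone, then repeat the configure/install steps on the
# other node after failing hazonepool over to it
$ zoneadm -z hazone install

If the zone installs and boots cleanly on each node, you are ready to register the SUNW.gds resource type: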

$ clresourcetype register SUNW.gds


The SUNW.gds resource type provides the cluster hooks to bring the zone online, and will optionally start one or more services in a zone. To configure the resource type, you will need to create a configuration file that describes the resources used by the zone, the resource group the resources are part of, and the logical hostname to use with the zone. Here is an example configuration file I used to create my highly available zone:

$ cat /etc/zones/sczbt_config
# The resource group that contains the resources the zones depend on
RG=hazone-rg
# The name of the zone resource to create
RS=hazone-zone
# The directory where this configuration file is stored
PARAMETERDIR=/etc/zones
SC_NETWORK=true
# Name of the logical hostname resource
SC_LH=hazone-lh
# Name of the zone you passed to zonecfg -z
Zonename=hazone
Zonebrand=native
Zonebootopt=""
Milestone="multi-user-server"
FAILOVER=true
# ZFS pool that contains the zone
HAS_RS=hazone-zpool

The Solaris Cluster data service guide for Solaris Containers describes each of these parameters, so I won’t go into detail on the individual options. To tell the cluster framework that it will be managing the zone, you can execute the sczbt_register command, passing it the configuration file you created as an argument:

$ cd /opt/SUNWsczone/sczbt/util

$ ./sczbt_register -f /etc/zones/sczbt_config


Once the zone is tied into the cluster framework, you can bring the zone resource group (and the zone) online with the clresourcegroup command:

$ clresourcegroup online -n snode2 hazone-rg


If the zone came online (which it should if everything was executed above), you should see the following:

$ clresourcegroup status

=== Cluster Resource Groups ===

Group Name       Node Name       Suspended      Status
----------       ---------       ---------      ------
hazone-rg        snode1          No             Offline
                 snode2          No             Online



$ zoneadm list -vc

  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   1 hazone           running    /hazonepool/zones/hazone       native   shared



$ zlogin hazone zonename
hazone



If you have services that you want to start and stop when you bring your zone online, you can use SMF or the ServiceStartCommand, ServiceStopCommand and ServiceProbeCommand SUNW.gds configuration options. Here are a couple of sample entries that could be added to the configuration file listed above:

ServiceStartCommand="/usr/local/bin/start-customapp"
ServiceStopCommand="/usr/local/bin/stop-customapp"
ServiceProbeCommand="/usr/local/bin/probe-customapp"

As the names indicate, ServiceStartCommand contains the command to run to start the service, ServiceStopCommand contains the command to run to stop the service, and ServiceProbeCommand contains the command to run to verify the service is up and operational. This is super useful stuff, and it’s awesome that zones will now fail over to a secondary node when a server fails, or when a critical error occurs and a zone is unable to run.
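As a rough illustration (not something from the original configuration), a probe script can be as simple as a process check. The GDS convention is that an exit status of 0 means the service is healthy and 100 signals a complete failure; the process name and script path here are assumptions:

#!/bin/sh
# Hypothetical probe (e.g. /usr/local/bin/probe-customapp) for a custom app.
# Exit 0 when the application process is running, 100 when it is not.
if /usr/bin/pgrep customapp > /dev/null 2>&1; then
    exit 0
fi
exit 100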

Fixing Solaris Cluster device ID (DID) mismatches

I had to replace a disk in one of my cluster nodes, and was greeted with the following message once the disk was swapped and I checked the devices for consistency:

$ cldevice check

cldevice:  (C894318) Device ID "snode2:/dev/rdsk/c1t0d0" does not match physical device ID for "d5".
Warning: Device "snode2:/dev/rdsk/c1t0d0" might have been replaced.



To fix this issue, I used the cldevice utility’s repair option:

$ cldevice repair
Updating shared devices on node 1
Updating shared devices on node 2


Once the repair operation updated the devids, cldevice ran cleanly:

$ cldevice check

Niiiiiiiiice!

The end of the Solaris Cluster /globaldevices file system

I have been working with [Sun|Solaris] for as long as I can recall. One thing that always annoyed me was the need to have a 512MB file system devoted to /globaldevices. With Sun Cluster 3.2 update 2, this is no longer the case:

  >>> Global Devices File System <<<

    Each node in the cluster must have a local file system mounted on 
    /global/.devices/node@<nodeID> before it can successfully participate 
    as a cluster member. Since the "nodeID" is not assigned until 
    scinstall is run, scinstall will set this up for you.

    You must supply the name of either an already-mounted file system or a
    raw disk partition which scinstall can use to create the global 
    devices file system. This file system or partition should be at least 
    512 MB in size.

    Alternatively, you can use a loopback file (lofi), with a new file 
    system, and mount it on /global/.devices/node@<nodeID>.

    If an already-mounted file system is used, the file system must be 
    empty. If a raw disk partition is used, a new file system will be 
    created for you.

    If the lofi method is used, scinstall creates a new 100 MB file system
    from a lofi device by using the file /.globaldevices. The lofi method 
    is typically preferred, since it does not require the allocation of a 
    dedicated disk slice.

    The default is to use /globaldevices.

    Is it okay to use this default (yes/no) [yes]? yes

    Testing for "/globaldevices" on "snode1" ... failed

/globaldevices is not a directory or file system mount point.
Cannot use "/globaldevices" on "snode1".

    Is it okay to use the lofi method (yes/no) [yes]?  yes



This is pretty sweet, and I am SOOOOO glad that /globaldevices as a separate file system is no more! Thanks goes out to the Solaris cluster team for making this wish list item a reality! :)

Running Oracle RAC inside Solaris zones

While perusing my mailing lists this morning, I came across a comment from Ellard Roush on running Oracle RAC inside a Solaris zone:

“There is an active project to support Oracle RAC running in a zone environment. This new feature will be a “Zone Cluster”, which is a virtual cluster where each virtual node is a Zone on a different physical node. The Zone Cluster will be able to support cluster applications in general. From the perspective of the application, the Zone Cluster will appear to be a cluster dedicated to that application.”

I noticed that the “cluster” branded zone type was putback in build 67, and the fact that Sun is going to implement virtual nodes rocks! This will allow you to completely abstract the cluster nodes from the hardware, and will offer a TON of additional flexibility. I can’t wait to play with this goodness!

Deploying highly available Oracle databases with Sun Cluster 3.2

While preparing for my Sun cluster 3.2 exam, I got a chance to play with a number of the Sun Cluster 3.2 data services. One of my favorites was the Oracle HA data service, which allows Sun cluster to monitor and fail over databases in response to system and application failures.

Configuring the Oracle HA data service is amazingly easy, and it took me all of about 5 minutes (plus two hours reading through the Oracle data service documentation, installing Oracle and creating a database). Here are the steps I used to configure Sun Cluster 3.2 to failover an Oracle 10G database between two nodes:

Step 1. Run sccheck

The sccheck utility will verify the cluster configuration, and report errors it finds. This is useful for detecting storage and network misconfigurations early on in the configuration process. To use sccheck, you can run it without any options:

$ sccheck

If sccheck reports errors, make sure to correct them prior to moving on.

Step 2. Create a resource group for the Oracle resources

The Oracle resource group is a container for individual resources, which include virtual IP addresses, disk storage, and resources to manage the Oracle database and listener. To create a new resource group, you can run the clresourcegroup utility with the “create” option and the name of the resource group to create:

$ clresourcegroup create oracleha-rg

After the resource group is created, you will also need to put the resource group into a managed state with the clresourcegroup manage option (you can also set the resource group to the managed state when you create it):

$ clresourcegroup manage oracleha-rg

Once the resource group is in a managed state, you can bring it online with the clresourcegroup “online” option (clrg is the documented short form of clresourcegroup):

$ clrg online oracleha-rg
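As a side note, the manage and online steps can be combined by passing the -M option to the online subcommand, just as was done with the zone resource group earlier in this post:

$ clrg online -M oracleha-rg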

To verify that the resource group is online, you can run the clresourcegroup command with the “status” option and the name of the resource group to check (if you don’t pass a resource group name as an argument, the status of all resource groups is displayed):

$ clrg status oracleha-rg

=== Cluster Resource Groups ===

Group Name        Node Name      Suspended      Status
----------        ---------      ---------      ------
oracleha-rg       snode1         No             Offline
                  snode2         No             Online

Step 3. Register resource types with the cluster framework

Sun cluster comes with a number of resource types, which provide the brains needed to start, stop and monitor a specific type of resource (e.g., storage, virtual IP addresses, Oracle listeners, zones, etc.). To make these resource types available to the cluster framework, you need to use the clresourcetype utility to register them. The following commands will register the HAStoragePlus, oracle_listener and oracle_server resource types with the cluster:

$ clresourcetype register SUNW.HAStoragePlus

$ clresourcetype register SUNW.oracle_listener

$ clresourcetype register SUNW.oracle_server

To check if a resource type is registered with the cluster framework, you can run the clresourcetype utility with the “list” option:

$ clresourcetype list

SUNW.LogicalHostname:2
SUNW.SharedAddress:2
SUNW.derby
SUNW.sctelemetry
SUNW.HAStoragePlus:4
SUNW.oracle_listener:5
SUNW.oracle_server:6

Step 4. Configure HAStoragePlus resource

The Oracle database will store datafiles and controlfiles on some type of shared storage. If the node that is running the database fails, the cluster framework will migrate the shared storage to another cluster member when the database is brought online on an alternate cluster node. To facilitate the storage migration, an HAStoragePlus resource needs to be configured with the shared storage resources. In addition to mounting and unmounting file systems, the HAStoragePlus resource will also monitor the health of the storage.

To create an HAStoragePlus resource that controls the ZFS pool named oracle, run the clresource utility with the “create” subcommand, the “-t” option and the type of resource to create, the “-g” option and the name of the resource group to assign the resource to, a “Zpools” property naming the ZFS pool to manage, an “AffinityOn” property to ensure that the storage is brought online on the same node as the other resources in the resource group, and the name of the resource to create:

$ clresource create -t SUNW.HAStoragePlus \
-g oracleha-rg \
-p Zpools=oracle \
-p AffinityOn=True oracledisk-res

To verify that the ZFS pool is imported and the file systems are online, you can run df to view the file systems, and the clresource utility with the “status” option to view the state of the HAStoragePlus resources:

$ df -h | egrep '(File|oracle)'

Filesystem             size   used  avail capacity  Mounted on
oracle                  35G    24K    35G     1%    /u01
oracle/u02              35G    24K    35G     1%    /u02
oracle/u03              35G    24K    35G     1%    /u03

$ clresource status oracledisk-res

=== Cluster Resources ===

Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
oracledisk-res      snode1         Offline      Offline
                    snode2         Online       Online

If the file systems don’t mount for some reason, you can check /var/adm/messages to see why.
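Something along these lines will usually surface the relevant errors (the grep pattern is just a suggestion):

$ grep -i hastorageplus /var/adm/messages | tail -20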

Step 5. Configure a logical hostname resource

In order for clients to automatically adapt to database failures, they need to be smart enough to reconnect to the database, and they also need to connect to a virtual IP address that follows the database to the active node. Virtual IP addresses are managed by the Sun cluster LogicalHostname resource type (you can also configure scalable addresses that are mastered on more than one node, but that is a topic for a subsequent blog entry). To configure a LogicalHostname resource, you will first need to add the virtual IP address and the name associated with that address to /etc/hosts on each cluster node. After the entry is added, the clreslogicalhostname utility can be run with the “create” option, the “-g” option and the resource group to assign the logical hostname to, the “-h” option and the hostname associated with the virtual IP, and a name to assign to the resource:

$ clreslogicalhostname create -g oracleha-rg -h oracle-lh oraclelh-res

To verify that the LogicalHostname resource is online, you can run ifconfig to see if the virtual IP address is configured on one of the interfaces:

$ grep oracle-lh /etc/hosts
192.168.1.32 oracle-lh

$ ifconfig -a | grep 192.168.1.32
inet 192.168.1.32 netmask ffffff00 broadcast 192.168.1.255

Step 6. Configure Oracle monitoring user

For the cluster framework to detect failures with Oracle (e.g., an Oracle internal error), an Oracle user needs to be created. This user is used by the Oracle data service to log in to the database and check its operational status. The data service checks the health of the database in two ways. The first is a SQL select against the v$archive_dest and v$sysstat views. If the values the agent retrieves from these views don’t change between consecutive runs, the agent will attempt to create a table to force the database to perform some work. For the Oracle data service to function correctly, an Oracle user needs to be created and assigned the privileges to perform these operations.

To create a user named monitor that can create a table in the users tablespace and query the two views listed above, the following SQL can be used:

$ sqlplus "/ as sysdba"
sql> grant connect, resource to monitor identified by COMPLEXPASSWORD;
sql> alter user monitor default tablespace users quota 1m on users;
sql> grant select on v_$sysstat to monitor;
sql> grant select on v_$archive_dest to monitor;
sql> grant create session to monitor;
sql> grant create table to monitor;

To verify that the monitoring user works, you can connect to the database as the monitoring user:

$ sqlplus "monitor/COMPLEXPASSWORD@proddb"

Once connected, you can query both system views, and then attempt to create a table in the user’s default tablespace (in this case, the user is configured to use the users tablespace):

sql> select * from v$sysstat;

sql> select * from v$archive_dest;

sql> create table foo (a number);

sql> drop table foo;

Step 7. Create Oracle server and listener resources

After the file systems and IP addresses are online and the monitoring user is created, oracle_server and oracle_listener resources need to be created. The oracle_server resource is used to stop, start and monitor an Oracle database, and the oracle_listener resource is used to stop, start and monitor the Oracle listener.

To create a new oracle_server resource, the clresource utility can be run with the “create” option and a list of properties (e.g., where the alert log is located, the name of a user to monitor the database, the ORACLE_SID, the ORACLE_HOME, etc) that are needed to start and stop the database:

$ clresource create -g oracleha-rg \
-t SUNW.oracle_server \
-p Connect_string=monitor/COMPLEXPASSWORD \
-p ORACLE_SID=proddb \
-p ORACLE_HOME=/opt/oracle/product/10.2.0.1.0 \
-p Alert_log_file=/opt/oracle/admin/proddb/bdump/alert_proddb.log \
-p resource_dependencies=oracledisk-res oracledb-res

To create an oracle_listener resource, the clresource utility can be run with the “create” option and a list of properties (e.g., listener name, where the ORACLE_HOME resides, etc.) that are needed to start and stop the listener:

$ clresource create -g oracleha-rg \
-t SUNW.oracle_listener \
-p LISTENER_NAME=LISTENER_PRODDB \
-p ORACLE_HOME=/opt/oracle/product/10.2.0.1.0 \
-p resource_dependencies=oracledisk-res,oraclelh-res oraclelistener-res

There is one important thing to note in the commands listed above. Each resource contains a resource dependency, which the cluster framework uses to bring resources online in a specific order. The oracle_server resource needs oracledisk-res to be online prior to starting, since the database won’t be able to start unless the file systems are mounted. The oracle_listener resource also has a dependency on the oracledisk-res resource, as well as a dependency on the logical hostname resource (this ensures that the virtual IP is up and operational prior to starting the listener). To verify that the resources are online and working, the clresource command can be run with the “status” option and the name of a resource:

$ clresource status oracledb-res

=== Cluster Resources ===

Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
oracledb-res        snode1         Offline      Offline
                    snode2         Online       Online

$ clrs status oraclelistener-res

=== Cluster Resources ===

Resource Name          Node Name     State      Status Message
-------------          ---------     -----      --------------
oraclelistener-res     snode1        Offline    Offline
                       snode2        Online     Online
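To double-check the dependency wiring described above, clresource show can print individual resource properties (a sketch; the property name is case-insensitive):

$ clresource show -p Resource_dependencies oraclelistener-res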

Step 8: Verify the cluster configuration

Once all of the resources are created and the dependencies configured, you should verify that the Oracle database can successfully fail over to each node that is configured in the resource group’s node list. This can be accomplished by running the clresourcegroup utility with the “switch” option, the “-n” option and the name of the node to fail the resource group to, and the name of the resource group to failover:

$ clresourcegroup switch -n snode1 oracleha-rg
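To exercise every node in the node list, you can switch the group to the other node as well and check its status between moves:

$ clresourcegroup switch -n snode2 oracleha-rg

$ clrg status oracleha-rg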

If you are able to successfully migrate the database to each node in the cluster, you now have a highly available Oracle database. While I touched on the basics of deploying a highly available Oracle database, I didn’t touch on correctly configuring the cluster framework, or the steps required to achieve maximum availability. I highly recommend reading through the Sun cluster concepts and Oracle data service guides prior to setting up your highly available databases. While the steps listed above work flawlessly in the author’s environment, you should verify everything in a test environment prior to deploying anything in production.

Running processes in fixed time intervals

While messing around with Sun Cluster 3.2, I came across hatimerun. This nifty program can be used to run a program within a fixed amount of time, and kill the program if it runs longer than the time allotted to it. If hatimerun kills a program, it will return a status code of 99. If the program completes within its allotted time, hatimerun will return a status code of 0. To use hatimerun, you need to pass a time interval to the “-t” option, as well as a program to run in that time interval:

$ hatimerun -t 10 /bin/sleep 8

$ echo $?
0

$ hatimerun -t 10 /bin/sleep 12

$ echo $?
99

If anyone knows of a general purpose tool for doing this (preferably something that ships with Solaris or Redhat Enterprise Linux), I would appreciate it if you could leave a comment with further details.