I discussed my first impressions of Solaris Cluster 3.2 a while back, and have been using it in various capacities ever since. One thing that I really like about Solaris Cluster is its ability to manage resources running in zones, and to fail those resources over to another zone running on the same host, or to a zone running on a secondary host. Additionally, Solaris Cluster allows you to migrate zones between nodes, which can be quite handy when resources are tied to a zone and can’t be managed as scalable services. Configuring zone failover is a piece of cake, and I will describe how to do it in this blog post.
To get failover working, you will first need to download and install Solaris Cluster 3.2. You can grab the binaries from sun.com, and you can install them using the installer script that comes in the zip file:
$ unzip suncluster_3_2u2-ga-solaris-x86.zip
$ cd Solaris_x86
$ ./installer
Once you run through the installer, the binaries should be placed in /usr/cluster, and you should be ready to configure the cluster. Prior to doing so, you should add something similar to the following to /etc/profile to make life easier for cluster administrators:
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sfw/bin:/usr/cluster/bin
export PATH
TERM=vt100
export TERM
Also, if you chose to disable remote services during the Solaris install, you should reconfigure the rpc/bind service to allow connections from other cluster members (Solaris Cluster uses RPC extensively for cluster communications). This can be accomplished with the svccfg utility:
$ svccfg
svc:> select rpc/bind
svc:/network/rpc/bind> setprop config/local_only=false
Once the properties are adjusted, you can refresh the rpc/bind service to get these properties to go into effect:
$ svcadm refresh rpc/bind
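If you want to double-check that the change took effect, svcprop can read the property back (it should report false after the refresh):

$ svcprop -p config/local_only rpc/bind
false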
Now that the environment is set up, you can run scinstall on each node to configure the cluster. I personally like to configure the first node and then add the second node to the cluster (this requires you to run scinstall on node one, and then again on node two), but you can configure everything in one scinstall run if you prefer. Once scinstall completes and the nodes reboot, you should be able to run the cluster command to see if the cluster is operational:
$ cluster status
=== Cluster Nodes ===

--- Node Status ---

Node Name                                       Status
---------                                       ------
snode1                                          Online
snode2                                          Online

=== Cluster Transport Paths ===

Endpoint1               Endpoint2               Status
---------               ---------               ------
snode1:e1000g2          snode2:e1000g2          Path online
snode1:e1000g1          snode2:e1000g1          Path online

=== Cluster Quorum ===

--- Quorum Votes Summary ---

            Needed   Present   Possible
            ------   -------   --------
            1        1         1

--- Quorum Votes by Node ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
snode1          1             1              Online
snode2          0             0              Online

=== Cluster Device Groups ===

--- Device Group Status ---

Device Group Name     Primary     Secondary     Status
-----------------     -------     ---------     ------

--- Spare, Inactive, and In Transition Nodes ---

Device Group Name   Spare Nodes   Inactive Nodes   In Transition Nodes
-----------------   -----------   --------------   -------------------

--- Multi-owner Device Group Status ---

Device Group Name           Node Name           Status
-----------------           ---------           ------

=== Cluster Resource Groups ===

Group Name       Node Name       Suspended      State
----------       ---------       ---------      -----

=== Cluster Resources ===

Resource Name       Node Name       State       Status Message
-------------       ---------       -----       --------------

=== Cluster DID Devices ===

Device Instance               Node               Status
---------------               ----               ------
/dev/did/rdsk/d1              snode1             Ok
                              snode2             Ok

/dev/did/rdsk/d3              snode1             Ok

/dev/did/rdsk/d5              snode2             Ok

=== Zone Clusters ===

--- Zone Cluster Status ---

Name   Node Name   Zone HostName   Status   Zone Status
----   ---------   -------------   ------   -----------
In the cluster status output above, we can see that we have a two-node cluster containing the hosts snode1 and snode2. If there are no errors in the status output, you can register the HAStoragePlus resource type (this manages disk storage, and allows volumes and pools to fail over between nodes) with the cluster. This can be accomplished with the clresourcetype command:
$ clresourcetype register SUNW.HAStoragePlus
Next you will need to create a resource group, which will contain all of the zone resources:
$ clresourcegroup create hazone-rg
Once the resource group is created, you will need to add an HAStoragePlus resource to manage the UFS file system(s) or ZFS pool your zone lives on. In the example below, a resource named hazone-zpool is added to manage the ZFS pool named hazonepool:
$ clresource create -t SUNW.HAStoragePlus -g hazone-rg -p Zpools=hazonepool -p AffinityOn=True hazone-zpool
After the storage is configured, you will need to update DNS or /etc/hosts with the name and IP address that you plan to assign to the highly available zone (this is the hostname that other hosts will use to access services in the highly available zone). For simplicity, I added an entry to /etc/hosts on each node:
# Add a hazone-lh entry to DNS or /etc/hosts
192.168.1.23 hazone-lh
Next you will need to create a logical hostname resource. This resource type is used to manage interface failover, which allows one or more IP addresses to float between cluster nodes. To create a logical hostname resource, you can use the clreslogicalhostname utility:
$ clreslogicalhostname create -g hazone-rg -h hazone-lh hazone-lh
Now that the storage and logical hostname resources are configured, you can bring the resource group that contains these resources online:
$ clresourcegroup online -M hazone-rg
If the cluster, clresourcegroup and clresource status commands list everything in an online state, we can create a zone with the zonecfg and zoneadm commands. The zone needs to be installed on each node so that it ends up in the installed state everywhere, and the Sun documentation recommends removing the installed zone from the first node prior to installing it on the second node. This works, although I think you can play with the zone index file to simplify the process (doing so is unsupported, though). Once the zone is installed, you should fail the shared storage over to each node in the cluster and boot the zone there to be extra certain everything works.
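Here is a rough sketch of that workflow, assuming the zone is named hazone and lives on the hazonepool ZFS pool managed by the hazone-zpool resource (the zonecfg settings shown are a minimal example and will vary with your environment):

# On snode1: configure and install the zone on the shared pool
$ zonecfg -z hazone
zonecfg:hazone> create
zonecfg:hazone> set zonepath=/hazonepool/zones/hazone
zonecfg:hazone> set autoboot=false
zonecfg:hazone> commit
zonecfg:hazone> exit
$ zoneadm -z hazone install

# Detach the zone, switch the resource group (and the pool) to snode2,
# then repeat the configure / install / detach steps there
$ zoneadm -z hazone detach
$ clresourcegroup switch -n snode2 hazone-rg

If the zone boots cleanly on each node, you are ready to register the SUNW.gds resource type: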
$ clresourcetype register SUNW.gds
The SUNW.gds resource type provides the cluster hooks to bring the zone online, and will optionally start one or more services in a zone. To configure the resource type, you will need to create a configuration file that describes the resources used by the zone, the resource group the resources are part of, and the logical hostname to use with the zone. Here is an example configuration file I used to create my highly available zone:
$ cat /etc/zones/sczbt_config
# The resource group that contains the resources the zones depend on
RG=hazone-rg
# The name of the zone resource to create
RS=hazone-zone
# The directory where this configuration file is stored
PARAMETERDIR=/etc/zones
SC_NETWORK=true
# Name of the logical hostname resource
SC_LH=hazone-lh
# Name of the zone you passed to zonecfg -z
Zonename=hazone
Zonebrand=native
Zonebootopt=""
Milestone="multi-user-server"
FAILOVER=true
# ZFS pool that contains the zone
HAS_RS=hazone-zpool
The Solaris Cluster guide for highly available containers describes each of these parameters, so I won’t go into detail on the individual options. To tell the cluster framework that it will be managing the zone, you can execute the sczbt_register command, passing it the configuration file you created as an argument:
$ cd /opt/SUNWsczone/sczbt/util
$ ./sczbt_register -f /etc/zones/sczbt_config
Once the zone is tied into the cluster framework, you can bring the zone resource group (and the zone) online with the clresourcegroup command:
$ clresourcegroup online -n snode2 hazone-rg
If the zone came online (which it should have, assuming everything above ran cleanly), you should see something like the following:
$ clresourcegroup status
=== Cluster Resource Groups ===

Group Name       Node Name       Suspended      Status
----------       ---------       ---------      ------
hazone-rg        snode1          No             Offline
                 snode2          No             Online

$ zoneadm list -vc
  ID NAME      STATUS     PATH                        BRAND    IP
   0 global    running    /                           native   shared
   1 hazone    running    /hazonepool/zones/hazone    native   shared
$ zlogin hazone zonename
hazone
If you have services that you want to start and stop when your zone is brought online and offline, you can use SMF or the ServiceStartCommand, ServiceStopCommand and ServiceProbeCommand SUNW.gds configuration options. Here are a couple of sample entries that could be added to the configuration file listed above:
ServiceStartCommand="/usr/local/bin/start-customapp"
ServiceStopCommand="/usr/local/bin/stop-customapp"
ServiceProbeCommand="/usr/local/bin/probe-customapp"
As the names indicate, ServiceStartCommand is the command run to start the service, ServiceStopCommand is the command run to stop it, and ServiceProbeCommand is the command run periodically to verify that the service is up and operational.
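To give a feel for what a probe command might look like, here is a minimal sketch of a probe script (the application name and script path are hypothetical; as I understand it, the GDS agent treats an exit status of 0 as healthy and 100 as a complete failure):

#!/bin/sh
# Hypothetical probe for a custom application running in the hazone zone.
# Exit 0 if the application is healthy, 100 to signal complete failure.
if /usr/bin/pgrep -z hazone customapp > /dev/null 2>&1; then
    exit 0
fi
exit 100

This is super useful stuff, and it’s awesome that zones will now fail over to a secondary node when a server fails, or when a critical error occurs and a zone is unable to run.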
I was reviewing the hardware configuration on one of my CentOS Linux 5.3 hosts this past weekend, and was curious which chipset was in use and whether the host supported the AMD virtualization extensions. To get a high-level overview of the devices installed in the system, I looked through /etc/sysconfig/hwconf (this file is populated at boot time with the current hardware configuration):
$ more /etc/sysconfig/hwconf
-
class: OTHER
bus: PCI
detached: 0
driver: shpchp
desc: "nVidia Corporation CK804 PCIE Bridge"
vendorId: 10de
deviceId: 005d
subVendorId: 0000
subDeviceId: 0000
pciType: 1
pcidom: 0
pcibus: 80
pcidev: e
pcifn: 0
-
.....
To see if the CPUs supported the AMD virtualization extensions, I poked around /proc/cpuinfo:
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 254
stepping : 1
cpu MHz : 1000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips : 2008.99
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 254
stepping : 1
cpu MHz : 2800.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips : 5625.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
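One thing to note: the flags lines above do not contain the svm flag, which is how AMD-V shows up in /proc/cpuinfo (the Intel equivalent is vmx). A quick grep makes the check explicit, and on this box it returns 0, since the Opteron 254 predates AMD-V:

$ egrep -c 'svm|vmx' /proc/cpuinfo
0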
After reviewing the processor details, I used lspci to extract some additional details from the PCI buses:
$ lspci -v |more
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
Subsystem: Sun Microsystems Computer Corp. Unknown device 534a
Flags: bus master, 66MHz, fast devsel, latency 0
Capabilities: [44] HyperTransport: Slave or Primary Interface
Capabilities: [e0] HyperTransport: MSI Mapping
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
Subsystem: Sun Microsystems Computer Corp. Unknown device 534a
Flags: bus master, 66MHz, fast devsel, latency 0
I/O ports at
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
Subsystem: Sun Microsystems Computer Corp. Unknown device 534a
Flags: 66MHz, fast devsel
I/O ports at 1000 [size=32]
I/O ports at 7000 [size=64]
I/O ports at 7040 [size=64]
Capabilities: [44] Power Management version 2
Once I looked through the lspci output, I ran dmidecode to gather the SMBIOS data (this has the chipset information):
$ dmidecode
Handle 0x0004, DMI type 4, 35 bytes.
Processor Information
Socket Designation: CPU0-Socket 940
Type: Central Processor
Family: Opteron
Manufacturer: AMD
ID: 51 0F 02 00 FF FB 8B 07
Signature: Extended Family 0, Model 5, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (Fast floating-point save and restore)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
Version: AMD
Voltage: 1.2 V
External Clock: 200 MHz
Max Speed: 3000 MHz
Current Speed: 2800 MHz
Status: Populated, Enabled
Upgrade: None
L1 Cache Handle: 0x0008
L2 Cache Handle: 0x0009
L3 Cache Handle: Not Provided
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
I like the fact that dmidecode breaks down the processor flags for you, since it saved me a round-trip to the kernel source code. I found exactly what I needed in the output above, and am now off to purchase another machine (one that has 2 PCI-X slots and virtualization extensions) for my lab.
I recently patched one of my Solaris 10 hosts, and decided to test out the zone update on attach functionality that is now part of Solaris 10 update 6. The update on attach feature allows detached zones to get patched when they are attached to a host, which can be rather handy if you are moving zones around your infrastructure. To test this functionality, I first detached a zone from the host I was going to patch:
$ zoneadm -z zone1 detach
$ zoneadm list -vc
  ID NAME      STATUS       PATH             BRAND    IP
   0 global    running      /                native   shared
   - zone1     configured   /zones/zone1     native   shared
Once the zone was detached, I applied the latest Solaris patch bundle and rebooted the server. When the system came back up, I tried to attach the zone:
$ zoneadm -z zone1 attach
These patches installed on the source system are inconsistent with this system:
118668: version mismatch
(17) (19)
118669: version mismatch
(17) (19)
119060: version mismatch
(44) (45)
119091: version mismatch
(31) (32)
119214: version mismatch
(17) (18)
119247: version mismatch
(34) (35)
119253: version mismatch
(29) (31)
119255: version mismatch
(59) (65)
119314: version mismatch
(24) (26)
119758: version mismatch
(12) (14)
119784: version mismatch
(07) (10)
120095: version mismatch
(21) (22)
120200: version mismatch
(14) (15)
120223: version mismatch
(29) (31)
120273: version mismatch
(23) (25)
120411: version mismatch
(29) (30)
120544: version mismatch
(11) (14)
120740: version mismatch
(04) (05)
121119: version mismatch
(13) (15)
121309: version mismatch
(14) (16)
121395: version mismatch
(01) (03)
122213: version mismatch
(28) (32)
122912: version mismatch
(13) (15)
123896: version mismatch
(05) (10)
124394: version mismatch
(08) (09)
124629: version mismatch
(09) (10)
124631: version mismatch
(19) (24)
125165: version mismatch
(12) (13)
125185: version mismatch
(08) (11)
125333: version mismatch
(03) (05)
125540: version mismatch
(04) (06)
125720: version mismatch
(24) (28)
125732: version mismatch
(02) (04)
125953: version mismatch
(17) (18)
126364: version mismatch
(06) (07)
126366: version mismatch
(12) (14)
126420: version mismatch
(01) (02)
126539: version mismatch
(01) (02)
126869: version mismatch
(02) (03)
136883: version mismatch
(01) (02)
137122: version mismatch
(03) (06)
137128: version mismatch
(02) (05)
138224: version mismatch
(02) (03)
138242: version mismatch
(01) (05)
138254: version mismatch
(01) (02)
138264: version mismatch
(02) (03)
138286: version mismatch
(01) (02)
138372: version mismatch
(02) (06)
138628: version mismatch
(02) (07)
138857: version mismatch
(01) (02)
138867: version mismatch
(01) (02)
138882: version mismatch
(01) (02)
These patches installed on this system were not installed on the source system:
125556-02
138889-08
139100-01
139463-02
139482-01
139484-05
139499-04
139501-02
139561-02
139580-02
140145-01
140384-01
140456-01
140775-03
141009-01
141015-01
As you can see in the above output, the zone refused to attach because the zone patch database differed from the global zone patch database. To synchronize the two, I added the “-u” option (update the zone when it is attached to a host) to the zoneadm command line:
$ zoneadm -z zone1 attach -u
Getting the list of files to remove
Removing 1209 files
Remove 197 of 197 packages
Installing 1315 files
Add 197 of 197 packages
Updating editable files
The file within the zone contains a log of the zone update.
Once the zone was updated, I was able to boot the zone without issue:
$ zoneadm -z zone1 boot
$ zoneadm list -vc
  ID NAME      STATUS     PATH             BRAND    IP
   0 global    running    /                native   shared
   4 zone1     running    /zones/zone1     native   shared
This is pretty sweet, and I can see myself using this functionality (along with live upgrade) in the future!
While preparing my presentation for the Atlanta UNIX users group this past weekend, I accidentally created a VNIC with the wrong name. Instead of removing it and recreating it, I decided to test out the “rename-link” subcommand that was introduced as part of project Clearview. The rename feature worked swimmingly, and with a single dladm command I was able to rename the vnic6 interface to vnic5:
$ dladm create-vnic -l e1000g0 vnic6
$ dladm rename-link vnic6 vnic5
$ dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE        VID
vnic1        switch0      0      2:8:20:b:5f:1     random             0
vnic5        e1000g0      1000   2:8:20:56:3a:82   random             0
vnic3        switch0      0      2:8:20:23:6c:7e   random             0
vnic4        switch1      0      2:8:20:90:d9:26   random             0
vnic2        switch1      0      2:8:20:ea:7:18    random             0
I am super impressed with the work that the Sun networking team has done to simplify network interface management!
A few months back, I got wind from the world-famous Jarod Jenson that Keith McGuigan was working on adding JSDT probes to Java. JSDT probes struck me as an extremely useful feature for real-time problem analysis, so I decided to instrument some sample code to see how they worked. After a couple of e-mail exchanges with Keith, I had a working Java build that supported JSDT. After reviewing a number of Java projects, I chose to instrument the PostgreSQL type IV JDBC driver, since I wanted to learn PostgreSQL along with JDBC. After reading through a fair amount of the driver code and Keith’s blog, I decided to add the following probes to the PostgreSQL driver:
$ dtrace -l -m java_tracing
   ID   PROVIDER            MODULE         FUNCTION      NAME
13199   postgresqljdbc1619  java_tracing   unspecified   transactioncommit
13200   postgresqljdbc1619  java_tracing   unspecified   transactionrollback
13201   postgresqljdbc1619  java_tracing   unspecified   executesqlstatementstart
13202   postgresqljdbc1619  java_tracing   unspecified   getconnectionfrompoolstop
13203   postgresqljdbc1619  java_tracing   unspecified   executepreparessqlstatementstart
13204   postgresqljdbc1619  java_tracing   unspecified   transactionautomcommit
13205   postgresqljdbc1619  java_tracing   unspecified   releaseconnectionstop
13206   postgresqljdbc1619  java_tracing   unspecified   getconnectionfrompoolstart
13207   postgresqljdbc1619  java_tracing   unspecified   executepreparedsqlstatementstop
13208   postgresqljdbc1619  java_tracing   unspecified   returnconnectiontopoolstop
13209   postgresqljdbc1619  java_tracing   unspecified   transactionsavepoint
13210   postgresqljdbc1619  java_tracing   unspecified   acquireconnectionstart
13211   postgresqljdbc1619  java_tracing   unspecified   returnconnectiontopoolstart
13212   postgresqljdbc1619  java_tracing   unspecified   releaseconnectionstart
13213   postgresqljdbc1619  java_tracing   unspecified   acquireconnectionstop
13214   postgresqljdbc1619  java_tracing   unspecified   executesqlstatementstop
DTrace uses providers to group probes, so the first thing I did was create a “postgresqljdbc” provider that would be visible to DTrace through the JSDT framework. This was achieved by defining an interface that extends the com.sun.tracing.Provider interface:
package org.postgresql;
public interface postgresqljdbc extends com.sun.tracing.Provider {
// Probes that fire when a simple SQL statement begins / ends execution
void executesqlstatementstart(String query);
void executesqlstatementstop(String query);
// Probes that fire when a prepared statement begins / ends execution
void executepreparessqlstatementstart(String query);
void executepreparedsqlstatementstop(String query);
// Probe that fires when a new connection is established / released
void acquireconnectionstart(String host, int port, String database, String user);
void acquireconnectionstop(String host, int port, String database, String user);
void releaseconnectionstart(String host, int port, String user, String passwd);
void releaseconnectionstop(String host, int port, String user, String passwd);
// Probes that fire when a DB connection pool is accessed / released
void getconnectionfrompoolstart(String host, int port, String user, String passwd);
void getconnectionfrompoolstop(String host, int port, String user, String passwd);
void returnconnectiontopoolstart(String host, int port, String user, String passwd);
void returnconnectiontopoolstop(String host, int port, String user, String passwd);
// Probe that fires when a transaction starts, saves, commits or rolls back
void transactionautomcommit(boolean value);
void transactionsavepoint(String value);
void transactioncommit();
void transactionrollback();
}
Now that the provider was defined, I needed to import the Provider and ProviderFactory classes that are part of the JSDT framework, and add the code to instantiate a new postgresqljdbc object:
import com.sun.tracing.Provider;
import com.sun.tracing.ProviderFactory;
public class Driver implements java.sql.Driver
{
.....
public static ProviderFactory factory = ProviderFactory.getDefaultFactory();
public static postgresqljdbc provider = factory.createProvider(postgresqljdbc.class);
Next, I started adding probe points to the driver. This was super easy, since a probe point is nothing more than a call to one of the provider methods defined in the interface that extends the Provider class. Here are the probes I added to capture SQL statement execution:
public boolean executeWithFlags(String p_sql, int flags) throws SQLException
{
org.postgresql.Driver.provider.executesqlstatementstart(p_sql);
checkClosed();
p_sql = replaceProcessing(p_sql);
Query simpleQuery = connection.getQueryExecutor().createSimpleQuery(p_sql);
execute(simpleQuery, null, QueryExecutor.QUERY_ONESHOT | flags);
this.lastSimpleQuery = simpleQuery;
org.postgresql.Driver.provider.executesqlstatementstop(p_sql);
return (result != null && result.getResultSet() != null);
}
In this specific code block, I added an “org.postgresql.Driver.provider.executesqlstatementstart” probe, which fires when the SQL statement begins execution, and an “executesqlstatementstop” probe, which fires after the SQL statement has executed. In both cases, the argument available to DTrace scripts is the SQL statement (represented as a String) that is being executed.
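Since both probes hand DTrace the SQL text, you can poke at them straight from the command line. Here is a rough sketch of a one-liner that times each statement (I’m assuming the String argument comes through as a user-space pointer that can be read with copyinstr(), and the postgresqljdbc* wildcard matches the pid-suffixed provider name shown earlier):

$ dtrace -n '
postgresqljdbc*:::executesqlstatementstart
{
    /* remember when the statement started */
    self->ts = timestamp;
}

postgresqljdbc*:::executesqlstatementstop
/self->ts/
{
    /* print the SQL text and the elapsed time in nanoseconds */
    printf("%s took %d ns", copyinstr(arg0), timestamp - self->ts);
    self->ts = 0;
}'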
Next up, I ran ant to build a new PostgreSQL driver, and then executed a test program to verify that everything worked:
$ export CLASSPATH=/home/matty/java/postgrestest:/home/matty/java/postgrestest/postgresql.jar
$ export JAVA_HOME=/usr/java/
$ java -Djdbc.drivers=org.postgresql.Driver -XX:+ExtendedDTraceProbes Test
The Test program does nothing more than create SQL connections, execute SQL statements and print the results to STDOUT. To allow me to see the difference between prepared and unprepared statements, I decided to add unique probes for each statement type.
After debugging a few issues, I was able to run the Test program and a few DTrace scripts I had created. The first script, sqlconnections.d, listed new connections to the database and the time it took to create these connections:
$ ./sqlconnections.d
TIMESTAMP              HOST        PORT   DATABASE   USER    TIME
2008 Oct 10 15:59:47   localhost   5432   postgres   matty   17048632
2008 Oct 10 15:59:51   localhost   5432   postgres   matty   17597898
The next script, sqltimes.d, listed each SQL statement that was executed, along with the total time spent executing it:
$ sqltimes.d
Tracing started (CNTRL+C to end)
^C
SQL statements:

SQL QUERY                               COUNT        TIME
update matty set col1='one'                 2      722359
select * from matty                         4     1456311

Prepared statements:

SQL QUERY                               COUNT        TIME
update matty set col1='one'                 2      548765
select * from matty                         4     1345987
The third script, sqltrace.d, displayed each SQL statement as it was executed, along with the time it took to run:
$ ./sqltrace.d
TIMESTAMP              SQL QUERY                        TIME
1970 Mar  1 16:59:06   select * from matty              314021
1970 Mar  1 16:59:06   select * from matty              251793
1970 Mar  1 16:59:06   update matty set col1='one'      308901
I also had fun playing with the transaction and JDBC connection pool probes, but that data wasn’t nearly as interesting as the SQL statement execution times listed above. If you are running OpenJDK on Solaris and you want better visibility into your applications, JSDT may well be worth a look! I’m still not 100% sure what kind of performance impact these probes have on an application, but I will wait for the probes to be integrated into a mainstream Java build prior to doing any real performance testing. Thanks to Keith for answering my e-mails, and to team DTrace for creating the awesomeness that is DTrace!