Resource controls against fork bombs executed inside Solaris Zones

I came across this neat little tidbit on page 27 while reading through the PDF article UNDERSTANDING THE SECURITY CAPABILITIES OF SOLARIS™ ZONES SOFTWARE.

As a test, I’m going to set this resource control on a zone and execute a fork bomb to see what appears in the system logs. This is pretty cool stuff!

Miscellaneous Controls
One well-known method to over-consume system resources is a fork-bomb. This method does not necessarily consume a great deal of memory or CPU resources, but rather seeks to use up all of the process slots in the kernel’s process table. In the Solaris OS, a running process starts with just one thread of execution, also called a Light Weight Process (LWP). Many programs generate new threads, becoming multithreaded processes. By default, Solaris systems with a 64-bit kernel can run over 85,000 LWPs simultaneously. A booted zone that is not yet running any applications has approximately 100 to 150 LWPs. To prevent a zone from using too many LWPs, a limit can be set on their use. The following command sets a limit of 300 LWPs for a zone.
global# zonecfg -z web
zonecfg:web> set max-lwps=300
zonecfg:web> exit
global# zoneadm -z web reboot
 
 
This parameter can be used, but should not be set so low that it impacts normal application operation. An accurate baseline for the number of LWPs for a given zone should be determined in order to set this value at an appropriate level. The number of LWPs used by a zone can be monitored using the following prstat command.

global# prstat -LZ
[…]
ZONEID NLWP SWAP RSS MEMORY TIME CPU ZONE
0 248 468M 521M 8.6% 0:14:18 0.0% global
37 108 76M 61M 1.0% 0:00:00 0.0% web
Total: 122 processes, 356 lwps, load averages: 0.00, 0.00, 0.01

In this example, the web zone currently has 108 LWPs. This value changes as processes are created or exit. It should be inspected over a period of time in order to establish a more reliable baseline, and updated when the software, requirements, or workload change.
Using the max-lwps resource control successfully usually requires the use of a CPU control, such as the FSS or pools, to ensure that there is enough CPU power in the global zone for the platform administrator to fix any problems that might arise.
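
To turn the prstat -LZ zone summary into a number I can track against the limit, something like the following could work. This is only a minimal sketch: lwp_headroom is a name I made up, it assumes the zone summary columns look exactly like the output shown in this post (NLWP in the second column, zone name last), and the limit is whatever value was given to max-lwps.

```shell
# lwp_headroom ZONE LIMIT
# Reads `prstat -LZ` output on stdin and prints "ZONE NLWP REMAINING".
# Assumes the zone summary layout shown in this post:
#   ZONEID NLWP SWAP RSS MEMORY TIME CPU ZONE
# Returns non-zero if the zone is not found in the summary.
lwp_headroom() {
    awk -v zone="$1" -v limit="$2" '
        $1 == "ZONEID" { in_summary = 1; next }   # header starts the zone table
        $1 == "Total:" { in_summary = 0 }          # footer ends it
        in_summary && $NF == zone {
            print zone, $2, limit - $2
            found = 1
        }
        END { if (!found) exit 1 }
    '
}
```

Feeding it a single sample with prstat -LZ 1 1 | lwp_headroom web 300 should print the zone name, its current LWP count, and how many LWPs are left before the limit. For the sample output in this post that would be web 108 192. Collecting that line at intervals is one way to build the baseline the article recommends.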

Viewing Solaris zone resource utilization

Solaris zones have been around for quite some time now, and provide low-overhead execution environments for running applications. Admins who need to understand how zones are utilizing CPU and memory resources typically turn to prstat, which provides the “-Z” option to view utilization by zone:

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
448 foo 165M 141M sleep 59 0 21:47:34 37% java/115
7713 matty 2292K 1740K cpu0 59 0 0:00:00 0.0% prstat/1
233 snmp 7348K 5336K sleep 59 0 0:00:46 0.0% snmpd/1
316 daemon 2100K 1344K sleep 60 -20 0:00:00 0.0% lockd/2
309 root 1752K 944K sleep 59 0 0:00:00 0.0% sac/1
492 root 8688K 7740K sleep 29 0 0:00:01 0.0% svc.startd/11
459 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
129 root 2192K 1328K sleep 29 0 0:00:00 0.0% syseventd/14
123 root 2964K 2024K sleep 29 0 0:00:00 0.0% picld/5
202 root 2432K 976K sleep 59 0 0:00:00 0.0% cron/1
402 root 3520K 1328K sleep 59 0 0:00:00 0.0% sshd/1
136 root 6552K 3772K sleep 29 0 0:00:00 0.0% devfsadm/9
303 daemon 2432K 972K sleep 59 0 0:00:00 0.0% rpcbind/1
140 root 4376K 3120K sleep 59 0 0:00:02 0.0% nscd/24
137 daemon 3892K 2036K sleep 29 0 0:00:00 0.0% kcfd/3
236 nobody 3324K 1044K sleep 59 0 0:00:03 0.0% nrpe/1
827 root 9176K 8144K sleep 59 0 0:00:03 0.0% svc.configd/14
823 root 2160K 1248K sleep 59 0 0:00:00 0.0% init/1
421 smmsp 7072K 1492K sleep 59 0 0:00:00 0.0% sendmail/1
9 root 9516K 8504K sleep 29 0 0:00:03 0.0% svc.configd/15
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
0 35 213M 215M 1.3% 21:49:10 37% global
3 21 43M 50M 0.3% 0:00:15 0.0% zone1
1 21 43M 50M 0.3% 0:00:15 0.0% zone2
2 21 42M 50M 0.3% 0:00:16 0.0% zone3
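
When only the per-zone totals matter, the process rows can be filtered out of the prstat -Z output. Here is a small sketch, assuming the summary layout shown above (eight columns, zone name last). The helper name zone_usage is hypothetical and not part of prstat.

```shell
# zone_usage: reads `prstat -Z` output on stdin and prints one line per
# zone as "ZONE NPROC RSS CPU", dropping the per-process rows.
zone_usage() {
    awk '
        $1 == "ZONEID" { in_summary = 1; next }   # header starts the zone table
        $1 == "Total:" { in_summary = 0 }          # footer (if present) ends it
        in_summary && NF == 8 { print $8, $2, $4, $7 }
    '
}
```

A one-shot sample such as prstat -Z 1 1 | zone_usage would then give a compact line per zone that is easy to log or graph.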

While reviewing the Solaris zones mailing list last night, I noticed that Jeff Victor posted a link to a Perl script that can provide utilization data for zones. This script has a TON of potential, especially once it is able to report on network and disk utilization.

Cleaning up failed package installations

While attempting to install a Sun package this week, I encountered the following error:

$ pkgadd -d . MYpackage

## Waiting for up to <300> seconds for package administration commands to become available (another user is administering packages on zone <zoneA>)

^C

1 package was not processed!

After a bit of truss’ing, I noticed that the pkgadd commands were checking for the existence of files with the name .ai.pkg.zone.lock.<DYNAMICALLY_GENERATED_STRING> in /tmp. Based on a cursory inspection of the package utility source code, it appears these files are used as lock files to prevent multiple package commands from running at the same time. Since this was the only package installation running on the system, I logged into the zone and removed the stale lock file:

$ zlogin zoneA

$ rm /tmp/.ai.pkg.zone.lock-afdb66cf-1dd1-11b2-a049-000d560ddc3e

Once I removed this file, the package installed like a champ! Nice!
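
Before deleting a lock file like this, it is worth confirming that no package command is actually running. The following is a rough sketch of that check, not something taken from the package utilities themselves: the function names are my own, the glob is based only on the lock file name I saw in /tmp, and it assumes pgrep is available inside the zone.

```shell
# list_pkg_locks [DIR]
# Prints any zone package lock files found in DIR (default /tmp).
# The name pattern is an assumption based on the lock file seen in this post.
list_pkg_locks() {
    dir=${1:-/tmp}
    for f in "$dir"/.ai.pkg.zone.lock*; do
        # An unmatched glob is left as a literal string, so test for existence.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# pkg_tools_running: succeeds if pkgadd or pkgrm still appears to be running.
pkg_tools_running() {
    pgrep -x pkgadd >/dev/null 2>&1 || pgrep -x pkgrm >/dev/null 2>&1
}
```

Inside the zone, running pkg_tools_running || list_pkg_locks prints the candidate lock files only when neither command is active, and those files can then be reviewed and removed by hand.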

Zone update on attach functionality

If you’ve used the zone migration features (e.g., attach and detach) in Solaris, you may have bumped into issues when you tried to migrate a zone from one machine to another, and the servers didn’t have the same set of patches or packages installed. With Jerry’s putback of PSARC 2007/621 into OpenSolaris, this should be a thing of the past. Here are the bugs that were addressed by PSARC 2007/621:

PSARC 2007/621 zone update on attach
6480464 RFE: zoneadm attach should patch/update the zone to the new hosts level
6576592 RFE: zoneadm detach/attach should work between sun4u and sun4v architecture
6637869 zone attach doesn’t handle obsolete patches correctly

Thanks Jerry for this super super useful feature!
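
For reference, here is roughly how I would expect the target-host side of a migration to look once this feature is available. Treat it as a sketch under assumptions: it assumes the update is requested with the -u option to zoneadm attach, that the zonepath has already been copied to the new host, and the run and attach_with_update helpers are names I invented. By default it only prints the commands so they can be reviewed first.

```shell
# run CMD...: prints the command when DRY_RUN is 1 (the default),
# otherwise executes it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# attach_with_update ZONE ZONEPATH
# Configures a detached zone from its zonepath, attaches it while asking
# zoneadm to update it to the new host's level, then boots it.
attach_with_update() {
    zone=$1
    zonepath=$2
    run zonecfg -z "$zone" create -a "$zonepath" &&
    run zoneadm -z "$zone" attach -u &&
    run zoneadm -z "$zone" boot
}
```

Calling attach_with_update web /zones/web prints the three commands in order. Setting DRY_RUN=0 would actually run them, which should only be done on a host where the zone has been detached and moved.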

Brandz support for Solaris 8 and Linux 2.6 kernels

I was pleasantly surprised to find out this week that the brandz framework is being extended to support Linux 2.6 kernels, as well as binaries that were built to run on Solaris 8 hosts! This has lots and lots of potential, and would be a blessing for one of my previous employers (they have a lot of Solaris 8 hosts). I would like to send props out to the folks who are making this happen. ;) Niiiiiiiice!

Fixing a broken Solaris zone

I applied the latest set of patches to my x86 Solaris 10 server this morning, and after the server rebooted I noticed that my zones didn’t start. When I ran the zoneadm utility with the “list” option, all of the zones were in the “installed” state (they should have been in the “running” state, since the autoboot variable was set to true):

$ zoneadm list -vc

  ID NAME             STATUS         PATH                          
   0 global           running        /                             
   - z1-t             installed      /zones/z1-t               
   - z2-d             installed      /zones/z2-d               
   - z3-p             installed      /zones/z3-p             

At first I thought the zones service might be in the maintenance state, but after reviewing the output from the svcs command, that theory turned out to be incorrect:

$ svcs -a | grep zones

online          8:39:22 svc:/system/zones:default

Since the box hosted several critical services, I decided to start the zones by hand and perform postmortem analysis once they were back up and operational. When I ran zoneadm with the “boot” option and the name of the zone to boot, I was greeted with the following error:

$ zoneadm -z dns boot

zoneadm: zone 'dns': Failed to initialize privileges: No such file or directory
zoneadm: zone 'dns': call to zoneadmd failed

Oh good grief! After reviewing my notes, I noticed that I had applied patch 122663-06 (a libzonecfg patch) as part of the patch bundle. Configurable zone privileges are coming as part of Solaris 10 update 3, and it looks like they prematurely made their way into a Solaris 10 patch. Since I have not had a chance to play with configurable privileges, I decided to create a new zone to see if zonecfg worked OK, and also to see if configurable privileges required additional attributes:

$ zonecfg -z test

test: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:test> create
zonecfg:test> info
zonepath: 
autoboot: false
pool: 
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
zonecfg:test> set zonepath=/zones/test
zonecfg:test> commit
ld.so.1: zonecfg: fatal: relocation error: file /usr/sbin/zonecfg: symbol zonecfg_add_index: referenced symbol not found
Killed
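
The relocation error says that /usr/sbin/zonecfg references a symbol (zonecfg_add_index) that the runtime linker could not find in any library it loaded, which points at a mismatch between the binary and the installed libzonecfg. When a log contains several such messages, a small filter can pull out the binary and symbol pairs. This is just a sketch that assumes the message wording matches the error shown above, and missing_symbol is a made-up name.

```shell
# missing_symbol: reads runtime linker errors on stdin and prints
# "FILE SYMBOL" for each unresolved symbol reported.
# Assumes the ld.so.1 message format seen in this post.
missing_symbol() {
    sed -n 's/.*relocation error: file \([^:]*\): symbol \([^:]*\): referenced symbol not found.*/\1 \2/p'
}
```

On the affected host, ldd -r /usr/sbin/zonecfg should also report the unresolved reference directly, without having to trigger the failure through a commit.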

Well that isn’t good, and the output from the info command doesn’t seem to indicate that new attributes were added. I needed to get the box up and running, so I decided to try to back out patch 122663-06. When I ran patchrm to remove the patch, it bombed out since it wasn’t able to start the zones:

$ patchrm 122663-06

Validating patches...

Loading patches installed on the system...

Done!

Checking patches that you specified for removal.

Done!

Approved patches will be removed in this order:

122663-06 
Preparing checklist for non-global zone check...

Checking non-global zones...

Booting non-global zone dns for patch check...
 ERROR: unable to boot zone: problem running  on zone : error 1
zoneadm: zone 'dns': Failed to initialize privileges: No such file or directory
zoneadm: zone 'dns': call to zoneadmd failed

Can not boot non-global zone dns

Gak! Once I realized that backing out the patch with patchrm wouldn’t work, I decided to back up the libzonecfg shared library that 122663-06 had installed, and copy the previous version over it. To find the previous version, I used the find command in the /var/sadm directory:

$ cd /var/sadm

$ find . -name 122663-06

./pkg/SUNWcsr/save/pspool/SUNWcsr/save/122663-06
./pkg/SUNWcsr/save/122663-06
./pkg/SUNWzoneu/save/pspool/SUNWzoneu/save/122663-06
./pkg/SUNWzoneu/save/122663-06
./patch/122663-06

After I located the patch directories, I looked for the file named undo.Z. The undo.Z file contains a backup of each file the patch overwrites, and is used by the patchrm utility to restore a server to its previous state. To find the right undo.Z file, I looked in the pkg/SUNWzoneu directories, and then used the ls “-i” (print inode) and “-l” (long output) options to print the inode number and size of each undo.Z file I found:

$ ls -li ./pkg/SUNWzoneu/save/pspool/SUNWzoneu/save/122663-06/*.Z ./pkg/SUNWzoneu/save/122663-06/*.Z

    102041 -rw-r--r--   1 root     root      158534 Oct 29 08:56 ./pkg/SUNWzoneu/save/122663-06/undo.Z
    101589 -rw-r--r--   1 root     root      158534 Oct 29 08:56 ./pkg/SUNWzoneu/save/pspool/SUNWzoneu/save/122663-06/undo.Z
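
The different inode numbers show that these are two separate files rather than hard links to one file. Matching sizes and timestamps strongly suggest they are copies, but a byte-for-byte comparison would confirm it. Here is a minimal sketch using cmp (same_content is a hypothetical helper name):

```shell
# same_content FILE1 FILE2
# Succeeds only if both are regular files and are byte-for-byte identical.
same_content() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}
```

From /var/sadm, something like same_content ./pkg/SUNWzoneu/save/122663-06/undo.Z ./pkg/SUNWzoneu/save/pspool/SUNWzoneu/save/122663-06/undo.Z && echo identical would settle the question before picking one to extract.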

The size and timestamps on the two files were identical. (As a side note, I am curious why Sun keeps two copies of the undo.Z file. If anyone knows, please add your thoughts to the comment section.) I copied one of the files to /tmp, and used the uncompress and pkgadd utilities to extract it to /tmp/u:

$ cp ./pkg/SUNWzoneu/save/122663-06/undo.Z /tmp

$ cd /tmp

$ uncompress undo.Z

$ pkgadd -s /tmp/u -d undo

The following packages are available:
  1  SUNWzoneu     Solaris Zones (Usr)
                   (i386) 11.10.0,REV=2005.01.21.16.34

Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]: 
Transferring  package instance

Once the packages were extracted to /tmp/u, I started to poke around to see which files were included in the package:

$ cd /tmp/u/*

$ find .

.
./pkginfo
./pkgmap
./install
./install/checkinstall
./install/postinstall
./reloc
./reloc/usr
./reloc/usr/lib
./reloc/usr/lib/amd64
./reloc/usr/lib/amd64/libzonecfg.so.1
./reloc/usr/lib/libzonecfg.so.1
./reloc/usr/share
./reloc/usr/share/lib
./reloc/usr/share/lib/xml
./reloc/usr/share/lib/xml/dtd
./reloc/usr/share/lib/xml/dtd/zonecfg.dtd.1

Since the new version of libzonecfg.so was most likely the cause of my problems, I backed up the shared libraries the patch had installed in /usr/lib and /usr/lib/amd64, and then replaced these with the versions I had extracted to /tmp/u:

$ cd /usr/lib

$ pwd
/usr/lib

$ cp libzonecfg.so.1 libzonecfg.so.1.orig

$ cp /tmp/u/SUNWzoneu/reloc/usr/lib/libzonecfg.so.1 .

$ cd /usr/lib/amd64

$ pwd
/usr/lib/amd64

$ cp libzonecfg.so.1 libzonecfg.so.1.orig

$ cp /tmp/u/SUNWzoneu/reloc/usr/lib/amd64/libzonecfg.so.1 .
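
Since the same back-up-then-replace step is repeated for each library, it could be wrapped in a small function. The sketch below is not what I ran. It is a slightly more cautious variant: it refuses to clobber an existing .orig backup, and it stages the replacement next to the target and renames it into place instead of copying over a library that running processes may have mapped. The name replace_with_backup is my own.

```shell
# replace_with_backup TARGET REPLACEMENT
# Saves TARGET as TARGET.orig (never overwriting an existing backup),
# stages REPLACEMENT as TARGET.new, then renames it over TARGET.
# cp -p preserves modes and timestamps.
replace_with_backup() {
    target=$1
    replacement=$2
    [ -f "$target" ] && [ -f "$replacement" ] || return 1
    if [ -e "$target.orig" ]; then
        echo "backup $target.orig already exists, refusing to overwrite" >&2
        return 1
    fi
    cp -p "$target" "$target.orig" &&
    cp -p "$replacement" "$target.new" &&
    mv -f "$target.new" "$target"
}
```

For the 32-bit library, that would be replace_with_backup /usr/lib/libzonecfg.so.1 /tmp/u/SUNWzoneu/reloc/usr/lib/libzonecfg.so.1, and the same call with the 64-bit paths for the other copy.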

Once the old version of libzonecfg was in place, I was able to boot my zones without issue:

$ zoneadm -z dns boot

This experience once again leads me to wonder if Sun actually tests patches prior to sending them out to the public (this is my second bad experience in as many months). Now to schedule another downtime to properly back out the patch. :(