Getting an accurate view of process memory usage on Linux hosts

Having debugged a number of memory-related issues on Linux, one thing I’ve always wanted was a tool to display proportional memory usage. Specifically, I wanted to be able to see how much memory was unique to a process, and have an equal portion of shared memory (libraries, shared memory segments, etc.) added to this value. My wish came true a while back when I discovered the smem utility. When run, smem will give you the resident set size (RSS), the unique set size (USS), and the proportional set size (PSS), which is the unique set size plus a proportional share of the shared memory the process is using. This results in output similar to the following:

$ smem -r

  PID User     Command                         Swap      USS      PSS      RSS 
 3636 root     /usr/lib/virtualbox/Virtual        0  1151596  1153670  1165568 
 3678 matty    /usr/lib64/firefox-3.5.9/fi        0   189628   191483   203028 
 5779 root     /usr/bin/python /usr/bin/sm        0    38584    39114    40368 
 1847 root     /usr/bin/Xorg :0 -nr -verbo        0    34024    35874    92504 
 4103 matty    pidgin                             0    19364    21072    32412 
 3825 matty    gnome-terminal                     0    12388    13242    21992 
 3404 matty    python /usr/share/system-co        0    11596    12622    19216 
 3710 matty    gnome-screensaver                  0     9872    10287    14640 
 3283 matty    nautilus                           0     7104     8373    18484 
 3263 matty    gnome-panel                        0     5828     6731    15780 

To calculate each process’s proportional share of shared memory, you take the size of each shared resource (indexed by the type of shared resource, e.g. a library or a shared memory segment), divide it by the number of processes using those pages, and add the per-resource results to the process’s unique set size. This is a very cool utility, and one that gets installed on all of my systems now!
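
smem builds these numbers from /proc/<pid>/smaps, so you can sanity check the PSS value for a single process yourself. Here is a rough sketch that sums the Pss fields for the VirtualBox process shown above (this assumes a kernel new enough to expose Pss in smaps, and you will need to run it as root for processes you don’t own):

$ awk '/^Pss:/ { total += $2 } END { print total " kB" }' /proc/3636/smaps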

Why isn’t Oracle using huge pages on my Redhat Linux server?

I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:

$ top

top - 18:04:20 up 6 days,  3:22,  7 users,  load average: 14.25, 12.61, 14.41
Tasks: 536 total,   2 running, 533 sleeping,   0 stopped,   1 zombie
Cpu(s): 12.9%us, 19.2%sy,  0.0%ni, 20.9%id, 45.0%wa,  0.1%hi,  1.9%si,  0.0%st
Mem:  16373544k total, 16334112k used,    39432k free,     4916k buffers
Swap: 16777208k total,  2970156k used, 13807052k free,  5492216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                        
  491 root      10  -5     0    0    0 D 55.6  0.0  67:22.85 kswapd0                         
  492 root      10  -5     0    0    0 S 25.8  0.0  37:01.75 kswapd1                         
  494 root      11  -5     0    0    0 S 24.8  0.0  42:15.31 kswapd3                         
 8730 oracle    -2   0 8352m 3.5g 3.5g S  9.9 22.4 139:36.18 oracle                          
 8726 oracle    -2   0 8352m 3.5g 3.5g S  9.6 22.5 138:13.54 oracle                          
32643 oracle    15   0 8339m  97m  92m S  9.6  0.6   0:01.31 oracle                          
  493 root      11  -5     0    0    0 S  9.3  0.0  43:11.31 kswapd2                         
 8714 oracle    -2   0 8352m 3.5g 3.5g S  9.3 22.4 137:14.96 oracle                          
 8718 oracle    -2   0 8352m 3.5g 3.5g S  8.9 22.3 137:01.91 oracle                          
19398 oracle    15   0 8340m 547m 545m R  7.9  3.4   0:05.26 oracle                          
 8722 oracle    -2   0 8352m 3.5g 3.5g S  7.6 22.5 139:18.33 oracle              

The kswapd process is responsible for scanning memory to locate free pages, and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously appearing in the top output is a really, really bad thing. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match up with the values on the other nodes):

$ free

             total       used       free     shared    buffers     cached
Mem:      16373544   16268540     105004          0       1520    5465680
-/+ buffers/cache:   10801340    5572204
Swap:     16777208    2948684   13828524
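
I didn’t capture it here, but another quick way to confirm that a host is actively paging (and not just holding stale pages in swap) is to watch the si and so columns in vmstat; sustained non-zero values mean pages are being moved to and from the swap device:

$ vmstat 5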

To start debugging the issue, I first looked at ipcs to see how much shared memory the database had allocated. In the output below, we can see that there is a 128MB and an 8GB shared memory segment allocated:

$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x62e08f78 0          oracle    640        132120576  16                      
0xdd188948 32769      oracle    660        8592031744 87         

The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:

$ grep Huge /proc/meminfo

HugePages_Total:  4106
HugePages_Free:   4051
HugePages_Rsvd:      8
Hugepagesize:     2048 kB
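
To quantify how few of those pages were actually in use, you can multiply HugePages_Total minus HugePages_Free by the 2048 kB page size, which works out to roughly 110MB of the 8GB pool:

$ bc
(4106-4051)*2048
112640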

While our DBA had set vm.nr_hugepages to what appeared to be a sufficiently large value in /etc/sysctl.conf, the database was utilizing a very small portion of them. This meant that the database was being allocated out of non-huge page memory (Linux dedicates memory to the huge page area, and it is wasted if nothing utilizes it), and inactive pages were being paged out to disk since the database wasn’t utilizing the huge page area we reserved for it. After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory allocated to huge pages was 8610906112 bytes:

$ grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=4106

$ bc
4106*(1024*1024*2)
8610906112

If we add the totals from the two shared memory segments above:

$ bc
8592031744+132120576
8724152320
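
Dividing that total by the 2MB huge page size shows how many pages are actually needed to back both segments:

$ bc
8724152320/(1024*1024*2)
4160

Both segments together need 4160 huge pages, which is more than the 4106 pages that were configured.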

We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both segments, the system no longer swapped and database performance increased. This debugging adventure taught me a few things:

1. Double check system values people send you

2. Solaris does a MUCH better job of handling large page sizes (huge pages are used transparently)

3. The Linux tools for investigating huge page allocations are severely lacking

4. Oracle is able to allocate a contiguous 8GB chunk of shared memory on RHEL5, but not RHEL4 (I need to do some research to find out why)

Hopefully more work will go into the Linux huge page implementation, and allow me to scratch the second and third items off of my list. Viva la problem resolution!

Tuning network and vm settings on CentOS Linux servers with ktune

While poking around the CentOS package repository, I came across the ktune package. Ktune comes with a set of kernel tunables that are useful for network- and disk-intensive workloads, and provides the ktune service to apply these settings during system startup. Ktune includes settings for the TCP/IP buffers, an option to make the deadline scheduler the default I/O scheduler, and entries to adjust the swappiness, dirty_ratio and pagecache settings. The full list of tunables can be viewed by paging through the following two configuration files:

$ less /etc/sysctl.ktune

$ less /etc/sysconfig/ktune
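
If you just want to spot check a handful of tunables instead of paging through the whole file, grep works fine (swappiness and dirty_ratio are used as examples here):

$ egrep 'swappiness|dirty_ratio' /etc/sysctl.ktune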

To activate the settings, you can enable the ktune service with the chkconfig and service utilities:

$ chkconfig ktune on

$ service ktune start

Saving current sysctl settings:                            [  OK  ]
Applying ktune sysctl settings from /etc/sysctl.ktune:     [  OK  ]
Applying sysctl settings from /etc/sysctl.conf:            [  OK  ]
Applying deadline elevator: sda                            [  OK  ]
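
To verify that the deadline elevator actually took effect on a disk (sda in this case, to match the output above), you can check sysfs; the scheduler currently in use is displayed in brackets:

$ cat /sys/block/sda/queue/scheduler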

This is an awesome package, and I definitely plan to use the network settings on all of my CentOS hosts.

Monitoring Linux server performance with procallator

I manage a fair number of Linux hosts, and like to keep tabs on how my systems are performing. One way I accomplish this is with procallator, which is a Perl script that collects performance data that can be graphed by orca. The graphs that orca produces are great for trending server performance over time, and can be extremely valuable when debugging performance problems.

To set up procallator to collect performance data, you first need to retrieve the latest orca CVS snapshot from the orcaware snapshots directory (the procallator script is included with the orca snapshot, and the latest version contains a number of fixes). Once orca is downloaded, you will need to extract the tarball and run configure, which fills in the variables in the header of the procallator script:

$ tar xfj orca-snapshot-r529.tar.bz2

$ cd orca-snapshot-r529

$ ./configure --prefix=/opt/orca-r529 --with-html-dir=/opt/html

After the configure operation completes, you can install the procallator scripts with the Makefile’s install option:

$ make install

This will place the procallator perl script in $PREFIX/bin. To make sure the script starts at system boot, you can copy the $PREFIX/data_gathers/procallator/S99procallator script to /etc/rc3.d (or /etc/init.d depending on how you install your init scripts):

$ cp S99procallator /etc/rc3.d

Once these files are in place, you can start procallator by invoking the init script with the start option:

$ /etc/rc3.d/S99procallator start
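
To confirm the collector is running and writing data, you can check for the daemonized process and look for a fresh proccol file in whatever directory DEST_DIR points to (the path below is just a placeholder):

$ ps -ef | grep [p]rocallator

$ ls -l /path/to/DEST_DIR/proccol-*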

This will start the procallator script as a daemon process, and the script will write performance data every 5 minutes (this is tunable) to the directory defined in the procallator script’s DEST_DIR variable. The performance files will contain the name proccol-YYYY-MM-DD-INDEX, and one file will be produced each day. To graph the data in the procallator files, you can use orca and the procallator.cfg file that is in the $PREFIX/data_gathers/procallator directory. I placed a sample set of performance graphs on my website, and you can reference the monitoring LDAP performance article for details on setting up orca to graph data. I digs me some procallator!
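
For reference, the orca invocation itself is simple: orca just takes the configuration file as an argument (the paths below assume the configure options used earlier, and that orca was installed into the same prefix):

$ /opt/orca-r529/bin/orca /opt/orca-r529/data_gathers/procallator/procallator.cfg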

Speeding up Oracle disk I/O on RHEL4 systems

While poking around the web last week, I came across a good paper from Redhat that describes how to utilize asynchronous and direct I/O with Oracle. I have been using the Oracle filesystemio_options="SetAll" initialization parameter on a few RHEL4 database servers to efficiently use memory, and had no idea that it provided the throughput numbers listed in Figure 2. Nice!
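
If you want to double check what a running instance is actually using, you can query the parameter from SQL*Plus (this assumes you can connect as sysdba on the database host):

$ sqlplus -s "/ as sysdba" <<EOF
show parameter filesystemio_options
EOF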