Why isn’t Oracle using huge pages on my Red Hat Linux server?

I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:

$ top

top - 18:04:20 up 6 days,  3:22,  7 users,  load average: 14.25, 12.61, 14.41
Tasks: 536 total,   2 running, 533 sleeping,   0 stopped,   1 zombie
Cpu(s): 12.9%us, 19.2%sy,  0.0%ni, 20.9%id, 45.0%wa,  0.1%hi,  1.9%si,  0.0%st
Mem:  16373544k total, 16334112k used,    39432k free,     4916k buffers
Swap: 16777208k total,  2970156k used, 13807052k free,  5492216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                        
  491 root      10  -5     0    0    0 D 55.6  0.0  67:22.85 kswapd0                         
  492 root      10  -5     0    0    0 S 25.8  0.0  37:01.75 kswapd1                         
  494 root      11  -5     0    0    0 S 24.8  0.0  42:15.31 kswapd3                         
 8730 oracle    -2   0 8352m 3.5g 3.5g S  9.9 22.4 139:36.18 oracle                          
 8726 oracle    -2   0 8352m 3.5g 3.5g S  9.6 22.5 138:13.54 oracle                          
32643 oracle    15   0 8339m  97m  92m S  9.6  0.6   0:01.31 oracle                          
  493 root      11  -5     0    0    0 S  9.3  0.0  43:11.31 kswapd2                         
 8714 oracle    -2   0 8352m 3.5g 3.5g S  9.3 22.4 137:14.96 oracle                          
 8718 oracle    -2   0 8352m 3.5g 3.5g S  8.9 22.3 137:01.91 oracle                          
19398 oracle    15   0 8340m 547m 545m R  7.9  3.4   0:05.26 oracle                          
 8722 oracle    -2   0 8352m 3.5g 3.5g S  7.6 22.5 139:18.33 oracle              

The kswapd process is responsible for scanning memory to locate pages that can be freed, and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously near the top of the top output is a very bad sign. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match the values on the other nodes):

$ free

             total       used       free     shared    buffers     cached
Mem:      16373544   16268540     105004          0       1520    5465680
-/+ buffers/cache:   10801340    5572204
Swap:     16777208    2948684   13828524

To start debugging the issue, I first looked at ipcs to see how much shared memory the database allocated. In the output below, we can see two shared memory segments, one of roughly 128MB and one of roughly 8GB:

$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x62e08f78 0          oracle    640        132120576  16                      
0xdd188948 32769      oracle    660        8592031744 87         

The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:

$ grep Huge /proc/meminfo

HugePages_Total:  4106
HugePages_Free:   4051
HugePages_Rsvd:      8
Hugepagesize:     2048 kB
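The counters above are easier to read once they are turned into pages in use, pages reserved, and pages still uncommitted. The following is a minimal sketch, not a standard tool: hugepage_summary is a hypothetical helper name, and it assumes the usual meaning of the counters (HugePages_Free still includes pages that are reserved but not yet touched).

```shell
#!/bin/sh
# hugepage_summary: hypothetical helper, summarizes the HugePages_* counters.
# Reads the file named in $1, or /proc/meminfo by default ("-" means stdin).
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^HugePages_Rsvd:/  { rsvd  = $2 }
        END {
            in_use = total - free        # pages already faulted in
            uncommitted = free - rsvd    # pages nobody has claimed yet
            printf "total=%d in_use=%d reserved=%d uncommitted=%d\n",
                   total, in_use, rsvd, uncommitted
        }
    ' "${1:-/proc/meminfo}"
}

# Feed it the values shown above instead of the live file.
hugepage_summary - <<'EOF'
HugePages_Total:  4106
HugePages_Free:   4051
HugePages_Rsvd:      8
Hugepagesize:     2048 kB
EOF
# prints: total=4106 in_use=55 reserved=8 uncommitted=4043
```

With these numbers, 55 pages in use plus 8 reserved comes to 63 pages of 2048 kB, which is exactly 132120576 bytes, the size of the smaller segment in the ipcs output. That suggests only the ASM segment landed in the huge page pool.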

While our DBA had set vm.nr_hugepages to what looked like a sufficiently large value in /etc/sysctl.conf, the database was utilizing only a very small portion of the huge pages. This meant that the database was being allocated out of regular (non huge page) memory, and inactive pages were being paged out to disk because the huge page area we reserved sat idle (Linux dedicates memory to the huge page area, and it is wasted if nothing utilizes it). After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory allocated to huge pages was 8610906112 bytes:

$ grep vm.nr_hugepages /etc/sysctl.conf

$ bc
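The arithmetic behind that figure is simply the page count multiplied by the page size. As a rough sketch using plain shell arithmetic rather than bc (pool_bytes is a hypothetical helper name, and the inputs are the HugePages_Total and Hugepagesize values from /proc/meminfo):

```shell
#!/bin/sh
# pool_bytes: hypothetical helper, size of the huge page pool in bytes.
#   $1 = number of huge pages, $2 = huge page size in kB
pool_bytes() {
    echo $(( $1 * $2 * 1024 ))
}

pool_bytes 4106 2048
# prints: 8610906112
```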

If we add the totals from the two shared memory segments above:

$ bc
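The same sum can be sketched in shell, along with the number of huge pages the segments would need. This is an illustration under assumptions: pages_needed is a hypothetical helper name, and it rounds each segment up to whole huge pages on the assumption that every segment is backed by its own set of pages.

```shell
#!/bin/sh
# pages_needed: hypothetical helper.
#   $1 = huge page size in kB, remaining arguments = segment sizes in bytes
pages_needed() {
    page_bytes=$(( $1 * 1024 ))
    shift
    total=0
    pages=0
    for seg in "$@"; do
        total=$(( total + seg ))
        # round each segment up to a whole number of huge pages
        pages=$(( pages + (seg + page_bytes - 1) / page_bytes ))
    done
    echo "total_bytes=$total pages=$pages"
}

# Segment sizes from the ipcs output, 2048 kB huge pages.
pages_needed 2048 132120576 8592031744
# prints: total_bytes=8724152320 pages=4160
```

The two segments need 8724152320 bytes, or 4160 huge pages, while the pool holds 4106 pages (8610906112 bytes). The 8GB segment alone needs 4097 pages, so once the smaller segment has taken its 63 pages, the 4043 that remain are not enough for it.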

We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both the ASM instance and the database, the system no longer swapped and database performance improved. This debugging adventure taught me a couple of things:

1. Double check system values people send you

2. Solaris does a MUCH better job of handling large page sizes (huge pages are used transparently)

3. The Linux tools for investigating huge page allocations are severely lacking

4. Oracle is able to allocate a contiguous 8GB chunk of shared memory on RHEL5, but not RHEL4 (I need to do some research to find out why)

Hopefully more work will go into the Linux huge page implementation, and allow me to scratch the second and third items off of my list. Viva la problem resolution!