I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:
$ top
top - 18:04:20 up 6 days, 3:22, 7 users, load average: 14.25, 12.61, 14.41
Tasks: 536 total, 2 running, 533 sleeping, 0 stopped, 1 zombie
Cpu(s): 12.9%us, 19.2%sy, 0.0%ni, 20.9%id, 45.0%wa, 0.1%hi, 1.9%si, 0.0%st
Mem: 16373544k total, 16334112k used, 39432k free, 4916k buffers
Swap: 16777208k total, 2970156k used, 13807052k free, 5492216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
491 root 10 -5 0 0 0 D 55.6 0.0 67:22.85 kswapd0
492 root 10 -5 0 0 0 S 25.8 0.0 37:01.75 kswapd1
494 root 11 -5 0 0 0 S 24.8 0.0 42:15.31 kswapd3
8730 oracle -2 0 8352m 3.5g 3.5g S 9.9 22.4 139:36.18 oracle
8726 oracle -2 0 8352m 3.5g 3.5g S 9.6 22.5 138:13.54 oracle
32643 oracle 15 0 8339m 97m 92m S 9.6 0.6 0:01.31 oracle
493 root 11 -5 0 0 0 S 9.3 0.0 43:11.31 kswapd2
8714 oracle -2 0 8352m 3.5g 3.5g S 9.3 22.4 137:14.96 oracle
8718 oracle -2 0 8352m 3.5g 3.5g S 8.9 22.3 137:01.91 oracle
19398 oracle 15 0 8340m 547m 545m R 7.9 3.4 0:05.26 oracle
8722 oracle -2 0 8352m 3.5g 3.5g S 7.6 22.5 139:18.33 oracle
The kswapd process is responsible for scanning memory to locate free pages, and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously near the top of the top output is a really, really bad thing. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match up with the values on the other nodes):
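One quick way to confirm that kswapd is actively reclaiming (rather than just sitting on accumulated CPU time) is to watch the page-scan counters in /proc/vmstat; a steadily growing delta means reclaim is ongoing. A small sketch (counter names vary slightly between kernel versions, but all of the kswapd scan counters start with pgscan_kswapd):

```shell
# Sample kswapd's page-scan counters twice, one second apart.
# A large, steadily growing delta means kswapd is actively
# hunting for pages to reclaim.
scan() { awk '/^pgscan_kswapd/ {sum += $2} END {print sum + 0}' /proc/vmstat; }
before=$(scan)
sleep 1
after=$(scan)
echo "pages scanned by kswapd in the last second: $((after - before))"
```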
$ free
total used free shared buffers cached
Mem: 16373544 16268540 105004 0 1520 5465680
-/+ buffers/cache: 10801340 5572204
Swap: 16777208 2948684 13828524
To start debugging the issue, I first looked at ipcs to see how much shared memory the database allocated. In the output below, we can see that there is a 126MB and an 8GB shared memory segment allocated:
$ ipcs -a
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x62e08f78 0 oracle 640 132120576 16
0xdd188948 32769 oracle 660 8592031744 87
The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:
$ grep Huge /proc/meminfo
HugePages_Total: 4106
HugePages_Free: 4051
HugePages_Rsvd: 8
Hugepagesize: 2048 kB
While our DBA had set vm.nr_hugepages to a seemingly sufficient value in /etc/sysctl.conf, the database was utilizing only a small portion of the huge pages. This meant that the database was being allocated out of regular page memory (Linux dedicates memory to the huge page area, and it is wasted if nothing utilizes it), and inactive pages were being paged out to disk since the database wasn’t utilizing the huge page area we reserved for it. After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory allocated to huge pages was 8610906112 bytes:
$ grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=4106
$ bc
4106*(1024*1024*2)
8610906112
If we add the totals from the two shared memory segments above:
$ bc
8592031744+132120576
8724152320
We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both instances, the system no longer swapped and database performance increased. This debugging adventure taught me a couple of things:
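To keep this from biting the remaining nodes, the minimum setting can be derived from the running segments themselves. A rough sketch (the awk column matches the bytes field of the Linux ipcs -m output, and a 2MB Hugepagesize is assumed):

```shell
# Sum the bytes column of every shared memory segment and round up
# to the number of 2MB huge pages needed to back them all.
page=$((2 * 1024 * 1024))
shm=$(ipcs -m | awk '$5 ~ /^[0-9]+$/ {sum += $5} END {print sum + 0}')
echo "vm.nr_hugepages should be at least $(( (shm + page - 1) / page ))"
```

For the two segments above this works out to 4160 pages, 54 more than the 4106 that were actually reserved.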
Hopefully more work will go into the Linux huge page implementation, and allow me to scratch the second and third items off of my list. Viva la problem resolution!