I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:
$ top
top - 18:04:20 up 6 days, 3:22, 7 users, load average: 14.25, 12.61, 14.41
Tasks: 536 total, 2 running, 533 sleeping, 0 stopped, 1 zombie
Cpu(s): 12.9%us, 19.2%sy, 0.0%ni, 20.9%id, 45.0%wa, 0.1%hi, 1.9%si, 0.0%st
Mem: 16373544k total, 16334112k used, 39432k free, 4916k buffers
Swap: 16777208k total, 2970156k used, 13807052k free, 5492216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
491 root 10 -5 0 0 0 D 55.6 0.0 67:22.85 kswapd0
492 root 10 -5 0 0 0 S 25.8 0.0 37:01.75 kswapd1
494 root 11 -5 0 0 0 S 24.8 0.0 42:15.31 kswapd3
8730 oracle -2 0 8352m 3.5g 3.5g S 9.9 22.4 139:36.18 oracle
8726 oracle -2 0 8352m 3.5g 3.5g S 9.6 22.5 138:13.54 oracle
32643 oracle 15 0 8339m 97m 92m S 9.6 0.6 0:01.31 oracle
493 root 11 -5 0 0 0 S 9.3 0.0 43:11.31 kswapd2
8714 oracle -2 0 8352m 3.5g 3.5g S 9.3 22.4 137:14.96 oracle
8718 oracle -2 0 8352m 3.5g 3.5g S 8.9 22.3 137:01.91 oracle
19398 oracle 15 0 8340m 547m 545m R 7.9 3.4 0:05.26 oracle
8722 oracle -2 0 8352m 3.5g 3.5g S 7.6 22.5 139:18.33 oracle
The kswapd process is responsible for scanning memory to locate free pages, and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously near the top of the top output is a really, really bad thing. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match up with the values on the other nodes):
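One quick way to confirm that kswapd is actively reclaiming (rather than just sitting on accumulated CPU time) is to watch the page-scan counters in /proc/vmstat; a steadily growing delta means reclaim is ongoing. A small sketch (counter names vary slightly between kernel versions, but all of the kswapd scan counters start with pgscan_kswapd):

```shell
# Sample kswapd's page-scan counters twice, one second apart.
# A large, steadily growing delta means kswapd is actively
# hunting for pages to reclaim.
scan() { awk '/^pgscan_kswapd/ {sum += $2} END {print sum + 0}' /proc/vmstat; }
before=$(scan)
sleep 1
after=$(scan)
echo "pages scanned by kswapd in the last second: $((after - before))"
```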
$ free
total used free shared buffers cached
Mem: 16373544 16268540 105004 0 1520 5465680
-/+ buffers/cache: 10801340 5572204
Swap: 16777208 2948684 13828524
To start debugging the issue, I first looked at ipcs to see how much shared memory the database allocated. In the output below, we can see that there is a 126MB and an 8GB shared memory segment allocated:
$ ipcs -a
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x62e08f78 0 oracle 640 132120576 16
0xdd188948 32769 oracle 660 8592031744 87
The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:
$ grep Huge /proc/meminfo
HugePages_Total: 4106
HugePages_Free: 4051
HugePages_Rsvd: 8
Hugepagesize: 2048 kB
While our DBA had set vm.nr_hugepages to a seemingly sufficient value in /etc/sysctl.conf, the database was utilizing only a small portion of the huge pages. This meant that the database was being allocated out of regular page memory (Linux dedicates memory to the huge page area, and it is wasted if nothing utilizes it), and inactive pages were being paged out to disk since the database wasn’t utilizing the huge page area we reserved for it. After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory allocated to huge pages was 8610906112 bytes:
$ grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=4106
$ bc
4106*(1024*1024*2)
8610906112
If we add the totals from the two shared memory segments above:
$ bc
8592031744+132120576
8724152320
We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both instances, the system no longer swapped and database performance increased. This debugging adventure taught me a couple of things:
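To keep this from biting the remaining nodes, the minimum setting can be derived from the running segments themselves. A rough sketch (the awk column matches the bytes field of the Linux ipcs -m output, and a 2MB Hugepagesize is assumed):

```shell
# Sum the bytes column of every shared memory segment and round up
# to the number of 2MB huge pages needed to back them all.
page=$((2 * 1024 * 1024))
shm=$(ipcs -m | awk '$5 ~ /^[0-9]+$/ {sum += $5} END {print sum + 0}')
echo "vm.nr_hugepages should be at least $(( (shm + page - 1) / page ))"
```

For the two segments above this works out to 4160 pages, 54 more than the 4106 that were actually reserved.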
Hopefully more work will go into the Linux huge page implementation, and allow me to scratch the second and third items off of my list. Viva la problem resolution!