Root causing a bug in the Solaris display_stack_info() function


I recently came across the Apache backtrace module, and thought it would be fun to get the module working with Solaris since it only supported Linux and FreeBSD. Since Solaris provides the printstack() function to view stack frames up to a specific point in time, I thought I would be able to add the following #ifdef/#endif in mod-backtrace’s exception hook, and all would be swell:

#if defined(__linux__) || defined(__FreeBSD__)
size = backtrace(array, sizeof array / sizeof array[0]);
backtrace_symbols_fd(array, size, logfd);
#elif defined(__SVR4) && defined(__sun)
printstack(logfd);
#endif

After adding the code and building the module, I quickly learned that the Solaris libc display_stack_info() function was causing a segmentation violation each time I sent a SIGSEGV to one of the child processes (see the irony here?):

$ gdb bin/httpd core

(gdb) where

#0 0xfef68814 in display_stack_info () from /lib/libc.so.1
#1 0xfef683f8 in walkcontext () from /lib/libc.so.1
#2 0xfef68950 in printstack () from /lib/libc.so.1
#3 0xfea309dc in bt_exception_hook (ei=0xffbff2e8) at mod-backtrace.c:149
#4 0x00039e7c in ap_run_fatal_exception (ei=0xffbff2e8) at mpm_common.c:1007
#5 0x00039f60 in sig_coredump (sig=11) at mpm_common.c:1019
#6
#7 0xfefc0318 in _so_accept () from /lib/libc.so.1
#8 0xff2c2c58 in apr_socket_accept (new=0xffbff7ec, sock=0x77ea8, connection_context=0x12c5b8) at sockets.c:164
#9 0x0004924c in unixd_accept (accepted=0xffbff864, lr=0x12a630, ptrans=0x12c5b8) at unixd.c:468
#10 0x0002c1f8 in child_main (child_num_arg=449536) at prefork.c:591
#11 0x0002c4c0 in make_child (s=0x2bc00, slot=0) at prefork.c:704
#12 0x0002c590 in startup_children (number_to_start=5) at prefork.c:722
#13 0x0002cd18 in ap_mpm_run (_pconf=0x79b00, plog=0x6dc00, s=0x5) at prefork.c:941
#14 0x00032bfc in main (argc=3, argv=0xffbffb94) at main.c:618

Since this issue was preventing my patch from working, I snagged Clay, and we started poking around the display_stack_info code to see what was wrong. The first item we checked was the instruction that cause the program to bomb:

(gdb) **info reg pc**

pc 0xfef68814 0xfef68814

(gdb) **x/i 0xfef68814**

0xfef68814 : ld [ %i4 + 8 ], %i3

(gdb) **disassemble 0xfef68800 0xfef68818**

Dump of assembler code from 0xfef68800 to 0xfef68818:
0xfef68800 : add %fp, -36, %o4
0xfef68804 : ret
0xfef68808 : restore %g0, 0, %o0
0xfef6880c : ld [ %fp + -56 ], %i4
0xfef68810 : ld [ %fp + -40 ], %g1
0xfef68814 : ld [ %i4 + 8 ], %i3
End of assembler dump.

The program counter (%pc) indicates that the instruction at address 0xfef68814 is to blame. The instruction at address 0xfef68814 is attempting to add 8-bytes to the address stored in register %i4 (which is loaded two instructions earlier from %fp - 56), and load the value at that address into the register %i3. To see which address was stored in register %i4, we ran the ‘info reg’ command:

(gdb) **info reg $i4**
i4 0x0 0

Well that’s not good. The load instruction is attempting to load a value form address 0x0 + 8 into %i3. That address is clearly not mapped into our address space, and is causing Apache to incur a segmentation violation. To find the value that should be in %i4, we worked backwards in the disassemble output until we came to the following set of instructions:

(gdb) **disassemble display_stack_info**

0xfef687a0 : ld [ %l7 + 0x774 ], %o1
0xfef687a4 : add %fp, -56, %o2
0xfef687a8 : ld [ %fp + 0x44 ], %o0
0xfef687ac : add %fp, -52, %o1
0xfef687b0 : call 0xfefea340

The instruction at 0xfef687a4 also attempts to load the address at %fp - 56 into a register (%o2 in this case). Since dladdr1() is called two instructions later, and since output registers are used to pass arguments to functions, we decided to check the opensolaris source repository to see which variable was passed as the third argument to dladdr1() in display_stack_info():

473 Sym *sym;

482 if (dladdr1((void *) pc, &info, (void**) &sym, RTLD_DL_SYMENT) == 0) {
484 if (signo == 0)
485 async_filenoprintf(filenum, "0x%xn", pc);
486 else
487 async_filenoprintf(filenum,
488 "0x%x [ Signal %d (%s)]n", pc,
489 (ulong_t)signo, sigbuf);
490
491 } else if ((pc - (unsigned long)info.dli_saddr) <
492 sym->st_size) {

According to the source code (isn’t it awesome that the Solaris source code is available?), it looks like sym is the third argument passed to dladdr1(). Since sym was not assigned a value in dladdr1(), the else if statement that attempts to dereference sym just after the call to dladdr1() was causing the segmentation violation:

491 } else if ((pc - (unsigned long)info.dli_saddr) <
492 sym->st_size) {

I chatted with Clay regarding this issue, and he was able to track the bug to the _dladdr1() function. Clay filed CR #6372620 with a test case and fix, and we are hopeful that this will be integrated into Solaris in the coming months. It can be somewhat overwhleming to dig through the mounds of code on opensolaris.org, but hopefully this will become easier with time. Hopefully Clay will post a BLOG entry with more info on the _dladdr1() bug (this is actually a neat bug, and hits one of those interesting edge conditions that are hard to think up).

This article was posted by Matty on 2006-01-14 23:35:00 -0400 -0400