Xorg: Debugging for Dummies

This minihowto attempts to explain how to debug the X server, particularly in the case where the server crashes. It assumes a basic familiarity with Unix and a willingness to risk deadlocking the machine. Just as a warning: if you try this with a closed-source driver, the output is not likely to be very useful.

0x00: Prerequisites

You'll really want to have a second machine around. It's impossible to debug the X server from within itself; when it stops and returns control to the debugger, you won't be able to send events to the xterm running your debugger. ssh is your friend here. If you don't have a second machine, see section 0x0A, "Debugging With One Machine".

It really helps if you enable debugging symbols at build time. In host.def:

    #define DefaultGcc2i386Opt -O2 -g -march=pentium3 -pipe

Or whatever flags suit your CPU; the -g is the important part. Much like -O, -g can take an argument specifying the amount of debugging info to emit. Valid values are -g0 through -g3, with -g2 being the default. Usually you won't need -g3 (it additionally records preprocessor macro definitions), but it doesn't hurt. The higher the optimization level, the harder the server is to debug, so you may want to make a dedicated debugging build with -O0 -g. DO NOT build with -fomit-frame-pointer; it will produce an undebuggable server.

If making a debugging build, you can install it under its own tree by adding

    #define ProjectRoot /opt/xorg-debug
    #define NothingOutsideProjectRoot YES

to host.def. You can then switch among X builds by moving /usr/X11R6 to /usr/X11R6-release and symlinking /usr/X11R6 to the one you want. Alternatively, just put ProjectRoot/lib in $LD_LIBRARY_PATH and ProjectRoot/bin in $PATH. (For the rest of this howto, I'll assume you installed to /opt/xorg-debug.)

You will probably also want to use the dlloader instead of the elfloader for this debug build.
The magic host.def spell for that is:

    #define MakeDllModules YES

The reason is that not all versions of gdb understand modules loaded by the elfloader, while every version of gdb understands dlloader modules, which are ordinary shared objects loaded with dlopen(). Your gdb needs to be reasonably recent; 5.3 or later is probably good enough.

Finally, you'll need a reproducible way of crashing the X server, but if you've read this far you've probably got that already. This is your testcase.

0x01: The Basics

Start the server normally. Go over to your second machine and ssh into the first one. su to root, and type:

    gdb /opt/xorg-debug/bin/Xorg `pidof Xorg`

gdb will attach to the running server and spin for a while reading in symbols from all the drivers. Eventually you'll reach a "(gdb)" prompt. Notice that the X server has halted; type "cont" at the gdb prompt to let it continue executing.

Go back to the machine running X and run your testcase. This time, instead of the server crashing, it should freeze, and gdb should tell you the server got a signal (usually SIGSEGV), as well as the function and line of code where the problem happened. An example looks like:

    Program received signal SIGSEGV, Segmentation fault.
    0x403245a3 in fbBlt (srcLine=0xc1a1c180, srcStride=59742, srcX=0,
        dstLine=0x4240cb6c, dstStride=1152, dstX=0, width=32960, height=764,
        alu=-1046602744, pm=1111538028, bpp=32, reverse=0, upsidedown=0)
        at fbblt.c:174
    174             *dst++ = FbDoDestInvarientMergeRop(*src++);

This by itself is pretty helpful, but there's more information to be had. At the gdb prompt, type "bt f" for a full stack backtrace. (Warning: this will be long!) This dumps out the full call chain of functions from main() on down, along with the arguments they were called with and the values of all local variables. Keep hitting Enter until you get back to the gdb prompt. Get your mouse out, copy all the output from "Program received..." on down, and paste it into a file on your second machine.
Type "detach" at the gdb prompt to detach gdb from the server and let it finish crashing. Go to bugs.freedesktop.org and file a new bug describing the testcase. Attach the gdb output to the bug (please don't just paste it into the comments section).

0x02: All The gdb Commands You'll Ever Need

For any gdb command, you can say "help <command>" at the (gdb) prompt to get a (hopefully informative) explanation.

- bt
  Prints a stack backtrace. This shows all the functions that you are currently inside, from main() on down to the point of the crash, along with their arguments. Appending the word "full" (or just the letter "f") also prints the values of all the local variables within each function.

- break / clear
  break sets a breakpoint. When execution reaches a breakpoint, the debugger stops the program and returns you to the gdb prompt. You can set breakpoints on functions, lines of code, or individual instructions; see the help text for details. clear, naturally, clears a breakpoint.

- step / next
  step and next let you manually advance the program's execution. next runs the program until it reaches a different source line; step does the same thing, but also descends into called functions.

- print
  Prints the value of an expression. You can specify variable names, registers, and absolute addresses, as well as more complex expressions ("help print" for details). Variable names have to be resolvable, which means they must be either local variables within the current stack frame or global variables. Register names start with a $ sign, like "print $eax". Addresses are plain numbers; to see what is stored at one, cast it to a pointer type and dereference it, like "print *(int *)0xdeadbeef" ("print 0xdeadbeef" on its own just prints the number). Expressions can be fairly complex. For example, if you have a pointer to a structure named "foo", "print foo" will print the memory address that foo points to, "print *foo" will print the structure being pointed to, and "print foo->bar" will print the bar member of that structure.

- handle
  Tells the debugger how to handle various signals.
  The defaults are mostly sensible, but there are two you may wish to change. SIGPIPE is generated when the server writes to a client that has died, which you may not always care about, and SIGUSR1 is generated on VT switch. By default, the debugger halts the running process when it receives these signals; to change this, say "handle SIGPIPE nostop" (and likewise "handle SIGUSR1 nostop").

- set environment
  Sets environment variables for the program that "run" will start. The syntax is "set environment name value" (gdb also accepts "set environment name=value").

- run
  Runs the program. If you only specify a program name on the gdb command line (and not a process ID or a core file), gdb will load the program but not start running it until you say so. Arguments to "run" are passed verbatim to the child process, e.g. "run :0 -verbose -ac".

- kill
  Kills the program being debugged. Not always useful; you'd often rather say...

- detach
  ...which detaches the debugger from the running program, so that it can shut down gracefully.

- disassemble
  Prints the assembly instructions of the function containing the current execution point. You can also give an address or a function name to disassemble somewhere other than the default. Only useful if you can read the assembly language of your CPU.

0x03: Things That Can Go Wrong

The biggest thing to watch out for is attempting to print memory contents when that memory is located on the video card. It won't work, on x86 anyway, for some not-very-interesting reasons. You'll know when you've done it, because the machine will deadlock.

Starting the server under gdb doesn't always work, and I haven't figured out why yet; it's probably a gdb bug.

When you compile with optimization, the values printed by bt can sometimes be confusing. Some variables can get optimized out of existence, some variables occupy the same position on the stack during different parts of a function's execution, and some functions might not show up on the stack at all.
Also, single-stepping can be confusing, because statements may execute in a different order than they are listed in the source if the compiler determines that's safe to do.

0x0A: Debugging With One Machine

If you only have one machine available, you might be able to pry some useful information out of the server when it crashes. The downside is that it will probably halt your machine entirely rather than just crashing X.

Edit your xorg.conf file and find the ServerFlags section. Uncomment the Option "NoTrapSignals" line (or add it if it doesn't exist). This prevents the server from catching fatal signals, so they should produce core dumps instead. (You need to make sure core dumps are enabled for the server by removing the core file size limit, typically with "ulimit -c unlimited" in the shell that starts the server; see the 'ulimit' command in the bash man page for details.)

The problem here is the same as mentioned earlier: the core dump will attempt to include mmap()'d sections of card memory, which will make the machine freeze. Usually, though, the core dump is still informative enough to give at least a partial backtrace. Once you've crashed the machine, find the core file and load it in gdb:

    gdb `which Xorg` /path/to/core/file

and try "bt f" as usual. Fortunately, at this point you can't make the machine crash again.