==> Regarding [Wlug] HOWTO debug hard lockups; Andy Stewart adds:
andystewart> Hi gang,
andystewart> My dual Opteron machine is not happy. I cannot get more than
andystewart> 7 straight days of uptime without getting a hard lock,
andystewart> requiring a reboot. (My definition of hard lock is: machine
andystewart> responds neither to keyboard input, mouse input, nor network
andystewart> pings).
andystewart> I can stimulate hard locks by running OpenOffice 1.1.3 (I had
andystewart> 3 tonight, and 3-4 on a previous occasion while running
andystewart> OpenOffice). It makes no sense to me that an application run
andystewart> as a normal user could lock up a machine.
andystewart> I've tried setting "nmi_watchdog=1" to see if I could get an
andystewart> "oops" when it hard locks - no dice. Do you know any other
andystewart> tricks I could try to see if it is the kernel which is locking
andystewart> up? I'm running SuSE's version of 2.6.8.
Did you verify that NMIs are being delivered? After boot, cat
/proc/interrupts a couple of times and make sure the counts on the NMI
line are non-zero and going up. Also note that, at least with upstream and
Red Hat kernels, nmi_watchdog defaults to 1 for Opterons (i.e. you
shouldn't need to set it manually).
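
If you would rather script the NMI check than eyeball it, here is a rough
sketch in Python. The function names and the 10-second interval are my own
choices for illustration, not part of any existing tool. It assumes the
usual /proc/interrupts layout, where the NMI line starts with "NMI:"
followed by one count per CPU.

```python
import time


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text.

    Returns an empty list if there is no line starting with "NMI:".
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Newer kernels append a text description after the counts,
            # so keep only the purely numeric fields.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def nmi_is_ticking(before, after):
    """True if every CPU's NMI count increased between the two samples."""
    if not before or len(before) != len(after):
        return False
    return all(b < a for b, a in zip(before, after))


def sample_proc_interrupts(path="/proc/interrupts", interval=10.0):
    """Read the NMI counts twice, `interval` seconds apart (hypothetical helper)."""
    with open(path) as f:
        first = nmi_counts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = nmi_counts(f.read())
    return first, second

# Example use on the affected machine:
#   first, second = sample_proc_interrupts()
#   print(first, second, nmi_is_ticking(first, second))
```

If the counts stay at zero, or do not move on one of the CPUs, the
watchdog is not going to catch a lockup there.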
If the NMI watchdog fires, it will print a message to the console.
However, you will not see it if you are in X. Do you have a serial console
hooked up, by any chance? I strongly suggest setting one up if you have
the means.
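
A serial console is normally enabled with the console= kernel boot
parameter, e.g. console=ttyS0,115200 (the port and baud rate here are
assumptions; use whatever matches your hardware). As a rough sketch,
something like the following Python could confirm what the running kernel
was actually booted with. The helper names are made up for illustration.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: [values]}.

    Flags without "=" (such as "ro") map to [None]. A parameter given more
    than once (console= often is) keeps every value in order.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params


def serial_consoles(cmdline):
    """Return the console= values that point at a serial port (ttyS*)."""
    values = kernel_params(cmdline).get("console", [])
    return [v for v in values if v and v.startswith("ttyS")]


def read_cmdline(path="/proc/cmdline"):
    """Return the command line of the running kernel (hypothetical helper)."""
    with open(path) as f:
        return f.read().strip()

# Example use on the affected machine:
#   cmdline = read_cmdline()
#   print(serial_consoles(cmdline))
#   print(kernel_params(cmdline).get("nmi_watchdog"))
```

An empty list from serial_consoles() means kernel messages are not being
sent to a serial port, so a watchdog message printed while X is running
would be lost.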
Beyond this, if it is indeed a hard lockup, there is really nothing else
you can do without purchasing additional hardware to help debug the
problem.
Please give these suggestions a shot and let us know how it goes.
Thanks!
-Jeff