==> Regarding Re: [Wlug] HOWTO debug hard lockups; Andy Stewart <andystewart@comcast.net> adds:
andystewart> Jeff Moyer wrote:
==> Regarding [Wlug] HOWTO debug hard lockups; Andy Stewart <andystewart@comcast.net> adds:
andystewart> Hi gang,
andystewart> My dual Opteron machine is not happy. I cannot get more than
andystewart> 7 straight days of uptime without getting a hard lock,
andystewart> requiring a reboot. (My definition of hard lock is: machine
andystewart> responds neither to keyboard input, mouse input, nor network
andystewart> pings).

andystewart> I can stimulate hard locks by running OpenOffice 1.1.3 (I had
andystewart> 3 tonight, and 3-4 on a previous occasion while running
andystewart> OpenOffice). It makes no sense to me that an application run
andystewart> as a normal user could lock up a machine.

andystewart> I've tried setting "nmi_watchdog=1" to see if I could get an
andystewart> "oops" when it hard locks - no dice. Do you know any other
andystewart> tricks I could try to see if it is the kernel which is locking
andystewart> up? I'm running SuSE's version of 2.6.8.
Did you verify that NMIs are being delivered? After boot, cat /proc/interrupts and make sure the NMI line is non-zero. Also note that, at least with upstream and Red Hat kernels, the nmi_watchdog defaults to 1 for Opterons (i.e. you shouldn't need to manually set it).
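For anyone who wants to script the /proc/interrupts check rather than eyeball it, here is a minimal sketch in Python. The function names nmi_counts and nmis_delivered are made up for illustration, and the parsing assumes the usual layout of an "NMI:" label followed by one count per CPU.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only the leading numeric columns; some kernels append
            # a text description after the per-CPU counts.
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line found (assumption: treat as "no NMIs")


def nmis_delivered(interrupts_text):
    """True if at least one CPU shows a non-zero NMI count."""
    return any(count > 0 for count in nmi_counts(interrupts_text))
```

On the machine in question you would pass it the contents of the file, e.g. nmis_delivered(open("/proc/interrupts").read()). A False result only says the watchdog is not ticking; it does not say why.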
andystewart> Hi Jeff,

andystewart> Well, this is weird, I'm seeing a ZERO count for NMIs, so that
andystewart> makes me think they are NOT being delivered. How would I go
andystewart> about solving *that* little problem?

Well, you can try booting with nmi_watchdog=2. This will try to use the local APIC to deliver NMIs, but I haven't actually seen a dual processor system that required this (all of them I've seen work with nmi_watchdog=1). It is worth a try, however.

andystewart> I did a "cat /proc/cmdline" to ensure that I had
andystewart> "nmi_watchdog=1" and indeed it is there. Perhaps this is the
andystewart> clue we've been seeking.

Well, it only tells you that the nmi_watchdog won't trigger. We still have no insight into what the problem might actually be.
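When switching between nmi_watchdog=1 and nmi_watchdog=2, it is easy to lose track of what the running kernel was actually booted with. The sketch below pulls the value out of a /proc/cmdline-style string. The function name nmi_watchdog_setting is hypothetical, and keeping the last occurrence when the option is repeated is an assumption about how the kernel treats duplicates.

```python
def nmi_watchdog_setting(cmdline):
    """Return the nmi_watchdog value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "nmi_watchdog" and sep:
            # Assumption: if the option is repeated, the last one is kept.
            value = val
    return value
```

Usage would be along the lines of nmi_watchdog_setting(open("/proc/cmdline").read()). A return value of None means the option was not given at boot, in which case the kernel's own default applies.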
If the NMI watchdog works, it will print a message to the console. However, you will not see this if you are in X windows. Do you have a serial console hooked up, by any chance? I strongly suggest it if you have the means.
andystewart> I think I have a cable to which I could connect the serial
andystewart> port of the Opteron to the serial port of another Linux box
andystewart> (and then use minicom or some such terminal program).
Aside from this, if it is indeed a hard lockup, there is really nothing you can do (without purchasing other hardware to help debug the problem).
andystewart> Yeah, I was afraid of that.
Please give these suggestions a shot and let us know how it goes.
andystewart> Shall do - thanks, everybody!

andystewart> Oh, be advised that when you smash your fist on the keyboard
andystewart> after your system locks up for the umpteenth time, that a lot
andystewart> of dead skin cells will come flying upward out of the bowels
andystewart> of the keyboard. I would recommend safety glasses.

Noted. =)

-Jeff