Andy,

Are you sure it's the hard drive? Try running a Knoppix live CD for a while. If it's still running after (max uptime + 5) days, then perhaps it's something else.

If this message is redundant, ignore it!

Walt

On Tue, 2005-06-07 at 19:19 -0400, Andy Stewart wrote:
Jeff Moyer wrote:
==> Regarding [Wlug] HOWTO debug hard lockups; Andy Stewart <andystewart@comcast.net> adds:
andystewart> Hi gang,
andystewart> My dual Opteron machine is not happy. I cannot get more than
andystewart> 7 straight days of uptime without getting a hard lock,
andystewart> requiring a reboot. (My definition of hard lock is: machine
andystewart> responds neither to keyboard input, mouse input, nor network
andystewart> pings).
andystewart> I can stimulate hard locks by running OpenOffice 1.1.3 (I had
andystewart> 3 tonight, and 3-4 on a previous occasion while running
andystewart> OpenOffice). It makes no sense to me that an application run
andystewart> as a normal user could lock up a machine.
andystewart> I've tried setting "nmi_watchdog=1" to see if I could get an
andystewart> "oops" when it hard locks - no dice. Do you know any other
andystewart> tricks I could try to see if it is the kernel which is locking
andystewart> up? I'm running SuSE's version of 2.6.8.
Did you verify that NMIs are being delivered? After boot, cat /proc/interrupts and make sure the NMI line is non-zero. Also note that, at least with upstream and Red Hat kernels, the nmi_watchdog defaults to 1 for Opterons (i.e. you shouldn't need to manually set it).
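As a rough illustration of the check Jeff describes, here is a minimal Python sketch that pulls the NMI line out of /proc/interrupts-style text and reports the per-CPU counts. The function names are made up for this example. Requiring a non-zero count on every CPU is an assumption, stricter than "the NMI line is non-zero", so relax it if that does not match your setup.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text.

    Returns an empty list if there is no NMI line at all.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # The per-CPU counters follow the label. Some kernels append a
            # textual description after the numbers, so keep digits only.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def nmis_delivered(interrupts_text):
    """True only if an NMI line exists and every CPU count is non-zero.

    Assumption for this sketch: a working watchdog ticks on every CPU.
    """
    counts = nmi_counts(interrupts_text)
    return bool(counts) and all(c > 0 for c in counts)
```

On a live system you would feed it the real file, for example nmis_delivered(open("/proc/interrupts").read()), and run it twice a few seconds apart to see whether the counts are actually increasing.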
Hi Jeff,
Well, this is weird. I'm seeing a ZERO count for NMIs, so that makes me think they are NOT being delivered. How would I go about solving *that* little problem? I did a "cat /proc/cmdline" to ensure that I had "nmi_watchdog=1" and indeed it is there. Perhaps this is the clue we've been seeking.
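The /proc/cmdline check can be scripted the same way. The sketch below is only an illustration (kernel_param is a hypothetical helper, not an existing tool) of pulling one name=value pair out of a kernel command line. Note that finding the parameter there only shows the watchdog was requested; the NMI count in /proc/interrupts is what shows whether it is actually running.

```python
def kernel_param(cmdline, name):
    """Return the value of `name=value` on a kernel command line, or None.

    A bare flag with no '=' yields an empty string. If the parameter is
    given more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value
```

For example, kernel_param(open("/proc/cmdline").read(), "nmi_watchdog") would return "1" on the machine described here, and None on a machine booted without the option.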
If the NMI watchdog works, it will print a message to the console. However, you will not see this if you are in X windows. Do you have a serial console hooked up, by any chance? I strongly suggest it if you have the means.
I think I have a cable with which I could connect the serial port of the Opteron to the serial port of another Linux box (and then use minicom or some such terminal program).
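For the serial console idea, the kernel also has to be told to send its console output to the serial port, usually with a boot parameter of the form console=ttyS0,115200n8. The device name and line settings in that example are assumptions, so use whatever matches your hardware. A small sketch, with a hypothetical helper name, that reports which serial consoles a given command line asks for:

```python
def serial_consoles(cmdline):
    """List the serial devices named by console= parameters, in order."""
    consoles = []
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "console" and val.startswith("ttyS"):
            # Drop optional line settings such as ",115200n8".
            consoles.append(val.split(",", 1)[0])
    return consoles
```

An empty list from serial_consoles(open("/proc/cmdline").read()) would suggest that kernel messages, including any watchdog output, are not being sent to a serial port at all.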
Aside from this, if it is indeed a hard lockup, there is really nothing you can do (without purchasing other hardware to help debug the problem).
Yeah, I was afraid of that.
Please give these suggestions a shot and let us know how it goes.
Shall do - thanks, everybody!
Oh, be advised that when you smash your fist on the keyboard after your system locks up for the umpteenth time, a lot of dead skin cells will come flying upward out of the bowels of the keyboard. I would recommend safety glasses.
Later,
Andy
--
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org
_______________________________________________
Wlug mailing list
Wlug@mail.wlug.org
http://mail.wlug.org/mailman/listinfo/wlug