==> Regarding Re: [Wlug] HOWTO debug hard lockups; Walt Sawyer <wsawyer@norfolk-county.com> adds:

wsawyer> Andy, Are you sure it's the hard drive? Try running Knoppix live
wsawyer> CD for a while. If it's still running after days(max uptime)+5,
wsawyer> then perhaps it's something else.

This changes things entirely. He'll be running a different kernel, so this test is not useful, in my opinion.

-Jeff

wsawyer> If this message is redundant, ignore it! Walt

wsawyer> On Tue, 2005-06-07 at 19:19 -0400, Andy Stewart wrote:
Jeff Moyer wrote:
> ==> Regarding [Wlug] HOWTO debug hard lockups; Andy Stewart <andystewart@comcast.net> adds:
>
> andystewart> HI gang,
> andystewart> My dual Opteron machine is not happy. I cannot get more than
> andystewart> 7 straight days of uptime without getting a hard lock,
> andystewart> requiring a reboot. (My definition of hard lock is: machine
> andystewart> responds neither to keyboard input, mouse input, nor network
> andystewart> pings).
> andystewart> I can stimulate hard locks by running OpenOffice 1.1.3 (I had
> andystewart> 3 tonight, and 3-4 on a previous occasion while running
> andystewart> OpenOffice). It makes no sense to me that an application run
> andystewart> as a normal user could lock up a machine.
> andystewart> I've tried setting "nmi_watchdog=1" to see if I could get an
> andystewart> "oops" when it hard locks - no dice. Do you know any other
> andystewart> tricks I could try to see if it is the kernel which is locking
> andystewart> up? I'm running SuSE's version of 2.6.8.
> Did you verify that NMIs are being delivered? After boot, cat
> /proc/interrupts and make sure the NMI line is non-zero. Also note that,
> at least with upstream and Red Hat kernels, the nmi_watchdog defaults to 1
> for Opterons (i.e. you shouldn't need to manually set it).
HI Jeff,
Well, this is weird, I'm seeing a ZERO count for NMIs, so that makes me think they are NOT being delivered. How would I go about solving *that* little problem? I did a "cat /proc/cmdline" to ensure that I had "nmi_watchdog=1" and indeed it is there. Perhaps this is the clue we've been seeking.
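
For anyone repeating the two checks above, here is a minimal Python sketch that reads the NMI counts and the nmi_watchdog setting in one go. The function names (nmi_counts, cmdline_params, check_system) are made up for illustration, and the parsing assumes the usual /proc/interrupts layout where the NMI line starts with "NMI:" followed by one count per CPU.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text.

    Returns an empty list if there is no line starting with "NMI:".
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Newer kernels append a text description after the counts.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


def cmdline_params(cmdline_text):
    """Parse a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline_text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def check_system():
    """Print the NMI watchdog setting and NMI counts of the running machine."""
    with open("/proc/cmdline") as f:
        params = cmdline_params(f.read())
    with open("/proc/interrupts") as f:
        counts = nmi_counts(f.read())
    print("nmi_watchdog on command line:", params.get("nmi_watchdog"))
    print("NMI counts per CPU:", counts)
    if not any(counts):
        print("No NMIs counted so far, so the watchdog does not appear to be ticking.")
```

Calling check_system() a few minutes after boot, and again later, shows whether the counts are moving at all. All zeros with nmi_watchdog=1 present is the situation described above.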
> If the NMI watchdog works, it will print a message to the console.
> However, you will not see this if you are in X windows. Do you have a
> serial console hooked up, by any chance? I strongly suggest it if you have
> the means.
I think I have a cable with which I could connect the serial port of the Opteron to the serial port of another Linux box (and then use minicom or some such terminal program).
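
A serial console is normally enabled by adding something like "console=ttyS0,115200n8 console=tty0" to the kernel command line. The port (ttyS0) and speed (115200n8) here are assumptions, not values from this thread, and the terminal program on the other box has to be set to match. A small Python sketch, with made-up function names, to confirm after reboot that the running kernel really was given a serial console:

```python
def console_devices(cmdline_text):
    """Return (device, options) pairs for every console= kernel parameter.

    console= may appear more than once, so a list is returned, in order.
    """
    consoles = []
    for token in cmdline_text.split():
        if token.startswith("console="):
            device, _, options = token[len("console="):].partition(",")
            consoles.append((device, options))
    return consoles


def has_serial_console(cmdline_text):
    """True if at least one console= parameter names a ttyS* serial port."""
    return any(dev.startswith("ttyS") for dev, _ in console_devices(cmdline_text))
```

For example, has_serial_console(open("/proc/cmdline").read()) should return True once the parameter has taken effect.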
> Aside from this, if it is indeed a hard lockup, there is really nothing you
> can do (without purchasing other hardware to help debug the problem).
Yeah, I was afraid of that.
> Please give these suggestions a shot and let us know how it goes.
Shall do - thanks, everybody!
Oh, be advised that when you smash your fist on the keyboard after your system locks up for the umpteenth time, a lot of dead skin cells will come flying upward out of the bowels of the keyboard. I would recommend safety glasses.

Later,

Andy
--
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org
_______________________________________________
Wlug mailing list
Wlug@mail.wlug.org
http://mail.wlug.org/mailman/listinfo/wlug