Andy,
Are you sure it's the hard drive?  Try running a Knoppix live CD for a while.  If it's still running after your maximum uptime plus 5 days, then perhaps it's something else.

If this message is redundant, ignore it!
Walt

On Tue, 2005-06-07 at 19:19 -0400, Andy Stewart wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Moyer wrote:
> ==> Regarding [Wlug] HOWTO debug hard lockups; Andy Stewart <andystewart@comcast.net> adds:
> 
> andystewart> Hi gang,
> 
> andystewart> My dual Opteron machine is not happy.  I cannot get more than
> andystewart> 7 straight days of uptime without getting a hard lock,
> andystewart> requiring a reboot.  (My definition of hard lock is: machine
> andystewart> responds neither to keyboard input, mouse input, nor network
> andystewart> pings).
> 
> andystewart> I can stimulate hard locks by running OpenOffice 1.1.3 (I had
> andystewart> 3 tonight, and 3-4 on a previous occasion while running
> andystewart> OpenOffice).  It makes no sense to me that an application run
> andystewart> as a normal user could lock up a machine.
> 
> andystewart> I've tried setting "nmi_watchdog=1" to see if I could get an
> andystewart> "oops" when it hard locks - no dice.  Do you know any other
> andystewart> tricks I could try to see if it is the kernel which is locking
> andystewart> up?  I'm running SuSE's version of 2.6.8.
> 
> Did you verify that NMIs are being delivered?  After boot, cat
> /proc/interrupts and make sure the NMI line is non-zero.  Also note that,
> at least with upstream and Red Hat kernels, the nmi_watchdog defaults to 1
> for Opterons (i.e. you shouldn't need to manually set it).

Hi Jeff,

Well, this is weird, I'm seeing a ZERO count for NMIs, so that makes me
think they are NOT being delivered.  How would I go about solving *that*
little problem?  I did a "cat /proc/cmdline" to ensure that I had
"nmi_watchdog=1" and indeed it is there.  Perhaps this is the clue we've
been seeking.
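
For anyone who wants to repeat the check Jeff describes, here is a minimal
sketch in Python.  It assumes the usual /proc/interrupts layout, where the
line starting with "NMI:" carries one count per CPU (newer kernels append a
text description after the counts).  The function names nmi_counts and
nmi_delivered are made up for this example and are not part of any tool.

```python
import os


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Counts come first; stop at any trailing description text.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line present


def nmi_delivered(interrupts_text):
    """True only if an NMI line exists and every CPU shows a non-zero count."""
    counts = nmi_counts(interrupts_text)
    return bool(counts) and all(count > 0 for count in counts)


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        text = f.read()
    print("NMI counts per CPU:", nmi_counts(text))
    print("NMIs delivered on every CPU:", nmi_delivered(text))
```

Running it twice a few seconds apart and comparing the counts should show
whether they are actually increasing, which is a stronger sign that the
watchdog is ticking than a single non-zero reading.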

> If the NMI watchdog works, it will print a message to the console.
> However, you will not see this if you are in X windows.  Do you have a
> serial console hooked up, by any chance?  I strongly suggest it if you have
> the means.

I think I have a cable with which I could connect the serial port of the
Opteron to the serial port of another Linux box (and then use minicom or
some such terminal program).

> 
> Aside from this, if it is indeed a hard lockup, there is really nothing you
> can do (without purchasing other hardware to help debug the problem).

Yeah, I was afraid of that.

> 
> Please give these suggestions a shot and let us know how it goes.

Shall do - thanks, everybody!

Oh, be advised that when you smash your fist on the keyboard after your
system locks up for the umpteenth time, a lot of dead skin cells
will come flying upward out of the bowels of the keyboard.  I would
recommend safety glasses.

Later,

Andy

- --
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCpiuVHl0iXDssISsRAowjAJsHjggG0QsMPQ/H+2YQnzNZPtF9gQCfepvV
u6O8n+PSW4M0I1MHXsH06Xo=
=ViVg
-----END PGP SIGNATURE-----