Stupid Hardware Clock Tricks, aka: WTF?
Over this past weekend I upgraded my server from a P200/RedHat Linux 7.2 to a new Athlon 850/RedHat Linux 7.3. The Athlon was my trusty workstation for many a year. Everything's great, except for one problem: the clock goes crazy!

It seems to be a once per second issue:

$ perl -e '$t=time;while(1){$t2=time; warn join("\n",scalar localtime($t),scalar localtime($t2),"") if ($t2-$t>4); $t=$t2;}'
Wed Aug 21 17:17:28 2002
Wed Aug 21 18:29:03 2002
Wed Aug 21 17:17:29 2002
Wed Aug 21 18:29:04 2002
Wed Aug 21 17:17:30 2002
Wed Aug 21 18:29:05 2002

I use netsaint to monitor system health, and it keeps complaining about the time jumps:

Aug 21 18:28:03 eclectic netsaint: Warning: A system time change of 4296 seconds (forwards in time) has been detected. Compensating...
Aug 21 17:16:31 eclectic netsaint: Warning: A system time change of 4293 seconds (backwards in time) has been detected. Compensating...
Aug 21 18:29:42 eclectic netsaint: Warning: A system time change of 4296 seconds (forwards in time) has been detected. Compensating...
Aug 21 17:18:10 eclectic netsaint: Warning: A system time change of 4293 seconds (backwards in time) has been detected. Compensating...

Of course, random checks start failing as well, since they occasionally run during the time jump, and if they're checking for time since something occurred (log entry, cron run, query response time, etc) ...

I've tried "hwclock --systohc" to force the hardware clock to the same time as the system; that didn't help. I'm running NTP and thought maybe it was having problems, so I tried turning that off; that didn't help either. I went through the BIOS and tweaked memory/IRQ settings -- I didn't think it would really solve anything, but the setting changes and reboot kept the behavior away for ~14-15 hours.

So I'm completely stumped as to what causes this problem. Has anyone seen anything like this before? Any suggestions for things to look at? I'm at the point where I'm planning to just buy a new bit of hardware, migrate the drives and such again, and go from there. With my luck the problem would follow me of course... :(

Thanks in advance. :)

BTW: If anyone is wondering, I'm running SGI's XFS patched 2.4.18 kernel: 2.4.18-SGI_XFS_1.1. I'm thinking about dropping back to a 2.4.9 kernel w/ XFS support and seeing if that works any better. :( I was running 2.4.18 on the P200 for quite a while though without problem.

--
Randomly Generated Tagline:
I don't want to look like a weirdo. I'll just go with a muumuu.
-- Homer Simpson
King-Size Homer
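The Perl one-liner in the message above only reports forward jumps (it tests `$t2-$t>4`), while netsaint logs jumps in both directions. A minimal Python sketch of a two-way check, for anyone who wants to log both: the name `find_jumps` and the sample timestamps are illustrative assumptions, and only the 4-second threshold is borrowed from the one-liner.

```python
def find_jumps(timestamps, threshold=4):
    """Return (index, delta) pairs for consecutive samples whose
    difference exceeds `threshold` seconds in either direction."""
    jumps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if abs(delta) > threshold:
            jumps.append((i, delta))
    return jumps


# Made-up samples: the clock leaps forward, then falls back again.
samples = [100, 101, 4397, 4398, 105]
for index, delta in find_jumps(samples):
    direction = "forwards" if delta > 0 else "backwards"
    print(f"sample {index}: {abs(delta)} seconds {direction}")
```

In live use the samples would come from repeated calls to `time.time()`, one per loop iteration, just as the one-liner calls `time`.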
i run 2.4.1{6,8}-xfs w/out any time issues (so as to dispel the idea that the kernel is the source of the problem)
-mike
On Wed, 21 Aug 2002, Michael Frysinger wrote:
MF> i run 2.4.1{6,8}-xfs w/out any time issues
MF> (so as to dispell the source of the problem being kernel)
MF>
MF> From: "Theo Van Dinter" <felicity@kluge.net>
MF> > BTW: If anyone is wondering, I'm running SGI's XFS patched 2.4.18 kernel:
MF> > 2.4.18-SGI_XFS_1.1. I'm thinking about dropping back to a 2.4.9 kernel
MF> > w/ XFS support and seeing if that works any better. :( I was running
MF> > 2.4.18 on the P200 for quite a while though without problem.

the kernel didn't change, so that's not likely the problem...

(I did try to warn Theo to not change too many things at once, but... still did the hardware upgrade at the same time as the RH7.2->7.3 upgrade... <sigh>)

--
--==*==-- --==*==-- Michelle R. Vadeboncoeur --==*==-- --==*==--
mrv@kluge.net: http://www.kluge.net/~mrv/
On Wed, Aug 21, 2002 at 05:31:28PM -0400, Theo Van Dinter wrote:
felicity> I've tried "hwclock --systohc" to force the hardware clock to the same
felicity> time as the system, didn't help. I'm running NTP and thought maybe

The hardware clock is only read during boot, so that couldn't be it...

The system clock keeps time by the system timer interrupt, I believe. The timer interrupt is generated by the system's RTC chip. On x86 platforms it is fixed at 100 Hz (only Alpha uses a different value--1024 Hz). I remember reading that the HZ timer could be changed on any platform now, but maybe that was for 2.5 only.

So, I wonder if it has something to do with the timer interrupt? Take a look at IRQ 0 in /proc/interrupts. Take a sampling of values over time, and figure out how many interrupts there are per second. Maybe that will shed some light on this...

--
Charles R. Anderson <cra@wpi.edu> / http://angus.ind.wpi.edu/~cra/
PGP Key ID: 49BB5886
Fingerprint: EBA3 A106 7C93 FA07 8E15 3AC2 C367 A0F9 49BB 5886
On Wed, Aug 21, 2002 at 06:37:25PM -0400, Charles R. Anderson wrote:
> The hardware clock is only read during boot, so that couldn't be it...

hmmm.

> The system clock keeps time by the system timer interrupt, I believe. The timer interrupt is generated by the system's RTC chip. On x86 platforms it is fixed at 100 Hz (only Alpha uses a different value--1024 Hz). I remember reading that the HZ timer could be changed on any platform now, but maybe that was for 2.5 only.

That's about right, a quick 'while(1){grep ' 0' /proc/interrupts;sleep 1}' shows ~101 per second, with the overhead of doing the fork/grep...

> Take a look at IRQ 0 in /proc/interrupts. Take a sampling of values over time, and figure out how many interrupts there are per second. Maybe that will shed some light on this...

I have a script checking per second and reporting if the difference falls outside the 90-110 range (diff < 90 or diff > 110). I figure a 2% skew in a second isn't horribly bad. I ran it for over 5 minutes, and nothing was displayed. :(

--
Randomly Generated Tagline:
"BUGS: This manpage is confusing." - man page for getopt
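The per-second check described above was not posted, so what follows is only a guess at its shape: a minimal Python sketch that takes timer counts sampled roughly one second apart and returns the deltas outside a tolerance band. The function name `out_of_range` and the sample counts are made up for illustration; the 90 and 110 bounds are the ones mentioned in the message.

```python
def out_of_range(counts, low=90, high=110):
    """Given IRQ 0 counts sampled about one second apart, return the
    per-sample deltas that fall outside the inclusive range [low, high]."""
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    return [d for d in deltas if d < low or d > high]


# Hypothetical counts: the third interval is short (49 ticks).
samples = [1000, 1101, 1201, 1250, 1351]
print(out_of_range(samples))
```

As the next reply in the thread points out, the one-second spacing here still depends on the system clock, so a quiet result does not prove the timer is healthy.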
On Wed, Aug 21, 2002 at 07:15:14PM -0400, Theo Van Dinter wrote:
felicity> That's about right, a quick 'while(1){grep ' 0' /proc/interrupts;sleep
felicity> 1}' shows ~101 per second, with the overhead of doing the fork/grep...

So let me get this straight...you are using the system clock to measure the system clock? :)

I suggest writing down the value, counting five minutes with an EXTERNAL clock, such as a stopwatch, and then writing down the value again. (value2 - value1) / (60*5) should equal about 100.

Of course, I don't see how a skew in the timer tick could cause the system time to change by a whole hour...

There is also the possibility of adjtime screwing up. There are syscalls related to adjusting the system clock used by NTP et al., and I remember reading that xntpd had to keep up with changes between kernel revisions, or weird time setting issues would result.

The program adjtimex may be interfering in some way. What does /sbin/adjtimex --print display?

--
Charles R. Anderson <cra@wpi.edu> / http://angus.ind.wpi.edu/~cra/
PGP Key ID: 49BB5886
Fingerprint: EBA3 A106 7C93 FA07 8E15 3AC2 C367 A0F9 49BB 5886
On Wed, Aug 21, 2002 at 07:33:01PM -0400, Charles R. Anderson wrote:
> So let me get this straight...you are using the system clock to measure the system clock? :)

;) The interrupt change would still be outside the 98 - 102 range if it got funky, but I know what you're saying. I'll test externally after dinner.

> The program adjtimex may be interfering in some way. What does /sbin/adjtimex --print display?

<theo installs adjtimex>

         mode: 0
       offset: 16038
    frequency: 3138214
     maxerror: 148248
     esterror: 17792
       status: 1
time_constant: 2
    precision: 1
    tolerance: 33554432
         tick: 10000
     raw time: 1029973970s 363977us = 1029973970.363977

--
Randomly Generated Tagline:
"I sometimes think they choose guards based on the bone content of their heads." - Londo on Babylon 5
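Nobody in the thread interprets these numbers, so here is a rough Python sketch of how they might be read. It rests on two assumptions about units that should be checked against the adjtimex man page for the installed version: that `tick` is microseconds per timer tick, and that `frequency` is in units of 1/65536 ppm. The helper names `parse_adjtimex` and `summarize` are hypothetical.

```python
def parse_adjtimex(text):
    """Collect 'name: value' lines with plain integer values from
    adjtimex --print style output; other lines are skipped."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            values[name.strip()] = int(value)
    return values


def summarize(values):
    """Derive the tick rate and frequency offset.
    ASSUMPTION: tick is in microseconds, frequency in 1/65536 ppm."""
    return {
        "hz": 1_000_000 / values["tick"],
        "freq_ppm": values["frequency"] / 65536,
    }


output = """\
frequency: 3138214
tick: 10000
raw time: 1029973970s 363977us = 1029973970.363977
"""
print(summarize(parse_adjtimex(output)))
```

If those unit assumptions hold, the output above corresponds to a 100 Hz tick and a frequency correction of roughly 48 ppm, neither of which looks like it could account for a jump of more than an hour.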
On Wed, Aug 21, 2002 at 07:53:33PM -0400, Theo Van Dinter wrote:
> it got funky, but I know what you're saying. I'll test externally after dinner.

Ok, so I got caught up after dinner ... ;) I did go and change the BIOS settings again, around 11pm on Wednesday. The problem then stayed away until ~4:12am on the 23rd.

> I suggest writing down the value, counting five minutes with an EXTERNAL clock, such as a stopwatch, and then writing down the value again. (value2 - value1) / (60*5) should equal about 100.

Ok, here we go: (date as reported from different system)

Fri Aug 23 09:50:26 EDT 2002
           CPU0
  0:   13720765          XT-PIC  timer
  1:       3475          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:     753652          XT-PIC  serial
  8:          1          XT-PIC  rtc
 10:     320822          XT-PIC  eth0, eth1
 11:     996123          XT-PIC  3ware Storage Controller
 15:     190389          XT-PIC  ncr53c8xx
NMI:          0
LOC:   13720778
ERR:       4588
MIS:          0

Fri Aug 23 10:28:45 EDT 2002
           CPU0
  0:   13950661          XT-PIC  timer
  1:       3475          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:     761173          XT-PIC  serial
  8:          1          XT-PIC  rtc
 10:     326570          XT-PIC  eth0, eth1
 11:    1008701          XT-PIC  3ware Storage Controller
 15:     190389          XT-PIC  ncr53c8xx
NMI:          0
LOC:   13950673
ERR:       4622
MIS:          0

The first thing I notice is that int 4 is getting hit quite a bit, and I have no idea why. The device hooked up to that serial port is a UPS, and shouldn't be generating so many interrupts. Hmmm.

Anyway, between the two times is 38:19, or 2299 seconds. So:

(13950661-13720765) / 2299 = 99.9982601130927

So that part seems ok. :|

--
Randomly Generated Tagline:
That is a known bug in 5.00550. Either an upgrade or a downgrade will fix it.
-- Larry Wall in <6vu1vo$89c@kiev.wall.org>
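The arithmetic above can be reproduced with a short Python sketch. It pulls the IRQ 0 count out of /proc/interrupts-style text and divides the difference by the elapsed time taken from the external clock. The function names `timer_count` and `ticks_per_second` are illustrative, and the parsing assumes the single-CPU layout shown in this message.

```python
from datetime import datetime


def timer_count(interrupts_text):
    """Return the CPU0 count on the IRQ 0 line of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            return int(fields[1])
    raise ValueError("IRQ 0 line not found")


def ticks_per_second(count1, when1, count2, when2, fmt="%H:%M:%S"):
    """Average timer interrupts per second between two externally
    timed samples taken on the same day."""
    start = datetime.strptime(when1, fmt)
    end = datetime.strptime(when2, fmt)
    return (count2 - count1) / (end - start).total_seconds()


first = timer_count("           CPU0\n  0:   13720765          XT-PIC  timer\n")
second = timer_count("           CPU0\n  0:   13950661          XT-PIC  timer\n")
print(ticks_per_second(first, "09:50:26", second, "10:28:45"))
```

Taking the timestamps from a different machine, as done here, is what keeps the measurement independent of the clock under suspicion.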
On Fri, Aug 23, 2002 at 10:33:30AM -0400, Theo Van Dinter wrote:
> Ok, so I got caught up after dinner ... ;) I did go and change the BIOS settings again, around 11pm on Wednesday. The problem then stayed away until ~4:12am on the 23rd.

Just an update for those who are curious ... I decided to try reverting to 2.4.9 (actually 2.4.9-31SGI_XFS_1.1) and seeing what happened. So far, it's been almost 5 days without any issues. So far, so good. :| I just hope it's a 2.4.18 issue, so if I went to 2.4.19+ things would work. Guess I'll find out at some point.

--
Randomly Generated Tagline:
Even the Chinese are against me.
-- Homer Simpson
The Last Temptation of Homer
participants (4)
- Charles R. Anderson
- Michael Frysinger
- Michelle Vadeboncoeur
- Theo Van Dinter