[mdlug] Server stability
Robert Adkins
radkins at impelind.com
Wed Jun 10 14:53:34 EDT 2009
Mike,
Rebooting under load can mean a whole slew of different things.
It's possible that the Northbridge chip on the mainboard is overheating and
causing a reboot. It's possible that the CPU is overheating and causing a
reboot. It's possible that the RAM is overheating...
There could also be a software conflict caused by an incompatibility
between the software, the kernel and the CPU. (Recompiling the kernel
specifically for the hardware in your server could be a fix for this.)
Do you have a way of monitoring the temperatures of the CPU and other
elements of the PC when a heavy load is being applied to the server?
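If nothing is logging temperatures yet, something like the following could
be a starting point. It is only a sketch: it assumes sensor drivers are
loaded and expose readings in millidegrees Celsius as temp*_input files
under /sys/class/hwmon, which may not be the case on your kernel or board.
The log_temps function name and the log path in the note below are made up
for illustration.

```shell
#!/bin/sh
# Print one timestamped line per temperature sensor found under a
# hwmon-style sysfs tree. The default root is an assumption; pass a
# different directory as $1 if your system exposes the files elsewhere.
log_temps() {
    root="${1:-/sys/class/hwmon}"
    now=$(date '+%Y-%m-%d %H:%M:%S')
    # Some kernels put the files under hwmon*/device/ instead, so try both.
    for f in "$root"/hwmon*/temp*_input "$root"/hwmon*/device/temp*_input; do
        # An unmatched glob stays literal and is skipped here.
        [ -r "$f" ] || continue
        milli=$(cat "$f")
        # Readings are millidegrees Celsius; print whole degrees.
        echo "$now $f $((milli / 1000))C"
    done
}
```

Run it in a loop while you load the box, for example
"while true; do log_temps >> /root/temps.log; sync; sleep 5; done", and
look at the last few lines of the log after the next reboot.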
-Rob
> -----Original Message-----
> From: mdlug-bounces at mdlug.org
> [mailto:mdlug-bounces at mdlug.org] On Behalf Of Michael ORourke
> Sent: Wednesday, June 10, 2009 2:41 PM
> To: MDLUG's main mailing list
> Subject: [mdlug] Server stability
>
> Lugnuts,
>
> I've got a server that is experiencing some stability issues.
> The server was built with CentOS 5.2 i386 and has been
> patched recently. It's running Oracle XE, tomcat, httpd, and
> some java. But when I put any load on the system, it just
> spontaneously reboots. Any thoughts and/or suggestions?
> Would we be better off reloading this box with an x86_64 bit
> version of CentOS?
> Here is some info on the server...
>
> --kernel
> [root at patchserv proc]# uname -a
> Linux patchserv.xxxxxxxxxxxxxxxxxx.xxx 2.6.18-128.1.6.el5PAE #1 SMP Wed Apr 1 10:02:22 EDT 2009 i686 i686 i386 GNU/Linux
>
> --memory info (8GB RAM)
> [root at patchserv log]# cat /proc/meminfo
> MemTotal: 8178228 kB
> MemFree: 300404 kB
> Buffers: 43884 kB
> Cached: 6252688 kB
> SwapCached: 44140 kB
> Active: 3276088 kB
> Inactive: 4392996 kB
> HighTotal: 7470840 kB
> HighFree: 17988 kB
> LowTotal: 707388 kB
> LowFree: 282416 kB
> SwapTotal: 2031608 kB
> SwapFree: 1987468 kB
> Dirty: 136 kB
> Writeback: 0 kB
> AnonPages: 1267384 kB
> Mapped: 684764 kB
> Slab: 144172 kB
> PageTables: 50672 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> CommitLimit: 6120720 kB
> Committed_AS: 3973640 kB
> VmallocTotal: 116728 kB
> VmallocUsed: 3900 kB
> VmallocChunk: 112712 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> Hugepagesize: 2048 kB
>
> --processor info (8 CPUs)
> processor : 7
> vendor_id : GenuineIntel
> cpu family : 15
> model : 4
> model name : Intel(R) Xeon(TM) CPU 2.80GHz
> stepping : 8
> cpu MHz : 2793.483
> cache size : 2048 KB
> physical id : 1
> siblings : 4
> core id : 1
> cpu cores : 2
> apicid : 7
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr lahf_lm
> bogomips : 5586.43
>
> --recent reboots (not user initiated)
> [root at patchserv log]# last | grep reboot
> reboot system boot 2.6.18-128.1.6.e Wed Jun 10 07:42 (06:50)
> reboot system boot 2.6.18-128.1.6.e Wed Jun 10 07:33 (00:04)
> reboot system boot 2.6.18-128.1.6.e Tue Jun 9 01:12 (1+06:24)
> reboot system boot 2.6.18-128.1.6.e Tue Jun 9 01:03 (00:04)
> reboot system boot 2.6.18-128.1.6.e Tue Jun 9 00:26 (00:41)
> reboot system boot 2.6.18-128.1.6.e Tue Jun 9 00:18 (00:03)
>
> --/var/log/messages around the time of the last reboot...
> Jun 10 07:33:26 patchserv kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> Jun 10 07:33:26 patchserv kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
> Jun 10 07:37:30 patchserv kdump: saved a vmcore to /var/crash/2009-06-10-07:33
> Jun 10 07:37:31 patchserv shutdown[3048]: shutting down for system reboot
> Jun 10 07:37:31 patchserv init: Switching to runlevel: 6
> Jun 10 07:37:32 patchserv rpc.statd[2983]: Caught signal 15, un-registering and exiting.
> Jun 10 07:37:32 patchserv portmap[3087]: connect from 127.0.0.1 to unset(status): request from unprivileged port
> Jun 10 07:37:33 patchserv auditd[2888]: The audit daemon is exiting.
> Jun 10 07:37:33 patchserv kernel: audit(1244633853.205:5): audit_pid=0 old=2888 by auid=4294967295
> Jun 10 07:37:33 patchserv kernel: Kernel logging (proc) stopped.
> Jun 10 07:37:33 patchserv kernel: Kernel log daemon terminating.
> Jun 10 07:37:34 patchserv exiting on signal 15
> Jun 10 07:42:29 patchserv syslogd 1.4.1: restart.
> Jun 10 07:42:29 patchserv kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Jun 10 07:42:29 patchserv kernel: Linux version 2.6.18-128.1.6.el5PAE (mockbuild at builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Apr 1 10:02:22 EDT 2009
> Jun 10 07:42:29 patchserv kernel: BIOS-provided physical RAM map:
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 0000000000100000 - 00000000bffc0000 (usable)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000bffc0000 - 00000000bffcfc00 (ACPI data)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000bffcfc00 - 00000000bffff000 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000e0000000 - 00000000fec90000 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 0000000100000000 - 00000001ffffe000 (usable)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 00000001ffffe000 - 0000000200000000 (reserved)
> Jun 10 07:42:29 patchserv kernel: BIOS-e820: 0000000200000000 - 0000000240000000 (usable)
> Jun 10 07:42:29 patchserv kernel: 8320MB HIGHMEM available.
> Jun 10 07:42:29 patchserv kernel: 896MB LOWMEM available.
> Jun 10 07:42:29 patchserv kernel: found SMP MP-table at 000fe710
> Jun 10 07:42:29 patchserv kernel: NX (Execute Disable) protection: active
> Jun 10 07:42:29 patchserv kernel: DMI 2.3 present.
> Jun 10 07:42:29 patchserv kernel: Using APIC driver default
> Jun 10 07:42:29 patchserv kernel: ACPI: PM-Timer IO Port: 0x808
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #0 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #6 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #2 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #4 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x05] lapic_id[0x01] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #1 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x06] lapic_id[0x07] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #7 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x07] lapic_id[0x03] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #3 15:4 APIC version 20
> Jun 10 07:42:29 patchserv kernel: ACPI: LAPIC (acpi_id[0x08] lapic_id[0x05] enabled)
> Jun 10 07:42:29 patchserv kernel: Processor #5 15:4 APIC version 20
>
> Thanks,
> Mike