[mdlug] Fwd: [Nagios problem]

Wed Aug 20 08:29:34 EDT 2014

David L Lambert wrote:
> I installed the Nagios controller stuff (packages "nagios3-cgi",
> "nagios3-core") on the Debian 7 VM.  Problem is, it generates a big
> batch of alerts, including ones like the below that are just about being
> unable to connect via SSH to one of the hosts it's monitoring, two or
> three times a day.  The VM doesn't seem to be underpowered on memory:
>
> -------- Original Message --------
> Subject: 	** PROBLEM Service Alert: Compaq Presario 5000 [***] in
> basement/Current Load is CRITICAL **
> Date: 	Wed, 20 Aug 2014 06:29:36 -0400
> From: 	nagios at lmert.com
> To: 	root at localhost
>
>
>
> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: Current Load
> Host: Compaq Presario 5000 [***] in basement
> Address: 192.168.[XX.XX]
> State: CRITICAL
>
> Date/Time: Wed Aug 20 06:29:29 EDT 2014
>
> Additional Info:
>
> CRITICAL - Plugin timed out after 10 seconds

Hi David,

You mentioned that there is a problem when connecting
with ssh, but I don't see any indication of that above.
It's saying that the "Current Load" service is timing
out, and that could happen for many reasons including
intermittent ssh failure, on the client or the server.
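
One way to tell a slow check from a dead connection is to run
the same command by hand a few times and time it against the
10-second limit from the alert.  Below is a minimal Python sketch
of that idea.  The function name timed_run and the example ssh
command in the comment are made up for illustration; substitute
whatever command your "Current Load" check really runs.

```python
import subprocess
import time


def timed_run(cmd, limit=10.0):
    """Run cmd and return (seconds_elapsed, timed_out).

    limit mirrors the 10-second plugin timeout shown in the alert.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        timed_out = True
    return time.monotonic() - start, timed_out


# Hypothetical usage ("monitored-host" is a placeholder, not a real name):
#   elapsed, timed_out = timed_run(
#       ["ssh", "-o", "BatchMode=yes", "monitored-host", "uptime"])
#   print(f"{elapsed:.2f}s timed_out={timed_out}")
```

If the elapsed time regularly creeps toward the limit, the check
is slow rather than failing outright, which points at load or
network latency more than at ssh itself.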

I would start by looking at the system logs on both
ends at the time that the plugin timed out.  If you
happen to be there while the problem is occurring,
atop is your friend, provided that you've already
got it installed.
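
To narrow the logs down to the moment of the timeout, something
like the following Python sketch can help.  It assumes the
traditional syslog timestamp ("Aug 20 06:29:29") at the start of
each line and an English month name; the function name lines_near
and the two-minute window are arbitrary choices, not anything
Nagios provides.

```python
from datetime import datetime, timedelta


def lines_near(lines, when, window_seconds=120):
    """Yield syslog-style lines stamped within window_seconds of `when`."""
    window = timedelta(seconds=window_seconds)
    for line in lines:
        try:
            # Traditional syslog stamps carry no year, so borrow it
            # from `when`.  %b depends on the locale (assumed English).
            stamp = datetime.strptime(
                f"{when.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line, skip it
        if abs(stamp - when) <= window:
            yield line


# Hypothetical usage; the log path is an assumption and differs by distro:
#   alert = datetime(2014, 8, 20, 6, 29, 29)
#   with open("/var/log/syslog") as log:
#       for line in lines_near(log, alert):
#           print(line, end="")
```

Run it against the logs on both the Nagios VM and the monitored
host, using the Date/Time from the alert as the centre of the
window.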

Consider installing sysstat and using sar to see if it
gives you any clues.  My CentOS systems by default grab
various statistics every 10 minutes and keep them for
a month.  This is good because it provides a baseline
of what is normal throughout the day and throughout the
week.
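
Once sar has some history, you can ask it for the load averages
around the time of an alert.  The Python sketch below only builds
the command line: -q reports queue length and load averages, -s
and -e set the start and end times, and -f selects a saved data
file.  The function name sar_load_cmd and the 15-minute margin
are illustrative assumptions, and where the daily data files live
depends on the distribution.

```python
from datetime import datetime, timedelta


def sar_load_cmd(alert_time, minutes=15, data_file=None):
    """Build a sar command covering `minutes` either side of alert_time."""
    margin = timedelta(minutes=minutes)
    # -s and -e take only a time of day, so clamp the window to one day.
    day_start = alert_time.replace(hour=0, minute=0, second=0, microsecond=0)
    day_end = alert_time.replace(hour=23, minute=59, second=59, microsecond=0)
    start = max(alert_time - margin, day_start)
    end = min(alert_time + margin, day_end)
    cmd = [
        "sar", "-q",
        "-s", start.strftime("%H:%M:%S"),
        "-e", end.strftime("%H:%M:%S"),
    ]
    if data_file:
        # Read a saved daily file instead of today's data.
        cmd += ["-f", data_file]
    return cmd


# Hypothetical usage, with the alert time from the message above:
#   import subprocess
#   subprocess.run(sar_load_cmd(datetime(2014, 8, 20, 6, 29, 29)))
```

With samples taken every 10 minutes, a 15-minute margin should
give about three rows, enough to compare against the same time
on other days.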

If sar doesn't point you at the reason, nmon will.  The
drawbacks are that it's more difficult to configure, it
will use more resources, and it can be hard to decipher
the output.  But the advantage is that it records a *lot*
of system activity as often as you run the cronjob.  On
some systems I ran it every minute of every day while
looking for certain intermittent failures.

Yeah, I realize this sounds like work, but I'd take this
approach rather than trying to guess causes or tinkering
with the nice command.  On a small installation like you
describe nagios doesn't need a lot of resources, and it
tends to work like a canary in a coal mine.

c