Installing Nagios server and NRPE

What is Nagios?

Nagios Core – Free and open source
        nagios.org
Nagios XI – Enterprise level, paid with support features etc.
        nagios.com

Nagios Server

  • Runs the service and host checks that you define. It also holds the configuration definitions, sends notifications (email, third-party phone calls and SMS), and serves the web interface

NRPE – Nagios Remote Plugin Executor

  • Local agent that lets the Nagios server get a command's return message from the host you're checking (disk usage, CPU load)

 

Let’s Build It!

Requirements:

CentOS 6.x or 7.x minimal
Ideally 2 Linux VMs, 1 to be the Nagios server and 1 to be a host to check against
Root access
###########################################################################
* Verify SELinux is disabled or in permissive mode
sudo setenforce 0
This will set SELinux to permissive mode without needing a restart
* Install needed packages
sudo yum install epel-release

sudo yum install nagios nagios-plugins-all nagios-plugins-nrpe nrpe php httpd vim

 ** If you have iptables or firewalld running, you'll want to open up ports 80 and 5666. Some minimal installs ship with one of them enabled, so check before assuming the ports are reachable **
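
As a rough sketch of what opening those ports might look like: the helper below prints each command when DRY_RUN is set and runs it otherwise. The names run, open_nagios_ports and DRY_RUN are made up for this example. The firewall-cmd and iptables invocations are the usual ones, but check them against your own rule set before running them as root.

```shell
# Hypothetical helper: echo the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Open the web UI port (80) and the NRPE port (5666).
# Argument: which firewall is in use, "firewalld" or "iptables".
open_nagios_ports() {
    case "$1" in
        firewalld)
            run firewall-cmd --permanent --add-port=80/tcp
            run firewall-cmd --permanent --add-port=5666/tcp
            run firewall-cmd --reload
            ;;
        iptables)
            run iptables -I INPUT -p tcp --dport 80 -j ACCEPT
            run iptables -I INPUT -p tcp --dport 5666 -j ACCEPT
            run service iptables save
            ;;
        *)
            echo "usage: open_nagios_ports firewalld|iptables" >&2
            return 1
            ;;
    esac
}
```

Setting DRY_RUN=1 and calling open_nagios_ports firewalld shows what would be run without touching the firewall. Strictly, port 80 only matters on the Nagios server and 5666 only on the hosts running NRPE.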

Change admin password
sudo htpasswd /etc/nagios/passwd nagiosadmin
Enter your password
Enable nagios and httpd on boot
sudo chkconfig httpd on && sudo chkconfig nagios on
Fire it up!
sudo service httpd start
sudo service nagios start
In a web browser:
<ip addr>/nagios
You’ve started Nagios! Login with:
User: nagiosadmin
Password: <the new password you just created>

Now Let’s Look at configuration

sudo su
cd /etc/nagios
ls
cgi.cfg conf.d/ nagios.cfg objects/ passwd private/
objects/ and nagios.cfg are the things you care most about right now
cd objects/
ls
commands.cfg hosts.cfg printer.cfg switch.cfg timeperiods.cfg
contacts.cfg localhost.cfg services.cfg templates.cfg windows.cfg

Host Checks

Is the server up? This can be a ping to an ip address, DNS check, or website

Service Checks

CPU load, swap usage, disk utilization, process running etc.
Default install will configure a template for localhost – Go to your <ip addr>/nagios —> On the left-hand side click Services
  • Current Load, Current Users, HTTP, PING, Root Partition, SSH, Swap Usage, Total Processes
Hierarchy:
Any .cfg file will be processed as a standalone file, so localhost.cfg can be set up by itself. You could use a dynamic CI/CD approach and make a .cfg file for every one of your servers, but that would be crazy town. Or would it? << Show some of the dynamic config we have for prod and stage at Craftsy >>
You can create templates for teams to alert on certain servers, have phone calls or just emails, non-alerting stuff etc.
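
To make the "one .cfg file per server" idea concrete, here is a minimal sketch, not the setup referred to above. It assumes a plain-text inventory with one "name address" pair per line. The function names and the inventory format are invented for illustration, and generic-host is just the stock template name.

```shell
# Hypothetical helper: print a Nagios host definition for one server.
# Arguments: host name, address, and (optionally) the template to inherit from.
make_host_cfg() {
    name="$1"
    address="$2"
    template="${3:-generic-host}"
    cat <<EOF
define host {
    host_name       $name
    alias           $name
    address         $address
    use             $template
}
EOF
}

# Write one .cfg file per "name address" line of an inventory file.
# Arguments: inventory file, output directory.
generate_host_cfgs() {
    inventory="$1"
    outdir="$2"
    mkdir -p "$outdir"
    while read -r name address; do
        # Skip blank lines and comments.
        case "$name" in ''|'#'*) continue ;; esac
        make_host_cfg "$name" "$address" > "$outdir/$name.cfg"
    done < "$inventory"
}
```

Something like generate_host_cfgs servers.txt /etc/nagios/conf.d would then drop the files where Nagios can pick them up, assuming your nagios.cfg has a cfg_dir line for that directory. Run the config check before restarting.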

061183nagios03.png

Image via: https://www.rittmanmead.com/blog/2012/09/an-introduction-to-monitoring-obiee-with-nagios/

 

Let’s Define Contacts First

vim contacts.cfg
 Let’s make a new contact, and add a service to that contact
define contactgroup {
       contactgroup_name       ops
       alias                   Ops Team
       members                 nagiosadmin
}
—> Update nagiosadmin email to be your email
email nagios@localhost ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
Look back at templates.cfg, and the top one for generic-contact
name                           The name that you call in other .cfg configuration files
service_notification_period    The time that you want *service* alerts to fire. Can configure work-hours, 24x7 etc
host_notification_period       The time you want *host* alerts to fire
service_notification_options   Warning, Unknown, Critical, Recovery, Flapping, Scheduled downtime
host_notification_options      Down, Unreachable, Recovery, Flapping, Scheduled downtime
service_notification_commands  Define email alerts, third party integrations (VictorOps, PagerDuty, OpsGenie)
host_notification_commands     Define alerts for host notifications
register                       Partial definition or not
         ** A note about the register portion. Use this if you are making a template that is a partial object definition. This allows inheritance within other definitions **
So we added our contact to the new contact group we made. The contact itself uses the generic-contact template, so that configuration applies to everyone in the group

Let’s Define A Host To Alert On

Make a new file called hosts.cfg
define host {
    host_name       sofree
    alias           Software Freedom School
    address         sofree.us
    use             generic-host
    contacts        ops        ; The contact we just made
}

Now Define What The Parameters Of That Host Check Should Be

vim templates.cfg
define host {
    name sofree-host
    use generic-host ; This grabs the notification period, notifications enabled, flap detection etc
    check_period 24x7 ; What hours this should check
    check_interval 5 ; How often to check, in minutes
    retry_interval 1 ; How often to retry when it fails
    max_check_attempts 10 ; How many times to retry until it alerts. In this config, you will get an alert after 10 minutes of the server being down
    check_command check-host-alive; Another template for how to check for the host, currently a template for a simple ping. You may make a different host check for http host alive, etc.
    notification-options d,u,r ; When should notify happen - Down, Unreachable, Recovery
    contacts ops ; Who to alert to, options are contacts or contact groups
    register 0 ; Make this a template
}
service nagios restart
!! You’ll get an error!!
 << notification-options should be notification_options: be careful of underscores vs dashes. You can use the Nagios pre-flight check to verify your syntax before you take the service offline and leave it in a bad state >>
nagios -v /etc/nagios/nagios.cfg
 You’ll see this error:
Error: Invalid host object directive ' '.
Error: Could not add object property in file '/etc/nagios/objects/templates.cfg' on line 199.
 Error processing object config files!

This is because the notification options directive should have an underscore, not a dash

notification_options d,u,r ; When should notify happen - Down, Unreachable, Recovery
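
One way to make the pre-flight check a habit is to wrap it around the restart. This is only a sketch: safe_nagios_restart is a made-up name, and the config path is the one used in this walkthrough, so adjust it if yours differs.

```shell
# Path to the main config; override NAGIOS_CFG if yours lives elsewhere.
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"

# Hypothetical wrapper: restart Nagios only when the pre-flight check passes.
safe_nagios_restart() {
    if nagios -v "$NAGIOS_CFG" > /dev/null 2>&1; then
        service nagios restart
    else
        echo "Config check failed, not restarting. Details: nagios -v $NAGIOS_CFG" >&2
        return 1
    fi
}
```

With this in your shell, a typo like the dash above leaves the running service alone instead of taking it down.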

Tell the Main Config to Include Your New Config files

vim /etc/nagios/nagios.cfg
cfg_file=/etc/nagios/objects/hosts.cfg
Let’s see the new config!
service nagios restart
Now everyone else go and set up a single host check. Use the sofree.us site or another one of your favorites.

Install NRPE on a separate host

Disable SELinux

setenforce 0
yum install epel-release wget gcc openssl-devel
cd /tmp
wget http://nagios-plugins.org/download/nagios-plugins-2.2.1.tar.gz
tar -xzf nagios-plugins-2.2.1.tar.gz
cd nagios-plugins-2.2.1
./configure
make
make install
yum install xinetd
cd ..
wget https://github.com/NagiosEnterprises/nrpe/releases/download/nrpe-3.2.1/nrpe-3.2.1.tar.gz
tar -xzf nrpe-3.2.1.tar.gz
cd nrpe-nrpe-3.2.1
./configure
make all
make install-groups-users
chown -R nagios.nagios /usr/local/nagios
make install
make install-config
make install-init
service xinetd restart
chkconfig nrpe on && service nrpe start
Now you need to allow the Nagios server to access NRPE plugins
vim /usr/local/nagios/etc/nrpe.cfg
allowed_hosts=<ip addr of server>
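
If you would rather script that edit than open vim on every host, a sed one-liner can do it. A sketch, assuming nrpe.cfg already contains an allowed_hosts= line; set_allowed_hosts is an invented name.

```shell
# Hypothetical helper: point allowed_hosts at the Nagios server.
# Arguments: path to nrpe.cfg, IP address of the Nagios server.
set_allowed_hosts() {
    cfg="$1"
    server_ip="$2"
    # Keep 127.0.0.1 so a local check_nrpe test still works.
    # A .bak copy of the original file is left next to it.
    sed -i.bak "s/^allowed_hosts=.*/allowed_hosts=127.0.0.1,${server_ip}/" "$cfg"
}
```

For example: set_allowed_hosts /usr/local/nagios/etc/nrpe.cfg 192.0.2.10, followed by a restart of nrpe. The IP here is a placeholder.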

Verify NRPE is running

/usr/local/nagios/libexec/check_nrpe -H localhost
Disable IPV6 (not necessary if compiling from source and adding different flags)
vim /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
sysctl -p
service nrpe restart

Let’s look through the different plugins

ls /usr/local/nagios/libexec

Install NRPE on Nagios Server

wget https://github.com/NagiosEnterprises/nrpe/releases/download/nrpe-3.2.1/nrpe-3.2.1.tar.gz
tar -xzf nrpe-3.2.1.tar.gz
cd nrpe-nrpe-3.2.1
./configure
make check_nrpe
make install-plugin
Verify that server can run commands against your defined host
/usr/local/nagios/libexec/check_nrpe -H <ip addr of host> -4 -c check_load
Add services.cfg to the main nagios config
vim /etc/nagios/nagios.cfg
cfg_file=/etc/nagios/objects/services.cfg
Create services.cfg
vim services.cfg
define service {
   use sofree-service
   host_name nrpe_test
   service_description check_load
   check_command check_nrpe!check_load
}
define service {
   use sofree-service
   host_name nrpe_test
   service_description check_xvda1
   check_command check_nrpe!check_hda1
}

NRPE  commands need to be defined in 3 places

1) On server –> services.cfg, or other .cfg file

check_nrpe!check_load
check_nrpe!check_hda1

2) On server –> commands.cfg

define command {
 command_name check_nrpe
 command_line $USER1$/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$
}

3) On host –> /etc/nagios/nrpe.cfg || /usr/local/nagios/etc/nrpe.cfg

On host machine, match up the command with the argument you’re passing

command[check_load]=/usr/local/nagios/libexec/check_load -r -w .15,.10,.05 -c .30,.25,.20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/xvda1
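
Because the names have to line up across files, a quick cross-check can save some head scratching. The sketch below lists every command that services.cfg requests through check_nrpe but that has no command[...] entry in nrpe.cfg. The function name is made up. Since the two files live on different machines, it assumes you have copied the host's nrpe.cfg somewhere you can read it.

```shell
# Hypothetical check: list NRPE commands that services.cfg asks for
# but that have no matching command[...] line in nrpe.cfg.
# Arguments: path to services.cfg, path to (a copy of) the host's nrpe.cfg.
missing_nrpe_commands() {
    services_cfg="$1"
    nrpe_cfg="$2"
    grep -o 'check_nrpe![A-Za-z0-9_-]*' "$services_cfg" \
        | cut -d'!' -f2 \
        | sort -u \
        | while read -r cmd; do
            grep -q "^command\[$cmd\]=" "$nrpe_cfg" || echo "$cmd"
        done
}
```

No output means every check_nrpe!name in services.cfg has a counterpart on the host.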

Now you go and define a couple services on your host

Logs are located at /var/log/nagios/nagios.log

Additional material:

• nagios_email_ack

• Nagdash

• Converting epoch time:

cat /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'

• Nagios dynamic

• Custom commands (plugins, external scripts etc). API calls are a great use of external scripts.


Rsyslog to clean up /var/log/messages

Situation: Not all applications and utilities have been developed to log their messages intelligently or to their own location.

Complication: This can lead to logging bloat, especially for messages that are purely informational and noisy. /var/log/messages can be overwhelmed by this and can make it harder to figure out actual problems.

Question: How can you redirect messages to clean up /var/log/messages?

Answer: rsyslog to the rescue!

 

For the purposes of this post I will be analyzing the program amazon-ssm-agent. This is an Amazon proprietary program necessary to run AWS Run Command. It is a good example because it was developed to have its own log file, but it still fills /var/log/messages. We will:

  1. Go through the workflow of syslog, systemd and how messages get into logs
  2. Look at messages in /var/log/messages that need to be filtered
  3. Configure rsyslog to send all of the application's logs to its own file
  4. Use logrotate to create a good policy for how long logs stay around
  5. Celebrate clean logs
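
As a taste of step 3, here is a minimal sketch of an rsyslog drop-in that diverts one program's messages. Several things are assumptions: that the agent logs under the program name amazon-ssm-agent, that your rsyslog.conf includes /etc/rsyslog.d/*.conf before the rule that writes /var/log/messages, and that your rsyslog is new enough to understand "& stop" (older releases use "& ~" instead). The helper name is invented.

```shell
# Hypothetical helper: write an rsyslog drop-in that diverts one program's
# messages to its own file.
# Arguments: program name, log file, optional config directory.
write_rsyslog_rule() {
    program="$1"
    logfile="$2"
    confdir="${3:-/etc/rsyslog.d}"
    cat > "$confdir/$program.conf" <<EOF
# Send everything logged by $program to $logfile ...
:programname, isequal, "$program" $logfile
# ... and stop processing so it never reaches /var/log/messages.
& stop
EOF
}
```

For example: write_rsyslog_rule amazon-ssm-agent /var/log/amazon-ssm-agent-syslog.log, then restart rsyslog. The log file name is just a suggestion.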

 

Linux tools to parse through files

Situation: You have a file full of results that you need to work through: singling out entries, adding to or subtracting from fields, comparing entries, and more

Complication: There are a TON of Linux tools to do this

Question: What can do what?

Answer: I will add to this post with specific examples consistently

 

Example 1: Search through a file list and return only the filename, not the full path

Let’s say you want to just look for missing files in a non-similar directory structure between multiple file lists. One way is to look through each file without the file path and get just the file name. Let’s say filelist1 has this list:

/tmp/derp/directoryone/file1
/tmp/derp/directoryone/file2
/tmp/derp/directoryone/file3
/tmp/derp/directoryone/file4

Let’s now say filelist2 has this list:

/tmp/der/directorytwo/file1
/tmp/der/directorytwo/file2
/tmp/der/directorytwo/file3

In this case you want to just figure out that you're missing file4, so that you could sync just that file and not all the contents of directoryone. To do this you could use cat to show the file contents, and pipe it to cut to get just the filename

cat filelist1 | cut -d/ -f5

The -d/ option says to use / as the delimiter, and -f5 says to return the 5th field. Each path starts with /, so the first field is empty and the filename ends up as the 5th

This would return

file1
file2
file3
file4

You could then send it to a new file that has just the filenames with >

cat filelist1 | cut -d/ -f5 > filelist1_clean
cat filelist2 | cut -d/ -f5 > filelist2_clean

You can now easily find just the files missing by running

diff filelist1_clean filelist2_clean

You would then get

4d3
< file4

which says that line 4 of the first list (file4) has no counterpart in the second

Now you can make your sync off of this
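
The cut approach depends on every path having the same depth. As a sketch of a depth-independent variant, the helper below strips everything up to the last / with sed and then uses comm to print only the names missing from the second list. The function name missing_files is made up for this example.

```shell
# Hypothetical helper: print file names that appear in the first list
# but not in the second, ignoring the directory part of each path.
missing_files() {
    a=$(mktemp)
    b=$(mktemp)
    # Strip everything up to the last "/" so path depth does not matter.
    sed 's#.*/##' "$1" | sort -u > "$a"
    sed 's#.*/##' "$2" | sort -u > "$b"
    # comm -23 keeps lines unique to the first (sorted) input.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

With the two lists above, missing_files filelist1 filelist2 prints file4.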

Nagios acknowledge through email

Situation:

Nagios is a widely used alerting system

Complication:

Sometimes you’re out to dinner and get an alert that is not immediately actionable until you finish dessert

Question:

Can you ack the alert without having to patch in and ack through the nagios UI?

Answer:

Yes! You can ack alerts with a simple email reply containing the word “ACK”

Avleen Vig wrote a great python script to poll the nagios inbox, parse the alert info and acknowledge the problem if ACK is in the message

 

  1. Install Nagios
  2. The base nagios install does not include a home directory and login for the nagios user, so create it manually
    mkdir /home/nagios
  3. Create IMAP inbox for nagios to use (for both sending and receiving). This can be done through Gmail or any other IMAP server you have access to
  4. Copy Avleen’s script from Github to /home/nagios/
  5. chmod the script to be executable
    chmod 760 nagios_email_handler.py
  6. Edit nagios_email_handler.py to match the nagios CMD file that is in your environment
    CMD_FILE = '/usr/local/nagios/var/rw/nagios.cmd'

    OR

    CMD_FILE = '/etc/nagios/var/rw/nagios.cmd'
  7. Put in your IMAP information into the script
     # IMAP server, username and password
     IMAP_SERVER = 'imap.example.com'
     IMAP_USER = 'imapuser@example.com'
     IMAP_PASS = "imap_password"
  8. If your host names in Nagios are longer than ~15 characters, then Gmail (and potentially others) will automatically make a new line to account for that, even though the Subject line is 1 line. Get around this by adding the ability to handle new lines within the script with \n at the end of ACK
    if alert_class == 'Host':
        msg = '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;1;%s;ACK\n' % \
            (now, server, fromaddr)
    elif alert_class == 'Service':
        msg = '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;1;1;%s;ACK\n' % \
            (now, server, service, fromaddr)
    open(CMD_FILE, 'w').write(msg)
    LOGGER.info('ACKed alert: From: %s, Host: %s, Service: %s\n' % \
        (fromaddr, server, service))
  9. Cron the script to run every minute to search for new acknowledgements
    crontab -e
    SHELL=/bin/bash
    * * * * * /usr/bin/python $HOME/nagios_email_handler.py >> /var/log/nagios/email_ack.log 2>&1
  10. Test by purposefully getting nagios to alert, and then respond with an email with just the contents “ACK”. Look in /var/log/nagios/email_ack.log. Make sure the information is getting parsed correctly. You should see something like this:
    Service, user@example.com, hostname, disk_usage, ack
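
To see what the script ends up doing, you can write the same external command to the Nagios command file by hand. This is a sketch, not part of Avleen's script: the helper names are invented, and the default CMD_FILE path is one of the two locations mentioned in step 6.

```shell
# Path to the Nagios command file; override CMD_FILE to match your install.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Hypothetical helper: acknowledge a host problem.
# Arguments: host name, author, optional comment.
ack_host() {
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;1;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "${3:-ACK}" > "$CMD_FILE"
}

# Hypothetical helper: acknowledge a service problem.
# Arguments: host name, service description, author, optional comment.
ack_service() {
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;1;1;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "${4:-ACK}" > "$CMD_FILE"
}
```

Running ack_service hostname disk_usage user@example.com as a user allowed to write to the command file is a handy way to confirm that acknowledgements work before debugging the email side.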

One-liner for largest files/folders

Do you ever find yourself wanting to know what the deuce is taking up all your space? Here’s a simple one-liner to figure out the top 20 folders eating your disk:

du -ha <folder_location*> | sort -h | tail -n20

du is the disk usage utility that teamed with sort and tail can be very valuable. Usually you would want to make the <folder_location> be at the root of the volume (/) but maybe you have areas you don’t want searched. Just include the exclude!

du -ha /* --exclude "/home/cookie.monster/Videos" | sort -h | tail -n20

This will work for Linux, but if you’re on OSX the built-in sort may not support the -h option. To get GNU sort, first install brew, which is a package manager for OSX:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Here’s the official Brew site with context on the install
Once you’ve got Brew, you can install the GNU core utilities
brew install coreutils

You can now use gsort -h on OSX in place of sort -h on Linux with the same results

ADCLI on CentOS 6

If you are trying to bind a Linux machine to Active Directory, a very simple tool to use is ADCLI. You can use it to join servers, query AD, and also add/delete objects. It is not currently available through the yum repos, however. Here’s how to get past that:

  1. Download http://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/adcli-0.8.1-1.el6.src.rpm
  2. You’ll need to rebuild the source rpm like this:
    1. rpmbuild --rebuild adcli-0.8.1-1.el6.src.rpm
  3. You may end up with needing other dependencies such as openldap-devel and xmlto (and potentially others). You can install these through yum:
    1. yum install -y openldap-devel xmlto
  4. To bind a new server:
    1. adcli join -U <ADadminUser> <domain>
    2. Enter password
  5. Verify the bind was a success by querying AD:
    1. adcli info <domain>

You should now be able to login with any AD user on that machine and do application level Active Directory integration

AWS Run Command

Amazon Web Services recently came out with a new feature called “Run Command”. If you have instances in AWS it allows you to send a set of commands to a subset (or all) of your instances, with the ability for extended logging of the output sent to an S3 bucket, if you wish.

People will sometimes use tools such as Puppet to send a new system configuration that may only be a single command, such as a systemctl enable command. But the AWS Run Command will let you do this without having to create a Puppet module for a small set of commands.

Pre-requisites

  • amazon-ssm-agent must be running.
  • An IAM role that allows the agent to talk to SSM must be attached to the instance

IAM Role

AWS Identity and Access Management has a default SSM policy which you can apply to your instances if you just search for it in IAM:

Screen Shot 2016-04-20 at 12.34.13 AM.png

After you have applied the AmazonEC2RoleforSSM you need to get the amazon-ssm-agent service running. You can bake this into whichever AMI baking tool you are using, or configuration management (Chef, Puppet etc). But to test you can set it up like this for Linux instances upon creation or in a script run by your configuration management software:

cd /tmp 
curl https://amazon-ssm-<your-region>.s3.amazonaws.com/latest/linux_amd64/amazon-ssm-agent.rpm -o amazon-ssm-agent.rpm
yum -y install amazon-ssm-agent.rpm

Start the service for CentOS 7.x using:

sudo systemctl start amazon-ssm-agent

For CentOS 6.x use:

sudo start amazon-ssm-agent

Once amazon-ssm-agent is running, then you can issue commands to your instances. Here’s the basics on how to do it:

In AWS Console –> EC2 –> Commands

Screen Shot 2016-04-19 at 11.35.52 AM

Then select the type of Command you wish to run, for example a simple Shell Script:

Screen Shot 2016-04-19 at 11.28.30 AM

You can then choose to send the full output of your commands to an S3 bucket:

Screen Shot 2016-04-19 at 11.45.14 AM

Once the command is run you can view the output by clicking “View result”:

Screen Shot 2016-04-19 at 11.42.00 AM
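
The same thing can be done from the AWS CLI instead of the console. The wrapper below is a sketch: run_on_instance is a made-up name, and the instance ID and bucket in the example call are placeholders. It uses the AWS-RunShellScript document and the shorthand form of --parameters, which is fine for a simple command but will split on commas, so switch to the JSON form for anything more complex.

```shell
# Hypothetical wrapper around the AWS CLI: run one shell command on one
# instance and, optionally, send the full output to an S3 bucket.
# Arguments: instance ID, command string, optional S3 bucket name.
run_on_instance() {
    instance_id="$1"
    command="$2"
    bucket="${3:-}"
    if [ -n "$bucket" ]; then
        aws ssm send-command \
            --document-name "AWS-RunShellScript" \
            --instance-ids "$instance_id" \
            --parameters "commands=$command" \
            --output-s3-bucket-name "$bucket"
    else
        aws ssm send-command \
            --document-name "AWS-RunShellScript" \
            --instance-ids "$instance_id" \
            --parameters "commands=$command"
    fi
}
```

For example: run_on_instance i-0123456789abcdef0 "uptime" my-log-bucket. The call returns a command ID that you can look up later, either in the console or with aws ssm list-command-invocations.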

 

The AWS Run Command is extremely powerful for one-off commands to be run without the overhead and delay of a configuration management tool. The amazon-ssm-agent takes a little extra time to set up, but once you have it started you can utilize the Run Command at no cost other than minimal cross-region traffic costs.