I was doing some testing this morning on a Fedora 25 host and received the following error when I tried to execute a playbook:
$ ansible-playbook --ask-become-pass -l tbone playbooks/system-base.yml
PLAY [all] ********************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************
ok: [tbone]
TASK [system-common : upgrade all packages] ***********************************************************************
fatal: [tbone]: FAILED! => {"changed": false, "failed": true, "msg": "python2 yum module is needed for this module"}
to retry, use: --limit @/ansible/playbooks/system-base.retry
PLAY RECAP ********************************************************************************************************
tbone : ok=1 changed=0 unreachable=0 failed=1
To see what ansible was doing I set the ANSIBLE_KEEP_REMOTE_FILES environment variable, which keeps the AnsiballZ module files on the remote host (this is super useful for debugging problems). After reviewing the files in the temporary task directory I noticed that the playbook had a task to install a specific version of a package with yum. Newer Fedora releases use dnf instead of yum, so the Python 2 yum bindings the ansible yum module relies on aren't installed, hence the "python2 yum module" error.
There are a couple of ways to fix this. The ideal way is to use the package module instead of calling yum or dnf directly (or to check the OS release and use the ansible dnf module instead of yum). If you need a quick fix you can shell out from your playbook and install python2-dnf prior to gathering facts. If you need an even quicker fix you can install the package by hand:
$ dnf -y install python2-dnf
I’m currently using the package module and it works like a champ.
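For reference, here's a rough sketch of both approaches. The task names and play layout below are my own illustration rather than a copy of my actual role:

# The package module picks dnf or yum automatically based on the target host.
- name: upgrade all packages
  package:
    name: "*"
    state: latest

And the bootstrap variant that installs the Python 2 dnf bindings before facts are gathered:

- hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Install the python2 dnf bindings so the ansible package/dnf modules work
      raw: dnf -y install python2-dnf

    - name: Gather facts now that the bindings are present
      setup: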
When developing ansible playbooks and roles it’s extremely useful to be able to see all of the variables available to you. This is super easy with the ansible setup and debug modules:
List all of the vars available to the host:
$ ansible haproxy01. -m setup
Retrieve all of the groups from the inventory file:
$ ansible haproxy01. -m debug -a "var=groups"
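The same information is available from inside a playbook with a couple of debug tasks. A quick sketch (the task names are arbitrary):

- name: Show everything ansible knows about the current host
  debug:
    var: hostvars[inventory_hostname]

- name: Show the groups defined in the inventory
  debug:
    var: groups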
Lester Wade took this a step further and wrote a great blog entry that describes how to dump the contents of the vars, environment, group_names, hostvars and group variables to a file. If you run his example you will get a nicely formatted text file in /tmp/ansible.all
Module Variables ("vars"):
--------------------------------
{
    "ansible_all_ipv4_addresses": [
        "192.168.1.122",
        "192.168.1.124"
    ],
    "ansible_all_ipv6_addresses": [
        "fe80::250:56ff:fe8f:b8ad"
    ],
    "ansible_apparmor": {
        "status": "disabled"
    },
This file is a great reference and kudos to Lester for the amazing work!
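If you want something similar without pulling in his full example, a copy task with templated content gets you most of the way there. This is my own quick sketch rather than Lester's playbook, and the header text and output path are just placeholders:

- name: Dump everything ansible knows about this host to a file
  copy:
    content: |
      Host Variables ("hostvars"):
      --------------------------------
      {{ hostvars[inventory_hostname] | to_nice_json }}
    dest: /tmp/ansible.all
  # Write the file on the control machine; drop this if you want it on the managed host.
  delegate_to: localhost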
In a previous post I discussed how one of my systemd services was getting continuously restarted, causing the CPU to spike. This isn't ideal, and after re-reading the systemd.service manual page I came across a couple of useful options to control when and how frequently a systemd service will restart. The first option is RestartSec, which controls how long systemd waits before restarting a process after a failure. Systemd also has RestartForceExitStatus and RestartPreventExitStatus, which let you list the exit codes or signals that should (or should not) trigger a restart. To ensure that logstash isn't continuously restarted when a configuration issue is present I set RestartPreventExitStatus to 1. Abnormal terminations from signals like SIGSEGV, SIGABRT, etc. will still trigger a restart, but a return code of 1 will not.
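Here's roughly how that looks as a systemd drop-in pushed out with ansible. The drop-in path, the 30 second RestartSec value and the explicit daemon-reload task are illustrative choices on my part, not the exact change I made:

- name: Create the drop-in directory for the logstash unit
  file:
    path: /etc/systemd/system/logstash.service.d
    state: directory
    mode: "0755"

- name: Keep logstash from restart-looping on configuration errors
  copy:
    dest: /etc/systemd/system/logstash.service.d/restart.conf
    content: |
      [Service]
      # Wait 30 seconds between restart attempts (assumed value).
      RestartSec=30
      # Don't restart when the process exits with status 1.
      RestartPreventExitStatus=1
    mode: "0644"

- name: Reload systemd so it picks up the new unit settings
  systemd:
    daemon_reload: yes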
I'm a heavy user of the Logstash geoip filter, which utilizes the GeoLite database. To keep up to date with the latest mappings I updated my logstash ansible role to check the current database and retrieve a new one if it's older than a certain number of days. This was super easy to do with ansible. To get started I defined a couple of variables in group_vars:
geoip_directory: "/elk/logstash/geoip"
geoip_source: http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
geoip_upgrade_days: 30
These variables define the location to put the geoip database, the URL to the latest database and how often to update the file. To check if the file is outdated I used the stat module's mtime attribute along with a when conditional:
- name: Get the compressed GeoIP database file name
  set_fact: geoip_compressed_file_name="{{ geoip_source | basename }}"

- name: Get the uncompressed GeoIP database file name
  set_fact: geoip_uncompressed_file_name="{{ geoip_compressed_file_name | replace('.gz', '') }}"

- name: "Retrieving file stat data from {{ geoip_directory }}/{{ geoip_uncompressed_file_name }}"
  stat:
    path: "{{ geoip_directory }}/{{ geoip_uncompressed_file_name }}"
  register: stat_results

- name: "Download the latest GeoIP database from {{ geoip_source }}"
  get_url:
    url: "{{ geoip_source }}"
    dest: "{{ geoip_directory }}"
    mode: "0600"
  when: ((ansible_date_time.epoch|int - stat_results.stat.mtime) > (geoip_upgrade_days * 60 * 60 * 24))
  register: downloaded_geoip_file

- name: "Uncompressing the GeoIP file {{ geoip_directory }}/{{ geoip_compressed_file_name }}"
  shell: gunzip -f "{{ geoip_directory }}/{{ geoip_compressed_file_name }}"
  when: downloaded_geoip_file.changed
I still need to add a couple of checks to deal with edge conditions, but this is definitely a step up from what I was doing previously. Viva la ansible!
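One of those edge conditions is the very first run, when the uncompressed database doesn't exist yet and stat_results.stat.mtime is undefined. A sketch of how the download task's conditional could be guarded (this check isn't in the role yet):

- name: "Download the latest GeoIP database from {{ geoip_source }}"
  get_url:
    url: "{{ geoip_source }}"
    dest: "{{ geoip_directory }}"
    mode: "0600"
  # Download if the database is missing or older than geoip_upgrade_days days.
  when: >
    (not stat_results.stat.exists) or
    ((ansible_date_time.epoch|int - stat_results.stat.mtime) > (geoip_upgrade_days * 60 * 60 * 24))
  register: downloaded_geoip_file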
This week I was converting some legacy shell scripts to ansible roles and ran into a bizarre issue with one of my elasticsearch servers. After I committed a couple of changes my CI system rejected the commit due to a system resource issue. When I logged into the system to troubleshoot I noticed the CPU was pegged:
$ mpstat 1
Linux 3.10.0-514.26.2.el7.x86_64 (elastic02) 09/02/2017 _x86_64_ (2 CPU)
11:32:59 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:33:00 AM all 94.03 0.00 3.98 0.00 0.00 0.00 0.00 0.00 0.00 1.99
11:33:01 AM all 94.00 0.00 3.50 0.00 0.00 0.00 0.00 0.00 0.00 2.50
11:33:02 AM all 93.50 0.00 5.50 0.00 0.00 0.00 0.00 0.00 0.00 1.00
This system is used solely to test changes so it should have been 100% idle. Htop showed the Logstash java process as the top CPU consumer so I ventured off to the log directory to see if anything was awry. Sure enough, Logstash was complaining about a missing file:
[2017-09-02T11:37:43,884][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Something is wrong with your configuration."}
[2017-09-02T11:37:52,529][ERROR][logstash.filters.geoip ] Invalid setting for geoip filter plugin:
filter {
geoip {
# This setting must be a path
# File does not exist or cannot be opened /elk/logstash/geoip/GeoLite2-City.mmdb
database => "/elk/logstash/geoip/GeoLite2-City.mmdb"
...
}
}
One of my commits automated the GeoIP database download process. This subtle change contained a bug (more on that in a future post) which was preventing the file from being downloaded and extracted. By default, logstash should exit if it detects a configuration issue; it shouldn't spin on a CPU. To see what was going on I fired up my good friend strace and attempted to attach to the PID I retrieved from ps:
$ strace -f -p 4288
strace: attach: ptrace(PTRACE_ATTACH, ...): No such process
Now that is odd. When I ran ps again and looked closer at the PID column I noticed that the java process had a new PID. I ran it a second time and the PID changed again. So it appeared that logstash was dying and then being restarted over and over. The Logstash service on this machine is managed by systemd, and it immediately dawned on me that the "Restart=on-failure" unit file directive (which restarts a failed process) was the cause of my troubles! After a bit of troubleshooting I located the problem with my ansible role, corrected it, and lo and behold the system load returned to normal. Bugs and cascading failures happen, and this problem never made its way to production because my automated test suite let me know that something was wrong. Thank you, Jenkins!