Conditionally restarting systemd services

In a previous post I discussed how one of my systemd services was being continuously restarted, causing the CPU to spike. This isn’t ideal, and after re-reading the systemd manual pages I came across a couple of useful options to control when and how frequently a systemd service will restart. The first option is RestartSec, which controls how long systemd will wait to restart a process after a failure occurs. Systemd also has RestartForceExitStatus and RestartPreventExitStatus, which allow you to define the exit statuses and signals that should or should not cause a restart. To ensure that Logstash isn’t continuously restarted when a configuration issue is present I set RestartPreventExitStatus to 1. Abnormal signals like SIGSEGV, SIGABRT, etc. will still trigger a restart, but an exit code of 1 will not.
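Putting those options together, a drop-in override along these lines captures the behavior described above (the file path and RestartSec value are illustrative, not the exact configuration from my system):

```ini
# /etc/systemd/system/logstash.service.d/restart.conf
[Service]
# Wait 5 seconds before restarting a failed process
RestartSec=5
# Don't restart when the process exits with status 1
# (e.g. a Logstash configuration error); abnormal signals
# like SIGSEGV and SIGABRT will still trigger a restart
RestartPreventExitStatus=1
```

After adding a drop-in you need to run `systemctl daemon-reload` for systemd to pick up the change.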

Updating a file with ansible if the current file is X days old

I’m a heavy user of the Logstash geoip features, which utilize the GeoLite database. To keep up to date with the latest mappings I updated my Logstash ansible role to check the current database and retrieve a new one if it’s older than a certain number of days. This was super easy to do with ansible. To get started I defined a couple of variables in group_vars:

geoip_directory: "/elk/logstash/geoip"
geoip_source: http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
geoip_upgrade_days: 30

These variables define the location of the geoip database, the URL of the latest database and how often to update the file. To check if a file is outdated I used the stat module’s mtime attribute along with a when conditional:

- name: Get the GeoIP database file name
  set_fact: geoip_compressed_file_name="{{ geoip_source | basename }}"

- name: Get the uncompressed GeoIP database file name
  set_fact: geoip_uncompressed_file_name="{{ geoip_compressed_file_name | replace('.gz', '') }}"

- name: "Retrieving file stat data from {{ geoip_directory }}/{{ geoip_uncompressed_file_name }}"
  stat:
    path: "{{ geoip_directory }}/{{ geoip_uncompressed_file_name }}"
  register: stat_results

- name: "Download the latest GeoIP database from {{ geoip_source }}"
  get_url:
    url: "{{ geoip_source }}"
    dest: "{{ geoip_directory }}"
    mode: "0600"
  when: ((ansible_date_time.epoch|int - stat_results.stat.mtime) > (geoip_upgrade_days * 60 * 60 * 24))
  register: downloaded_geoip_file

- name: "Uncompressing the GeoIP file {{ geoip_directory }}/{{ geoip_compressed_file_name }}"
  shell: gunzip -f "{{ geoip_directory }}/{{ geoip_compressed_file_name }}"
  when: downloaded_geoip_file.changed
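One edge case worth noting: on a brand-new host the database file doesn’t exist yet, so stat_results.stat.mtime is undefined and the when conditional above will fail. A guard like the following (a sketch, not the exact task from my role) handles the first run:

```yaml
- name: "Download the latest GeoIP database from {{ geoip_source }}"
  get_url:
    url: "{{ geoip_source }}"
    dest: "{{ geoip_directory }}"
    mode: "0600"
  # Download if the file is missing or older than geoip_upgrade_days
  when: not stat_results.stat.exists or
        (ansible_date_time.epoch|int - stat_results.stat.mtime) > (geoip_upgrade_days * 60 * 60 * 24)
  register: downloaded_geoip_file
```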

I still need to add a couple of checks to deal with edge conditions, but this is definitely a step up from what I was doing previously. Viva la ansible!

Troubleshooting a bizarre logstash CPU problem

This week I was converting some legacy shell scripts to ansible roles and wandered into a bizarre issue with one of my elasticsearch servers. After committing a couple of changes my CI system rejected the commit due to a system resource issue. When I logged into the system to troubleshoot the issue I noticed the CPU was pegged:

$ mpstat 1

Linux 3.10.0-514.26.2.el7.x86_64 (elastic02) 	09/02/2017 	_x86_64_	(2 CPU)

11:32:59 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:33:00 AM  all   94.03    0.00    3.98    0.00    0.00    0.00    0.00    0.00    0.00    1.99
11:33:01 AM  all   94.00    0.00    3.50    0.00    0.00    0.00    0.00    0.00    0.00    2.50
11:33:02 AM  all   93.50    0.00    5.50    0.00    0.00    0.00    0.00    0.00    0.00    1.00

This system is used solely to test changes so it should have been 100% idle. Htop showed the Logstash java process as the top CPU consumer so I ventured off to the log directory to see if anything was awry. Sure enough, Logstash was complaining about a missing file:

[2017-09-02T11:37:43,884][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Something is wrong with your configuration."}
[2017-09-02T11:37:52,529][ERROR][logstash.filters.geoip   ] Invalid setting for geoip filter plugin:

  filter {
    geoip {
      # This setting must be a path
      # File does not exist or cannot be opened /elk/logstash/geoip/GeoLite2-City.mmdb
      database => "/elk/logstash/geoip/GeoLite2-City.mmdb"
      ...
    }
  }

One of my commits automated the GeoIP database download process. This subtle change contained a bug (more on that in a future post) which was preventing the file from being downloaded and extracted. By default, Logstash should exit if it detects a configuration issue. It shouldn’t spin on a CPU. To see what was going on I fired up my good friend strace and attempted to attach to the PID I retrieved from ps:

$ strace -f -p 4288

strace: attach: ptrace(PTRACE_ATTACH, ...): No such process

Now that is odd. When I ran ps again and looked closer at the PID column I noticed that the java process had a new PID. I ran it a second time and the PID changed again, so it appeared that Logstash was dying and then being restarted over and over. The Logstash service on this machine is managed by systemd, and it immediately dawned on me that the “Restart=on-failure” unit file directive (which restarts a failed process) was the cause of my troubles! After a bit of troubleshooting I located the problem with my ansible role, corrected it, and lo and behold the system load returned to normal. Bugs and cascading failures happen, and this problem never made its way to production because my automated test suite let me know that something was wrong. Thank you jenkins!
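Besides filtering on exit status, systemd can also rate-limit restarts so a crash loop like this one gives up instead of pegging a CPU. A drop-in along these lines would do it (a sketch; on systemd 219, as shipped with the CentOS 7 kernel shown in the mpstat output, the start-limit settings live in [Service], while newer releases expect StartLimitIntervalSec= in [Unit]):

```ini
# /etc/systemd/system/logstash.service.d/limit-restarts.conf
[Service]
Restart=on-failure
RestartSec=5
# Stop restarting if the service fails 3 times within 60 seconds
StartLimitInterval=60
StartLimitBurst=3
```

With a limit like this in place, a misconfigured Logstash would fail a few times and then stay down until you ran `systemctl reset-failed` and started it again.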