Using Ansible's async support to perform synthetic health tests during playbook runs


I’ve become a huge fan of Ansible’s async support. It is incredibly useful for performing synthetic health tests on services after a task completes. A common use case is patching a server that hosts one or more applications. Ideally, you want to gracefully stop your application, update packages, reboot, and then test the application to ensure it’s healthy. Updating the system and scheduling the reboot is easy to accomplish with serial, the reboot module, and a sentinel creation script:

- hosts: "{{ server_list }}"
  become: true
  serial: 1
  tasks:
    - name: Create the restart sentinel script
      copy:
        src:   ../files/restart-sentinel
        dest:  /usr/local/bin/restart-sentinel
        mode:  '0700'
        owner: 'root'
        group: 'root'

    - name: Upgrade Operating System packages
      yum:
        name:  '*'
        state: latest
      register: yum_updates_applied

    - name: Run the restart-sentinel script to see if a reboot is required
      command: /usr/local/bin/restart-sentinel
      when: yum_updates_applied.changed

    - name: Check to see if the restart sentinel file exists
      stat:
        path: /run/reboot-required
      register: sentinel_exists

    - name: Gracefully stop app1
      systemd:
        name:  app1
        state: stopped
      when: sentinel_exists.stat.exists

    - name: Check to make sure app1 exited
      shell: |
        while true; do
            count=$(ps auxw | grep -c '[a]pp1')
            if [ "$count" -ne 0 ]; then
               echo "App1 is still running"
               sleep 1
            else
               echo "All app1 processes are down"
               exit 0
            fi
        done
      async: 600
      poll: 5
      when: sentinel_exists.stat.exists

    - name: Reboot the server to pick up a kernel update
      reboot:
        reboot_timeout: 600
      when: sentinel_exists.stat.exists
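
The restart-sentinel script itself is not shown here. As a rough sketch of the same idea expressed as tasks, assuming a RHEL-family host with yum-utils installed (it provides needs-restarting, which exits 1 with -r when a reboot is needed), something like the following would create the /run/reboot-required file that the stat task looks for. The needs-restarting approach and the reboot_check variable are illustrative assumptions, not the contents of the original script:

```yaml
# Hypothetical stand-in for restart-sentinel; assumes yum-utils is installed.
- name: Ask needs-restarting whether a reboot is required
  command: needs-restarting -r
  register: reboot_check
  # Exit code 0 means no reboot is needed, 1 means a reboot is needed.
  failed_when: reboot_check.rc not in [0, 1]
  changed_when: false

- name: Create the sentinel file when a reboot is required
  file:
    path: /run/reboot-required
    state: touch
    mode: '0644'
  when: reboot_check.rc == 1
```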

Once the server reboots, how do you know your application is healthy? If an application fails to start, I want my playbook to fail immediately. There are a couple of ways to accomplish this. One way is to check the application logs for a health string using wait_for log searching:

    - name: Checking the application logs for the healthy keyword
      wait_for:
        path: /var/log/app1/app.log
        search_regex: "healthy"
        state: present
      changed_when: False
      when: sentinel_exists.stat.exists

To ensure this works correctly, you need to be sure the logs only contain entries from the time the host was rebooted. Tools like logrotate can help with this. You can take this one step further and run an application health check command:

    - name: Check to make sure app1 is happy
      shell: |
        while true; do
            /usr/local/bin/test-my-app
            RC=$?

            # Note: test-my-app returns 1 when healthy and 0 otherwise,
            # which is the reverse of the usual exit code convention.
            if [ "$RC" -eq 0 ]; then
                echo "Waiting for app1 to become healthy"
                sleep 1
            else
                echo "App1 is reporting a healthy status"
                exit 0
            fi
        done
      async: 600
      poll: 5
      when: sentinel_exists.stat.exists
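
The same check can also be written without a shell loop. As a sketch, assuming test-my-app keeps the exit code convention described above (1 when healthy), Ansible's until, retries, and delay keywords rerun the command until it reports a healthy status. The app1_health variable name is chosen purely for illustration:

```yaml
- name: Check to make sure app1 is happy
  command: /usr/local/bin/test-my-app
  register: app1_health
  # Treat exit code 1 (healthy, per the convention above) as success.
  failed_when: app1_health.rc != 1
  until: app1_health.rc == 1
  # 120 attempts, 5 seconds apart: roughly the same 600 second budget.
  retries: 120
  delay: 5
  changed_when: false
  when: sentinel_exists.stat.exists
```

The trade-off is that each retry is a separate command invocation driven from the control node, whereas the async loop runs entirely on the managed host.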

The test-my-app command is invoked in a loop and will hopefully report a healthy status before the async timeout (600 seconds in the example above) expires. If it does not, the task fails. These features are incredibly powerful, and have helped set my mind at ease when performing complex updates in production. If you think this material is useful, you may be interested in my article on using the Ansible uri module to test web services during playbook execution.
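
For reference, async: 600 with poll: 5 tells Ansible to run the task in the background, check on it every 5 seconds, and fail it if it has not finished within 600 seconds. A minimal sketch of the fire-and-forget variant, where poll: 0 starts the job and a later async_status task collects the result, might look like the following. The wait-for-app1 wrapper script and the app1_health_job and app1_health_result variable names are hypothetical:

```yaml
- name: Start the app1 health check loop in the background
  # Hypothetical wrapper script containing the same loop as the shell task.
  command: /usr/local/bin/wait-for-app1
  async: 600
  poll: 0
  register: app1_health_job

# Other tasks can run here while the health check is in progress.

- name: Collect the result of the health check
  async_status:
    jid: "{{ app1_health_job.ansible_job_id }}"
  register: app1_health_result
  until: app1_health_result.finished
  retries: 120
  delay: 5
```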

This article was posted on 2020-04-11 01:00:00 -0500