I’ve become a huge fan of Ansible’s async support. It is incredibly useful for running synthetic health tests against services after a task completes. A common use case is patching a server that hosts one or more applications. Ideally, you want to gracefully stop the application, update packages, reboot, and then test the application to ensure it’s healthy. Updating the system and scheduling the reboot is easy to accomplish with serial, the reboot module, and a sentinel creation script:
- hosts: "{{ server_list }}"
  become: true
  serial: 1
  tasks:
    - name: Create the restart sentinel script
      copy:
        src: ../files/restart-sentinel
        dest: /usr/local/bin/restart-sentinel
        mode: '0700'
        owner: 'root'
        group: 'root'

    - name: Upgrade Operating System packages
      yum:
        name: '*'
        state: latest
      register: yum_updates_applied

    - name: Run the create-sentinel script to see if an update is required
      command: /usr/local/bin/restart-sentinel
      when: yum_updates_applied.changed

    - name: Check to see if the restart sentinel file exists
      stat:
        path: /run/reboot-required
      register: sentinel_exists

    - name: Gracefully stop app1
      systemd:
        name: app1
        state: stopped
      when: sentinel_exists.stat.exists

    - name: Check to make sure app1 exited
      shell: |
        while true; do
          # Quote the pattern so the shell cannot expand it as a glob
          count=$(ps auxw | grep -c '[a]pp1')
          if [[ $count -ne 0 ]]; then
            echo "App1 is still running"
            sleep 1
          else
            echo "All app1 processes are down"
            exit 0
          fi
        done
      # Allow the loop up to 600 seconds; check on it every 5 seconds
      async: 600
      poll: 5
      when: sentinel_exists.stat.exists

    - name: Reboot the server to pick up a kernel update
      reboot:
        reboot_timeout: 600
      when: sentinel_exists.stat.exists
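In the tasks above, async caps how long a task may run and poll sets how often Ansible checks on it. As a rough sketch of the other way to use async, the wait for app1 could be started in the background with poll: 0 and collected later with the async_status module, which may be handy if there is other work to do in between. The registered variable names app1_wait and app1_wait_result are hypothetical, and the retries and delay values are simply chosen to add up to roughly the same 600 seconds:

```yaml
    - name: Start waiting for app1 to exit without blocking the play
      shell: |
        # Loop until no process matching app1 is left
        while [ "$(ps auxw | grep -c '[a]pp1')" -ne 0 ]; do
          sleep 1
        done
      async: 600
      poll: 0
      register: app1_wait
      when: sentinel_exists.stat.exists

    - name: Collect the result of the background wait
      async_status:
        jid: "{{ app1_wait.ansible_job_id }}"
      register: app1_wait_result
      until: app1_wait_result.finished
      # 120 retries x 5 second delay is about 600 seconds (assumed values)
      retries: 120
      delay: 5
      when: sentinel_exists.stat.exists
```

With poll: 5 Ansible blocks on the task until it finishes or times out, while poll: 0 returns immediately and leaves it to a later async_status task to decide how long to wait.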
Once the server reboots, how do you know your application is healthy? If an application fails to start, I want my playbook to fail immediately. There are a couple of ways to accomplish this. One is to search the application logs for a health string with the wait_for module:
    - name: Checking the application logs for the healthy keyword
      wait_for:
        path: /var/log/app1/app.log
        search_regex: "healthy"
        state: present
      changed_when: False
      when: sentinel_exists.stat.exists
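The task above relies on the default wait_for timeout of 300 seconds. If you want the playbook to give up sooner, or to fail with a clearer error, a variation along these lines might work. The 120 second value and the message text are assumptions for illustration, not recommendations:

```yaml
    - name: Checking the application logs for the healthy keyword
      wait_for:
        path: /var/log/app1/app.log
        search_regex: "healthy"
        state: present
        # Assumed value; pick something that fits your application's startup time
        timeout: 120
        # Shown instead of the default error if the string never appears
        msg: "app1 did not log a healthy status within 120 seconds"
      changed_when: False
      when: sentinel_exists.stat.exists
```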
To ensure this works correctly, you need to be sure the logs only contain entries written since the host was rebooted. Tools like logrotate can help with this. You can take this one step further and run an application health check command:
    - name: Check to make sure app1 is happy
      shell: |
        while true; do
          /usr/local/bin/test-my-app
          RC=$?
          # test-my-app returns 1 when healthy, 0 otherwise.
          if [[ $RC -eq 0 ]]; then
            echo "Waiting for app1 to become healthy"
            sleep 1
          else
            echo "App1 is reporting a healthy status"
            exit 0
          fi
        done
      async: 600
      poll: 5
      when: sentinel_exists.stat.exists
The test-my-app command is invoked in a loop, and the task succeeds as soon as it reports a healthy status. If that doesn’t happen before the async timeout expires (600 seconds in the example above), the task fails. These features are incredibly powerful, and have helped set my mind at ease when performing complex updates in production. If you think this material is useful, you may be interested in my article on using the Ansible uri module to test web services during playbook execution.
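The same health check can also be written without a shell loop by using Ansible's until, retries and delay keywords. This is only a sketch of that alternative, not a replacement for the async version: the variable name app1_health is hypothetical, the retry values are chosen to approximate the 600 second window, and it keeps the convention from the example above that test-my-app exits with 1 when the application is healthy:

```yaml
    - name: Check to make sure app1 is happy
      command: /usr/local/bin/test-my-app
      register: app1_health
      # Per the comment in the shell example, exit code 1 means healthy
      until: app1_health.rc == 1
      # 120 retries x 5 second delay is about 600 seconds (assumed values)
      retries: 120
      delay: 5
      # Without this, the healthy exit code of 1 would be treated as a failure
      failed_when: app1_health.rc != 1
      changed_when: false
      when: sentinel_exists.stat.exists
```

The trade-off is that each attempt is a separate module invocation over the connection, whereas the async loop runs entirely on the managed host and Ansible only polls for its result.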