I’ve become a huge fan of Ansible’s async support. It is incredibly useful for performing synthetic health tests on services after a task completes. A common use case is patching a server that hosts one or more applications. Ideally you want to gracefully stop your application, update packages, reboot, and then test the application to ensure it’s healthy. Updating the system and scheduling the reboot is easy to accomplish with serial, the reboot module, and a sentinel-creation script:
hosts: "{{ server_list }}"
become: true
serial: 1
tasks:
- name: Create the restart sentinel script
copy:
src: ../files/restart-sentinel
dest: /usr/local/bin/restart-sentinel
mode: '0700'
owner: 'root'
group: 'root'
- name: Upgrade Operating System packages
yum:
name: '*'
state: latest
register: yum_updates_applied
- name: Run the create-sentinel script to see if an update is required
command: /usr/local/bin/restart-sentinel
when: yum_updates_applied.changed
- name: Check to see if the restart sentinel file exists
stat:
path: /run/reboot-required
register: sentinel_exists
- name: Gracefully stop app1
systemd:
name: app1
state: stopped
when: sentinel_exists.stat.exists
- name: Check to make sure app1 exited
shell: |
while true; do
count=$(ps auxw | grep -c [a]pp1)
if [[ $count -ne 0 ]]; then
echo "App1 is still running"
sleep 1
else
echo "All app1 processes are down"
exit 0
fi
done
async: 600
poll: 5
when: sentinel_exists.stat.exists
- name: Reboot the server to pick up a kernel update
reboot:
reboot_timeout: 600
when: sentinel_exists.stat.exists
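The playbook copies restart-sentinel into place but never shows its contents. A plausible sketch for a RHEL-style host follows; the kernel comparison is just one way to decide a reboot is needed, and the script name and paths simply mirror the playbook above:

#!/bin/bash
# Hypothetical restart-sentinel: create /run/reboot-required when the running
# kernel is older than the newest installed kernel.
latest=$(rpm -q --last kernel | head -1 | awk '{print $1}' | sed 's/^kernel-//')
running=$(uname -r)

if [[ "$latest" != "$running" ]]; then
    touch /run/reboot-required
fi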
Once the server reboots, how do you know your application is healthy? If an application fails to start, I want my playbook to fail immediately. There are a couple of ways to accomplish this. One way is to check the application logs for a health string with the wait_for module’s search_regex option:
- name: Checking the application logs for the healthy keyword
  wait_for:
    path: /var/log/app1/app.log
    search_regex: "healthy"
    state: present
  changed_when: False
  when: sentinel_exists.stat.exists
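One thing to watch out for: this check is only trustworthy if the log contains nothing but entries written since the host came back up. Forcing a rotation with logrotate between the application shutdown and the reboot is one way to guarantee that. The task below is a sketch that assumes an app1 logrotate configuration already exists at /etc/logrotate.d/app1:

- name: Rotate the app1 logs before rebooting
  command: logrotate --force /etc/logrotate.d/app1
  when: sentinel_exists.stat.exists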
You can take this one step further and run an application health check command:
- name: Check to make sure app1 is happy
  shell: |
    while true; do
      /usr/local/bin/test-my-app
      RC=$?
      # test-my-app returns 1 when healthy, 0 otherwise.
      if [[ $RC -eq 0 ]]; then
        echo "Waiting for app1 to become healthy"
        sleep 1
      else
        echo "App1 is reporting a healthy status"
        exit 0
      fi
    done
  async: 600
  poll: 5
  when: sentinel_exists.stat.exists
The test-my-app command is invoked in a loop, and will hopefully report a healthy status before the async timeout (600 seconds in the example above) expires. These features are incredibly powerful, and have helped set my mind at ease when performing complex updates in production. If you find this material useful, you may also be interested in my article on using the Ansible uri module to test web services during playbook execution.
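As a teaser, here is a minimal sketch of that idea. The endpoint, port, and retry counts are made up for illustration; the uri module simply polls the URL until it returns a 200:

- name: Wait for app1 to answer its health check
  uri:
    url: http://localhost:8080/healthz
    status_code: 200
  register: app1_health
  until: app1_health.status == 200
  retries: 60
  delay: 5
  when: sentinel_exists.stat.exists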
This past weekend I spent some time revamping a few playbooks. One of my playbooks was taking a while to run, and I wanted to see how much actual time was spent in each task. Luckily for me, Ansible has a profiling callback plugin to help with this. To enable it, you can add the following directive to the [defaults] section in your ansible.cfg configuration file:
callback_whitelist = profile_tasks
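Newer ansible-core releases prefer the callbacks_enabled spelling for the same setting, so depending on your version the stanza may look like this instead:

[defaults]
callbacks_enabled = profile_tasks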
Future playbook runs will result in a timestamp being printed for each task:
TASK [pihole : Adding port 53 to firewalld] ***********************************
Saturday 11 April 2020 09:22:27 -0400 (0:00:00.080) 0:00:03.630 ********
ok: [127.0.0.1]
And a global summary will be produced once the playbook completes:
Saturday 11 April 2020 09:22:28 -0400 (0:00:00.326) 0:00:04.486 ********
===============================================================================
Gathering Facts --------------------------------------------------------- 1.09s
pihole : Adding port 53 to firewalld ------------------------------------ 0.53s
pihole : Pulling down the pihole docker image --------------------------- 0.44s
pihole : Creating hosts file -------------------------------------------- 0.39s
pihole : Get list of blacklisted domains -------------------------------- 0.35s
.....
Super useful feature, especially when you are trying to shave time off complex playbook runs.
When I first started with Kubernetes, it took me some time to understand two things. The first was how to generate manifests to run my services, which I tackled in a previous blog post. The second was wrapping my head around RBAC policies. Roles, Bindings, Verbs, OH MY! After a bit of research I understood how RBAC worked, but who wants to generate RBAC policy from scratch? Ouch!
Luckily my research turned up an amazing tool, audit2rbac, which can generate RBAC policies from Kubernetes audit logs. This is now my go-to solution for creating initial RBAC policies. When I need to create an RBAC policy, I will spin up a kind cluster with auditing enabled, run the workload, and then process the audit logs with audit2rbac. This gives me an initial RBAC policy, which I can then refine to suit my needs.
Audit2rbac works with Kubernetes audit logs. To enable auditing, you can pass one or more audit flags to the API server. For a test kind cluster, the following flags have served me well:
- --audit-log-format=json
- --audit-policy-file=/etc/kubernetes/pki/policy
- --audit-log-path=-
- --audit-log-maxsize=1
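If you are standing up the cluster with kind, these flags can be wired in through the cluster config. The snippet below is a sketch under a few assumptions: it uses the v1alpha4 kind config format, it stores the policy at /etc/kubernetes/pki/policy (a directory kubeadm already mounts into the API server pod), and it expects an audit-policy.yaml file to live next to the cluster config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Copy the local policy file into the node so the API server can read it.
    extraMounts:
      - hostPath: ./audit-policy.yaml
        containerPath: /etc/kubernetes/pki/policy
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            audit-log-format: json
            audit-policy-file: /etc/kubernetes/pki/policy
            audit-log-path: "-"
            audit-log-maxsize: "1"

A bare-bones audit-policy.yaml that records every request at the Metadata level is enough for audit2rbac to work with:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata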
You will also need to create an audit policy document; if you need something more granular than the bare-bones policy above, this example is a good place to start. Once auditing is enabled, you should see entries similar to the following in the API server audit log (the path to the log is controlled with the --audit-log-path option):
2020-01-28T19:35:45.020478035Z stdout F {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"87f75541-f426-44ed-baeb-c7259ccd4dbf","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/audit-control-plane?timeout=10s","verb":"update","user":{"username":"system:node:audit-control-plane","groups":["system:nodes","system:authenticated"]},"sourceIPs":["172.17.0.2"],"userAgent":"kubelet/v1.16.3 (linux/amd64) kubernetes/b3cbbae","objectRef":{"resource":"leases","namespace":"kube-node-lease","name":"audit-control-plane","uid":"659f94a2-62c9-4d02-8637-e02f50d5945f","apiGroup":"coordination.k8s.io","apiVersion":"v1","resourceVersion":"7839"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2020-01-28T19:35:45.019014Z","stageTimestamp":"2020-01-28T19:35:45.020336Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
To generate an RBAC policy with audit2rbac, you will need to run your service, or invoke one or more kubectl commands to generate audit events. We can run kubectl to see how this process works:
$ kubectl get pod
The kubectl get will cause a number of audit events to be generated. If you are using kind, you can export these logs with the export command:
$ kind export logs /tmp/audit --name audit
Once the logs are exported, we need to remove everything from the events other than the JSON object:
$ cat /tmp/audit/*control*/containers/*api* | grep Event | sed 's/^.* F //' > audit.log
Now that we have a log full of JSON audit events, we can run audit2rbac specifying the user or service account to audit:
$ audit2rbac -f audit.log --user kubernetes-admin
This will produce YAML similar to the following:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    audit2rbac.liggitt.net/version: v0.7.0
  labels:
    audit2rbac.liggitt.net/generated: "true"
    audit2rbac.liggitt.net/user: kubernetes-admin
  name: audit2rbac:kubernetes-admin
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    audit2rbac.liggitt.net/version: v0.7.0
  labels:
    audit2rbac.liggitt.net/generated: "true"
    audit2rbac.liggitt.net/user: kubernetes-admin
  name: audit2rbac:kubernetes-admin
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: audit2rbac:kubernetes-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubernetes-admin
---
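Once you are happy with the generated policy, applying it and spot-checking the result only takes a couple of commands. The file name below is an assumption, and kubectl auth can-i uses impersonation to confirm the binding behaves the way you expect:

$ kubectl apply -f audit2rbac.yaml
$ kubectl auth can-i list pods --as kubernetes-admin -n default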
This is super useful! No more cutting & pasting RBAC YAML to create an initial policy. The YAML that is produced gives you a good understanding of what is needed to restrict access, and it can be adjusted to meet your security requirements. The following YouTube video contains a super cool demo showing what audit2rbac can do:
Definitely worth watching!
In a previous post, I discussed using the Kubernetes external-dns project to manage DNS changes. Prior to rolling it out, I needed a way to back up each zone before external-dns modified it, and I wanted a backup to be taken each time a commit resulted in a DNS change. This turned out to be super easy to do with the aws CLI. To export all of the records in a zone, you will first need to locate the zone Id, which you can get with the “list-hosted-zones” command:
$ aws --profile me route53 list-hosted-zones
{
    "HostedZones": [
        {
            "Id": "/hostedzone/XXXXXXXXXXX",
            "Name": "prefetch.net.",
            "CallerReference": "XXXXXXXX",
            "Config": {
                "Comment": "HostedZone created by Route53 Registrar",
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 2
        }
    ]
}
Once you have the id you can export the records with the “list-resource-record-sets” command:
$ aws --profile me route53 list-resource-record-sets --hosted-zone-id XXXXXXXXXXX
This will produce a JSON object which you can stash in a safe location. If something were to happen to your Route 53 zone, you can wrap the saved records in a change batch and feed them to the “change-resource-record-sets” command to restore the zone to a known state. Nice!
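Gluing the two commands together makes it easy to snapshot every zone from a CI job. The loop below is a sketch; the profile name and file naming scheme are arbitrary:

#!/bin/sh
# Dump every hosted zone to a timestamped JSON file.
PROFILE=me
for zone in $(aws --profile "$PROFILE" route53 list-hosted-zones --query 'HostedZones[].Id' --output text); do
    id=$(basename "$zone")
    aws --profile "$PROFILE" route53 list-resource-record-sets --hosted-zone-id "$id" > "route53-${id}-$(date +%Y%m%d).json"
done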
I love AWS, but when I’m debugging issues I prefer the Linux command line over CloudWatch Logs Insights. Numerous AWS services ship their logs to CloudWatch, which presents a small challenge since my tooling (ripgrep, awk, jq, sed, etc.) can’t directly access CloudWatch logs. The aws command line has a nifty get-log-events subcommand which solves this problem. It allows you to export events from a log stream, and it has several options to control what gets exported.
The following example shows how to export events from the stream named LOG_STREAM_NAME, limited to the past five minutes (CloudWatch expects timestamps in milliseconds, which is why the nanosecond epoch value is trimmed to 13 digits):
$ aws --profile la --region us-east-1 logs get-log-events --log-group-name LOG_GROUP_NAME --log-stream-name LOG_STREAM_NAME --start-time $(date "+%s%N" -d "5 minutes ago" | cut -b1-13) > /tmp/log
The export will contain one or more JSON objects similar to the following:
{
    "events": [
        {
            "timestamp": 1580312457000,
            "message": "I0129 15:40:57.094248 1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io",
            "ingestionTime": 1580312464749
        }
    ]
}
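Since the whole point is to get the events in front of standard Unix tooling, a quick jq filter strips the JSON wrapper so grep, awk, and friends can take over (the error hunt below is just an example):

$ jq -r '.events[].message' /tmp/log | rg -i error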
The aws logs command also has a create-export-task subcommand, which can be used to ship an entire log group to an S3 bucket. This can be useful for archiving logs for offline debugging.
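A quick sketch of what that looks like; the bucket name is a placeholder, and the destination bucket needs a policy that allows CloudWatch Logs to write to it:

$ aws --profile la --region us-east-1 logs create-export-task --log-group-name LOG_GROUP_NAME --from $(date "+%s%N" -d "1 day ago" | cut -b1-13) --to $(date "+%s%N" | cut -b1-13) --destination MY_EXPORT_BUCKET --destination-prefix cloudwatch

The aws CLI is super useful, and the more I work with it the more I like it!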