I’ve become a huge fan of Ansible’s async support. It is incredibly useful for performing synthetic health tests on services after a task completes. A common use case is patching a server that hosts one or more applications. Ideally you want to gracefully stop your application, update packages, reboot, and then test the application to ensure it’s healthy. Updating the system and scheduling the reboot is easy to accomplish with serial, the reboot module, and a sentinel-creation script:
hosts: "{{ server_list }}"
become: true
serial: 1
tasks:
- name: Create the restart sentinel script
copy:
src: ../files/restart-sentinel
dest: /usr/local/bin/restart-sentinel
mode: '0700'
owner: 'root'
group: 'root'
- name: Upgrade Operating System packages
yum:
name: '*'
state: latest
register: yum_updates_applied
- name: Run the create-sentinel script to see if an update is required
command: /usr/local/bin/restart-sentinel
when: yum_updates_applied.changed
- name: Check to see if the restart sentinel file exists
stat:
path: /run/reboot-required
register: sentinel_exists
- name: Gracefully stop app1
systemd:
name: app1
state: stopped
when: sentinel_exists.stat.exists
- name: Check to make sure app1 exited
shell: |
while true; do
count=$(ps auxw | grep -c [a]pp1)
if [[ $count -ne 0 ]]; then
echo "App1 is still running"
sleep 1
else
echo "All app1 processes are down"
exit 0
fi
done
async: 600
poll: 5
when: sentinel_exists.stat.exists
- name: Reboot the server to pick up a kernel update
reboot:
reboot_timeout: 600
when: sentinel_exists.stat.exists
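The playbook copies restart-sentinel into place but never shows its contents. A plausible sketch for a RHEL-style host follows; the kernel comparison is just one way to decide a reboot is needed, and the script name and paths simply mirror the playbook above:

#!/bin/bash
# Hypothetical restart-sentinel: create /run/reboot-required when the running
# kernel is older than the newest installed kernel.
latest=$(rpm -q --last kernel | head -1 | awk '{print $1}' | sed 's/^kernel-//')
running=$(uname -r)

if [[ "$latest" != "$running" ]]; then
    touch /run/reboot-required
fi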
Once the server reboots, how do you know your application is healthy? If an application fails to start, I want my playbook to fail immediately. There are a couple of ways to accomplish this. One way is to check the application logs for a health string with the wait_for module’s search_regex option:
- name: Checking the application logs for the healthy keyword
  wait_for:
    path: /var/log/app1/app.log
    search_regex: "healthy"
    state: present
  changed_when: False
  when: sentinel_exists.stat.exists
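One thing to watch out for: this check is only trustworthy if the log contains nothing but entries written since the host came back up. Forcing a rotation with logrotate between the application shutdown and the reboot is one way to guarantee that. The task below is a sketch that assumes an app1 logrotate configuration already exists at /etc/logrotate.d/app1:

- name: Rotate the app1 logs before rebooting
  command: logrotate --force /etc/logrotate.d/app1
  when: sentinel_exists.stat.exists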
You can take this one step further and run an application health check command:
- name: Check to make sure app1 is happy
  shell: |
    while true; do
      /usr/local/bin/test-my-app
      RC=$?
      # test-my-app returns 1 when healthy, 0 otherwise.
      if [[ $RC -eq 0 ]]; then
        echo "Waiting for app1 to become healthy"
        sleep 1
      else
        echo "App1 is reporting a healthy status"
        exit 0
      fi
    done
  async: 600
  poll: 5
  when: sentinel_exists.stat.exists
The test-my-app command is invoked in a loop, and will hopefully report a healthy status before the async timeout (600 seconds in the example above) expires. These features are incredibly powerful, and have helped set my mind at ease when performing complex updates in production. If you find this material useful, you may also be interested in my article on using the Ansible uri module to test web services during playbook execution.
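As a teaser, here is a minimal sketch of that idea. The endpoint, port, and retry counts are made up for illustration; the uri module simply polls the URL until it returns a 200:

- name: Wait for app1 to answer its health check
  uri:
    url: http://localhost:8080/healthz
    status_code: 200
  register: app1_health
  until: app1_health.status == 200
  retries: 60
  delay: 5
  when: sentinel_exists.stat.exists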
This past weekend I spent some time revamping a few playbooks. One of my playbooks was taking a while to run, and I wanted to see how much actual time was spent in each task. Luckily for me, Ansible has a profiling callback plugin to help with this. To enable it, you can add the following directive to the [defaults] section in your ansible.cfg configuration file:
callback_whitelist = profile_tasks
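Newer ansible-core releases prefer the callbacks_enabled spelling for the same setting, so depending on your version the stanza may look like this instead:

[defaults]
callbacks_enabled = profile_tasks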
Future playbook runs will result in a timestamp being printed for each task:
TASK [pihole : Adding port 53 to firewalld] ***********************************
Saturday 11 April 2020 09:22:27 -0400 (0:00:00.080) 0:00:03.630 ********
ok: [127.0.0.1]
And a global summary will be produced once the playbook completes:
Saturday 11 April 2020 09:22:28 -0400 (0:00:00.326) 0:00:04.486 ********
===============================================================================
Gathering Facts --------------------------------------------------------- 1.09s
pihole : Adding port 53 to firewalld ------------------------------------ 0.53s
pihole : Pulling down the pihole docker image --------------------------- 0.44s
pihole : Creating hosts file -------------------------------------------- 0.39s
pihole : Get list of blacklisted domains -------------------------------- 0.35s
.....
Super useful feature, especially when you are trying to shave time off complex playbook runs.
When I first started with Kubernetes, it took me some time to understand two things. The first was how to generate manifests to run my services, which I tackled in a previous blog post. The second was wrapping my head around RBAC policies. Roles, Bindings, Verbs, OH MY! After a bit of research I understood how RBAC worked, but who wants to generate RBAC policy from scratch? Ouch!
Luckily my research turned up an amazing tool, audit2rbac, which can generate RBAC policies from Kubernetes audit logs. This is now my go-to solution for creating initial RBAC policies. When I need to create an RBAC policy, I will spin up a kind cluster with auditing enabled, run the workload, and then process the audit logs with audit2rbac. This gives me an initial RBAC policy, which I can then refine to suit my needs.
Audit2rbac works with Kubernetes audit logs. To enable auditing, you can pass one or more audit flags to the API server. For a test kind cluster, the following flags have served me well:
- --audit-log-format=json
- --audit-policy-file=/etc/kubernetes/pki/policy
- --audit-log-path=-
- --audit-log-maxsize=1
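If you are standing up the cluster with kind, these flags can be wired in through the cluster config. The snippet below is a sketch under a few assumptions: it uses the v1alpha4 kind config format, it stores the policy at /etc/kubernetes/pki/policy (a directory kubeadm already mounts into the API server pod), and it expects an audit-policy.yaml file to live next to the cluster config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Copy the local policy file into the node so the API server can read it.
    extraMounts:
      - hostPath: ./audit-policy.yaml
        containerPath: /etc/kubernetes/pki/policy
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            audit-log-format: json
            audit-policy-file: /etc/kubernetes/pki/policy
            audit-log-path: "-"
            audit-log-maxsize: "1"

A bare-bones audit-policy.yaml that records every request at the Metadata level is enough for audit2rbac to work with:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata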
You will also need to create an audit policy document; if you need something more granular than the bare-bones policy above, this example is a good place to start. Once auditing is enabled, you should see entries similar to the following in the API server audit log (the path to the log is controlled with the --audit-log-path option):
2020-01-28T19:35:45.020478035Z stdout F {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"87f75541-f426-44ed-baeb-c7259ccd4dbf","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/audit-control-plane?timeout=10s","verb":"update","user":{"username":"system:node:audit-control-plane","groups":["system:nodes","system:authenticated"]},"sourceIPs":["172.17.0.2"],"userAgent":"kubelet/v1.16.3 (linux/amd64) kubernetes/b3cbbae","objectRef":{"resource":"leases","namespace":"kube-node-lease","name":"audit-control-plane","uid":"659f94a2-62c9-4d02-8637-e02f50d5945f","apiGroup":"coordination.k8s.io","apiVersion":"v1","resourceVersion":"7839"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2020-01-28T19:35:45.019014Z","stageTimestamp":"2020-01-28T19:35:45.020336Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
To generate an RBAC policy with audit2rbac, you will need to run your service, or invoke one or more kubectl commands to generate audit events. We can run kubectl to see how this process works:
$ kubectl get pod
The kubectl get will cause a number of audit events to be generated. If you are using kind, you can export these logs with the export command:
$ kind export logs /tmp/audit --name audit
Once the logs are exported, we need to remove everything from the events other than the JSON object:
$ cat /tmp/audit/*control*/containers/*api* | grep Event | sed 's/^.* F //' > audit.log
Now that we have a log full of JSON audit events, we can run audit2rbac specifying the user or service account to audit:
$ audit2rbac -f audit.log --user kubernetes-admin
This will produce YAML similar to the following:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    audit2rbac.liggitt.net/version: v0.7.0
  labels:
    audit2rbac.liggitt.net/generated: "true"
    audit2rbac.liggitt.net/user: kubernetes-admin
  name: audit2rbac:kubernetes-admin
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    audit2rbac.liggitt.net/version: v0.7.0
  labels:
    audit2rbac.liggitt.net/generated: "true"
    audit2rbac.liggitt.net/user: kubernetes-admin
  name: audit2rbac:kubernetes-admin
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: audit2rbac:kubernetes-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubernetes-admin
---
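Once you are happy with the generated policy, applying it and spot-checking the result only takes a couple of commands. The file name below is an assumption, and kubectl auth can-i uses impersonation to confirm the binding behaves the way you expect:

$ kubectl apply -f audit2rbac.yaml
$ kubectl auth can-i list pods --as kubernetes-admin -n default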
This is super useful! No more cutting & pasting RBAC YAML to create an initial policy. The YAML that is produced gives you a good understanding of what is needed to restrict access, and it can be adjusted to meet your security requirements. The following YouTube video contains a super cool demo showing what audit2rbac can do:
Definitely worth watching!
In a previous post, I discussed using the Kubernetes external-dns project to manage DNS changes. Prior to rolling it out, I needed a way to back up each zone before external-dns modified it, and I wanted a backup to be taken each time a commit resulted in a DNS change. This turned out to be super easy to do with the aws CLI. To export all of the records in a zone, you will first need to locate the zone Id, which you can get with the “list-hosted-zones” command:
$ aws --profile me route53 list-hosted-zones
{
    "HostedZones": [
        {
            "Id": "/hostedzone/XXXXXXXXXXX",
            "Name": "prefetch.net.",
            "CallerReference": "XXXXXXXX",
            "Config": {
                "Comment": "HostedZone created by Route53 Registrar",
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 2
        }
    ]
}
Once you have the id you can export the records with the “list-resource-record-sets” command:
$ aws --profile me route53 list-resource-record-sets --hosted-zone-id XXXXXXXXXXX
This will produce a JSON object which you can stash in a safe location. If something were to happen to your Route 53 zone, you can wrap the saved records in a change batch and feed them to the “change-resource-record-sets” command to restore the zone to a known state. Nice!
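Gluing the two commands together makes it easy to snapshot every zone from a CI job. The loop below is a sketch; the profile name and file naming scheme are arbitrary:

#!/bin/sh
# Dump every hosted zone to a timestamped JSON file.
PROFILE=me
for zone in $(aws --profile "$PROFILE" route53 list-hosted-zones --query 'HostedZones[].Id' --output text); do
    id=$(basename "$zone")
    aws --profile "$PROFILE" route53 list-resource-record-sets --hosted-zone-id "$id" > "route53-${id}-$(date +%Y%m%d).json"
done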
I love AWS, but when I’m debugging issues I prefer the Linux command line over CloudWatch Logs Insights. Numerous AWS services ship their logs to CloudWatch, which presents a small challenge since my tooling (ripgrep, awk, jq, sed, etc.) can’t directly access CloudWatch logs. The aws command line has a nifty get-log-events subcommand which solves this problem. It allows you to export events from a log stream, and it has several options to control what gets exported.
The following example shows how to export events from the stream named LOG_STREAM_NAME, limited to the past five minutes (CloudWatch expects timestamps in milliseconds, which is why the nanosecond epoch value is trimmed to 13 digits):
$ aws --profile la --region us-east-1 logs get-log-events --log-group-name LOG_GROUP_NAME --log-stream-name LOG_STREAM_NAME --start-time $(date "+%s%N" -d "5 minutes ago" | cut -b1-13) > /tmp/log
The export will contain one or more JSON objects similar to the following:
{
    "events": [
        {
            "timestamp": 1580312457000,
            "message": "I0129 15:40:57.094248 1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io",
            "ingestionTime": 1580312464749
        }
    ]
}
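Since the whole point is to get the events in front of standard Unix tooling, a quick jq filter strips the JSON wrapper so grep, awk, and friends can take over (the error hunt below is just an example):

$ jq -r '.events[].message' /tmp/log | rg -i error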
The aws logs command also has a create-export-task subcommand, which can be used to ship an entire log group to an S3 bucket. This can be useful for archiving logs for offline debugging.
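A quick sketch of what that looks like; the bucket name is a placeholder, and the destination bucket needs a policy that allows CloudWatch Logs to write to it:

$ aws --profile la --region us-east-1 logs create-export-task --log-group-name LOG_GROUP_NAME --from $(date "+%s%N" -d "1 day ago" | cut -b1-13) --to $(date "+%s%N" | cut -b1-13) --destination MY_EXPORT_BUCKET --destination-prefix cloudwatch

The aws CLI is super useful, and the more I work with it the more I like it!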