Over the past several years I’ve read numerous horror stories about cloud deployments gone wrong: S3 buckets with PCI data left open to the raw Internet, EC2 instance profiles that weren’t scoped properly, misconfigured NSGs, etc. It takes a LOT of time to truly understand all the ins and outs of running workloads in the cloud, and to make sure you get it “right”. This is one reason I’m always on the lookout for tools that can add additional guardrails to the infrastructure provisioning process.
One super cool tool I came across six months ago is terrascan. This awesome piece of software can help you find security and policy violations in your IaC repos, and help ensure that the most common misconfigurations are corrected before they become actual problems.
Using terrascan is a breeze. You can run the terrascan binary with the scan subcommand, and point it at the directory to scan with the “--iac-dir” option:
$ terrascan scan --iac-dir environments/testing/services/vault
Violation Details -
Description : Ensure that detailed monitoring is enabled for EC2 instances.
File : ../../../../terraform/modules/aws/ec2-instance/main.tf
Module Name : ec2_instances
Plan Root : ./
Line : 1
Severity : HIGH
-----------------------------------------------------------------------
Description : EC2 instances should disable IMDS or require IMDSv2 as this can be related to the weaponization phase of kill chain
File : ../../../../terraform/modules/aws/ec2-instance/main.tf
Module Name : ec2_instances
Plan Root : ./
Line : 1
Severity : MEDIUM
-----------------------------------------------------------------------
Scan Summary -
File/Folder : ./environments/testing/services/vault
IaC Type : terraform
Scanned At : 2022-05-08 18:52:07.496177557 +0000 UTC
Policies Validated : 5
Violated Policies : 2
Low : 0
Medium : 1
High : 1
The output will contain a list of violations, ranked from HIGH to LOW. Each violation is documented on the terrascan website, which also contains links to reference material. Terrascan policies are written in Rego, so you can easily extend the base functionality with custom policies. It also integrates easily with CI, so you get an extra set of eyes for free!
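If you write your own Rego policies, you can also point terrascan at them during a scan. Here is a quick sketch, assuming a recent terrascan release with the “--policy-path” option and a hypothetical local directory holding your custom policies:
$ terrascan scan --iac-dir environments/testing/services/vault --policy-path ./custom-policies
The same invocation drops nicely into a CI job, so every pull request gets scanned automatically.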
Having worked in the “cloud” for several years, one thing that I’m super conscious about is our cloud bill. There are tons of subtleties associated with billing, such as AZ-to-AZ traffic costs or how VPC endpoints can reduce egress charges. If you utilize Terraform for infrastructure provisioning, you may want to look at infracost. Infracost can help you understand cloud spend for a greenfield deployment, or what it will cost to expand your existing infrastructure. While it can’t model costs exactly, what it can do is help you approximate what a given infrastructure change will do to your wallet.
Getting started with infracost is super easy. You will first need to run $(infracost register) to get an API token. Once you have this, you will need to set the INFRACOST_API_KEY environment variable with the token you received. This token is used to talk to the infracost cloud pricing APIs.
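The setup looks roughly like this (the key value below is just a placeholder):
$ infracost register
$ export INFRACOST_API_KEY=<your-api-key>
With the key in place, you can view a full cost estimate for a service by running infracost with the breakdown command and pointing the “--path” option at the directory that contains the Terraform HCL you want to analyze: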
$ infracost breakdown --path .
Detected Terraform directory at .
✔ Checking for cached plan... expired
✔ Running terraform plan
✔ Running terraform show
✔ Extracting only cost-related params from terraform
✔ Retrieving cloud prices to calculate costs
Project: .
Name Monthly Qty Unit Monthly Cost
module.eks_control_plane.aws_cloudwatch_log_group.eks_logging
├─ Data ingested Monthly cost depends on usage: $0.50 per GB
├─ Archival Storage Monthly cost depends on usage: $0.03 per GB
└─ Insights queries data scanned Monthly cost depends on usage: $0.005 per GB
module.eks_control_plane.aws_eks_cluster.eks_cluster
└─ EKS cluster 730 hours $73.00
module.eks_node_group_api.aws_eks_node_group.eks_node_group
├─ Instance usage (Linux/UNIX, on-demand, m5.large) 1,460 hours $140.16
└─ Storage (general purpose SSD, gp2) 40 GB $4.00
module.eks_node_group_ingress.aws_eks_node_group.eks_node_group
├─ Instance usage (Linux/UNIX, on-demand, m5.large) 730 hours $70.08
└─ Storage (general purpose SSD, gp2) 20 GB $2.00
OVERALL TOTAL $289.24
──────────────────────────────────
24 cloud resources were detected:
∙ 4 were estimated, 3 of which include usage-based costs, see https://infracost.io/usage-file
∙ 20 were free, rerun with --show-skipped to see details
If you are making a change to a root module and want to see how that will impact your bill, you can run infracost with the diff command and the “--path” option:
$ infracost diff --path .
Detected Terraform directory at .
✔ Checking for cached plan... change detected
✔ Running terraform plan
✔ Running terraform show
✔ Extracting only cost-related params from terraform
✔ Retrieving cloud prices to calculate costs
Project: .
~ module.eks_node_group_api.aws_eks_node_group.eks_node_group
+$577 ($144 → $721)
~ Instance usage (Linux/UNIX, on-demand, m5.large)
+$561 ($140 → $701)
~ Storage (general purpose SSD, gp2)
+$16.00 ($4.00 → $20.00)
Monthly cost change for .
Amount: +$577 ($289 → $866)
Percent: +199%
──────────────────────────────────
Key: ~ changed, + added, - removed
24 cloud resources were detected:
∙ 4 were estimated, 3 of which include usage-based costs, see https://infracost.io/usage-file
∙ 20 were free, rerun with --show-skipped to see details
The estimates that infracost provides can be included along with your plan in a pull request, and are invaluable for finding issues (e.g., using a G3 instance instead of an M3) before you apply your plan to a given environment. Infracost gets even more useful when you integrate it with your favorite CI tool. Then when you submit your plan through a pull request (or through a CI job), infracost will run behind the scenes and attach a cost estimate to your PR. The layer of protection infracost provides is worth its weight in gold!
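If you wire this up yourself instead of using one of the prebuilt integrations, one common pattern (assuming a recent infracost release with the comment subcommand; the repository, PR number, and token below are placeholders) is to write the estimate out as JSON and then post it back to the pull request:
$ infracost diff --path . --format json --out-file /tmp/infracost.json
$ infracost comment github --path /tmp/infracost.json --repo myorg/infrastructure --pull-request 42 --github-token $GITHUB_TOKEN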
If you decide to use infracost I would highly suggest signing up for an Enterprise account. It costs the team at infracost time and money to make their APIs available. Spending a few bucks each month will help them improve their product, and give you access to advanced features and support. And for the record, I have no affiliation with infracost. I just love the features it provides!
The growth of the Terraform community is absolutely astounding. New providers are constantly popping up, providers are being upgraded at a feverish pace, and amazing new features are constantly being added. With all of this change, deprecations and breaking changes periodically surface. One way to protect yourself from breaking changes is to pin providers and modules to specific versions. You can accomplish this by adding specific git hashes or tags to your source statements, and by adding version directives to your provider definitions:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.9.0"
    }
    .....
  }
}
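The version argument above takes care of providers. For modules, you can pin the source itself to a specific git tag or commit hash with the ref query parameter. A quick sketch (the module name and repository below are hypothetical):
module "vault" {
  source = "git::https://github.com/myorg/terraform-aws-vault.git?ref=v1.2.0"
  .....
}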
Terraform also has a required_version configuration directive to pin the version of Terraform you use:
terraform {
  required_version = "1.0.8"
  .....
}
With proper testing, you can propagate new versions between environments and catch breaking changes very early on in the process. One super cool utility that helps with this is tfswitch. Tfswitch can be used to manage a collection of terraform binaries, and “activate” the version defined in the required_version block. Activating a specific version is as easy as changing into the directory you want to work in, and running tfswitch:
$ cd environments/testing/services/vault
$ tfswitch
Reading required version from terraform file
Reading required version from constraint: 1.0.11
Matched version: 1.0.11
Installing terraform at /home/matty/bin
Switched terraform to version "1.0.11"
Tfswitch will download the version of Terraform specified in required_version if it doesn’t already exist, and make it accessible through your $PATH. Behind the scenes, tfswitch places versioned terraform binaries in $HOME/.terraform.versions, and the version specified in your HCL is symbolically linked to $HOME/bin (or the location passed to “--bin”). We can see this with ls:
$ ls -la $HOME/bin/terraform
lrwxrwxrwx. 1 matty matty 48 Apr 8 15:43 /home/matty/bin/terraform -> /home/matty/.terraform.versions/terraform_1.0.11
Super useful utility, and makes working with multiple environments a bit easier.
When I was first getting started with Kubernetes, RBAC was one of the topics that took me the longest to grok. Not because the resources (Roles, ClusterRoles, etc.) are hard to interpret, but because learning how to scope your Roles to minimize access takes some practice. That, and a lot of reading to understand the various API groups and what they contain.
In a previous post I mentioned access-matrix, which is an incredible tool for visualizing the RBAC permissions an entity (User, ServiceAccount, Group, etc.) has. One other super useful tool for debugging RBAC is the kubectl auth “can-i” subcommand. Pass it the verb and resource from a typical kubectl command (and optionally an identity to impersonate), and it will tell you whether you are authorized to perform that operation.
The following example shows how to check if the webapp ServiceAccount can get pods in the default namespace:
$ kubectl auth can-i get po --as system:serviceaccount:default:webapp
no
This becomes your best friend when you are fine-tuning your Roles to avoid getting 403s back from the API server. No one (other than security-conscious admins) likes messages similar to the following:
$ kubectl delete po mypod --as=system:serviceaccount:default:webapp
Error from server (Forbidden): pods "mypod" is forbidden: User "system:serviceaccount:default:webapp" cannot delete resource "pods" in API group "" in the namespace "default"
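When you hit one of these, the fix is usually a tightly scoped Role and RoleBinding. Here is a minimal sketch (the Role and binding names are hypothetical) that would allow the webapp ServiceAccount to get, list, and delete pods in the default namespace:
$ kubectl create role pod-manager --verb=get --verb=list --verb=delete --resource=pods -n default
$ kubectl create rolebinding webapp-pod-manager --role=pod-manager --serviceaccount=default:webapp -n default
$ kubectl auth can-i delete po --as system:serviceaccount:default:webapp
yes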
It is definitely worth putting the time into studying RBAC inside and out, and the two tools mentioned above can make this process much more enjoyable.
Debugging production issues can sometimes be a challenge in Kubernetes environments. One specific pain point is debugging containers that don’t include a shell. You may have seen the following when troubleshooting an issue:
$ kubectl exec -it -n kube-system coredns-558bd4d5db-gx469 -- sh
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "4f053952703f78b51bdf38a26ed391d8c2bda4138b87f35170d3fc4ea14fc510": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "sh": executable file not found in $PATH: unknown
Not including a shell in your base image is a best practice, and projects like distroless make it super easy to package your applications with a small shell-less footprint. But when apps go rogue, what options do we have to debug them if the container doesn’t include a shell?
If you have shell access to the Kubernetes node the pod is running on, nsenter and the binaries on that host are a great way to debug problems.
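As a rough sketch of the node-level approach, you can look up the container’s PID on the host and then enter just its network namespace, which lets you run the node’s own tooling against the container’s network stack (the container ID and PID below are placeholders, and the inspect command will vary by container runtime):
$ crictl ps | grep coredns
$ crictl inspect <container-id> | grep pid
$ sudo nsenter -t <pid> -n -- ss -tlnp
But what if you don’t have access to the node, as is the case with some managed Kubernetes services? In this case, ephemeral containers and $(kubectl debug) may be a good option for you.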
Ephemeral container support went into beta in Kubernetes 1.23, which means it is enabled by default on recent releases. This nifty feature allows you to spin up a container of your choosing alongside an existing container. Here is an example that creates an Ubuntu container and attaches it (by placing it in the coredns container’s PID namespace) to a shell-less coredns pod:
$ kubectl debug -n kube-system -it coredns-64897985d-tn4tb --target=coredns --image=ubuntu
Targeting container "coredns". If you don't see processes from this container
it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-vx6mk.
If you don't see a command prompt, try pressing enter.
root@coredns-64897985d-tn4tb:/# ps auxwww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 0.5 750840 43488 ? Ssl 14:29 0:01 /coredns -conf /etc/coredns/Corefile
root 22 0.3 0.0 4248 3380 pts/0 Ss 14:39 0:00 bash
root 37 0.0 0.0 5900 2916 pts/0 R+ 14:39 0:00 ps auxwww
root@coredns-64897985d-tn4tb:/# dlv attach 1
Once you are in the debug container, you can install software, load up debuggers, etc. to get to the bottom of your issue. This is especially handy when you remove a problematic pod from a service so it no longer receives traffic.
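One low-tech way to pull a pod out of a Service is to change its labels so the Service selector no longer matches it; if the pod is managed by a Deployment, the ReplicaSet will spin up a replacement while the original sticks around for you to poke at. A sketch, assuming a hypothetical pod that the Service selects with an app=webapp label:
$ kubectl label pod webapp-7d4b9c-abcde app=webapp-debug --overwrite
This allows you to debug in isolation, and without the time constraints that are usually associated with broken applications. Super cool feature!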