I interact with RESTful services daily and periodically need to review the JSON objects exposed through one or more endpoints. There are several Linux utilities that can take a JSON object and print the object in an easily readable form. The pygmentize utility (available in the python-pygments package) can be fed a JSON object via a file or STDIN:
$ curl http://bind:8080/json 2>/dev/null | pygmentize -l json | more
{
"json-stats-version":"1.2",
"boot-time":"2017-09-09T11:56:04.442Z",
"config-time":"2017-09-09T11:56:04.520Z",
"current-time":"2017-09-09T12:10:36.054Z",
"version":"9.11.2",
.....
}
In the output above I’m retrieving a JSON object from the Bind statistics server and feeding it to pygmentize via STDIN. Pygmentize will take the object it is given and produce a nicely formatted JSON object on STDOUT.
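Pygmentize can also read the object from a file instead of STDIN. Here is a quick sketch, assuming the statistics were saved to a hypothetical file named bind-stats.json (the -l flag selects the JSON lexer explicitly):
$ curl http://bind:8080/json 2>/dev/null > bind-stats.json
$ pygmentize -l json bind-stats.json | more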
Pygmentize is super handy, but the real powerhouse of the JSON command line processors is jq. This amazing utility has numerous features which allow you to retrieve keys, values, objects and arrays and to apply complex filters and operations to these elements. In its simplest form jq will take a JSON object and produce pretty-printed output:
$ curl http://bind:8080/json 2>/dev/null | jq '.' |more
{
  "json-stats-version": "1.2",
  "boot-time": "2017-09-09T11:56:04.442Z",
  "config-time": "2017-09-09T11:56:04.520Z",
  "current-time": "2017-09-09T12:24:43.982Z",
  "version": "9.11.2",
  .....
}
To see the real power of jq we need to look at how operations and filters are applied to a JSON object. The Bind statistics server produces a JSON object similar to the following (heavily edited here to conserve space):
{
  "json-stats-version":"1.2",
  "boot-time":"2017-09-09T11:56:04.442Z",
  "config-time":"2017-09-09T11:56:04.520Z",
  "current-time":"2017-09-09T13:40:59.901Z",
  "version":"9.11.2",
  "rcodes":{
    "NOERROR":7798,
    "FORMERR":0,
    "SERVFAIL":2,
    "NXDOMAIN":166,
    "NOTIMP":0,
    "REFUSED":161,
    "YXDOMAIN":0,
    "YXRRSET":0,
    "NXRRSET":0,
    "NOTAUTH":0,
    "NOTZONE":0
  },
  "qtypes":{
    "A":7023,
    "NS":1,
    "PTR":153,
    "MX":1,
    "AAAA":950
  },
  .....
}
Let's say you wanted to view the number of A, NS, PTR and MX records queried. We can use a jq filter to grab the qtypes object and pass it through a second filter to retrieve the values of the A, NS, PTR and MX keys:
$ curl http://bind:8080/json 2>/dev/null | jq -r '.qtypes | "\(.A) \(.NS) \(.PTR) \(.MX)"'
7023 1 153 1
In this example I am using string interpolation to turn the values of A, NS, PTR and MX into a string which is then printed on STDOUT. Jq also has a number of useful math operations which can be applied to the values in a JSON object. To sum the totals of the various failure response codes in the rcodes object we can use the addition operation:
$ curl http://192.168.1.2:8080/json 2>/dev/null | jq -r '.rcodes| .NXDOMAIN + .SERVFAIL + .REFUSED + .FORMERR'
8135
In this example I am retrieving the values of the NXDOMAIN, SERVFAIL, REFUSED and FORMERR keys and summing them with the addition operator. If you are new to jq or JSON I would highly suggest reading the jq manual and Introducing JSON. These are excellent resources!
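If you want a grand total rather than picking individual keys, jq's add filter can sum every value in an object. Here is a rough sketch: the [.[]] expression collects the values of the qtypes object into an array and add sums them:
$ curl http://bind:8080/json 2>/dev/null | jq '.qtypes | [.[]] | add'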
Bind 9.10 introduced a statistics server which exports a number of useful metrics through a web UI, XML and JSON. The statistics server is configured via the “statistics-channels” directive, which contains the IP and port to export statistics on and an ACL to control who can read statistics from the server. Here is a sample configuration for reference:
acl "stats_hosts" {
192.168.1.0/24;
};
statistics-channels {
inet 10.10.0.1 port 8080 allow { stats_hosts; };
};
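After adding the stanza you can sanity check the configuration and apply it without a full restart. A minimal sketch, assuming named.conf lives in the default /etc/named.conf location and rndc is already configured:
$ named-checkconf /etc/named.conf && rndc reconfig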
Once the statistics server is enabled you can view the statistics in a web browser by surfing to the IP:PORT the server is configured to export statistics through. To retrieve statistics through XML or JSON you can append “/xml” or “/json” to the URL:
Retrieve statistics through XML:
$ curl http://bind:8080/xml
Retrieve statistics through JSON:
$ curl http://bind:8080/json
The statistics server exports several useful metrics. To view everything you can pipe the output of curl to jq:
$ curl http://bind:8080/json 2>/dev/null | jq '.' | more
{
  "json-stats-version": "1.2",
  "boot-time": "2017-09-10T13:24:35.411Z",
  "config-time": "2017-09-10T13:24:35.484Z",
  "current-time": "2017-09-10T13:35:44.401Z",
  "version": "9.11.2",
  "opcodes": {
    "QUERY": 389,
    "IQUERY": 0,
    "STATUS": 0,
    .....
If you want to get specific fields you can adjust the filter passed to jq. To get just the query response codes you can retrieve the rcodes field:
$ curl http://bind:8080/json 2>/dev/null | jq '.rcodes'
{
  "NOERROR": 307,
  "FORMERR": 0,
  "SERVFAIL": 0,
  "NXDOMAIN": 0,
  "NOTIMP": 0,
  "REFUSED": 0,
  .....
}
To get the types of queries sent to the server you can retrieve qtypes:
$ curl http://bind:8080/json 2>/dev/null | jq '.qtypes'
{
  "A": 369,
  "NS": 1,
  "PTR": 1,
  "MX": 1,
  "AAAA": 11
}
To get overall name server statistics you can grab nsstats:
$ curl http://bind:8080/json 2>/dev/null | jq '.nsstats'
{
  "Requestv4": 385,
  "ReqEdns0": 361,
  "RecQryRej": 8,
  "Response": 385,
  "RespEDNS0": 361,
  "QrySuccess": 369,
  "QryAuthAns": 17,
  "QryNoauthAns": 360,
  "QryNxrrset": 8,
  "QryRecursion": 3,
  "QryFailure": 8,
  "QryUDP": 377
}
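Since these counters are just numbers, jq's arithmetic operators work here as well. As a rough example (using the Requestv4 and QrySuccess counters shown above), the following computes the percentage of IPv4 requests that were answered successfully:
$ curl http://bind:8080/json 2>/dev/null | jq -r '.nsstats | (.QrySuccess / .Requestv4 * 100)'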
The statistics server also exports zone data and zone, network and memory statistics. Funneling this data into metricbeat or prometheus and using kibana and grafana to visualize it can provide some amazing insight into your DNS infrastructure.
Over the past month I’ve been evaluating metricbeat. Metricbeat and the ELK stack are incredibly powerful tools for deriving meaning from metrics and unstructured log data. Metricbeat allows you to funnel system and application metrics (e.g., CPU utilization, number of HTTP GET requests, number of SQL queries, HTTP endpoint response times, etc.) into elasticsearch, and the powerful kibana visualization tool can then be used to make sense of it.
To get up and running with metricbeat you will first need somewhere to send the data: either a logstash pipeline configured to accept incoming beats or elasticsearch itself. Once that is in place you will need to install the metricbeat daemon on each system you want to collect metrics from. To configure metricbeat you will need to modify /etc/metricbeat/metricbeat.yml.
The first section in metricbeat.yml tells metricbeat which metricsets to collect. Currently there are metricsets for load average, CPU, disk, memory, network and process utilization. To enable a metricset you need to make sure it isn’t commented out. The following snippet tells metricbeat to collect CPU, memory and network metrics every 10 seconds:
- module: system
  metricsets:
    # CPU stats
    - cpu
    # Memory stats
    - memory
    # Network stats
    - network
  enabled: true
  period: 10s
  processes: ['.*']
Getting the collection period correct is definitely an art. Collecting metrics too frequently will increase system load and can skew the meaning of the metrics you are collecting. Not sampling data often enough can hide short lived problems. You will definitely need to experiment to find the collection interval that is optimal for your environment.
The next section in the file contains one or more outputs. These control where metrics are sent. Metrics can be sent directly to elasticsearch if you don’t need to do any processing. You can also route them to logstash and apply one or more filters to the metrics prior to placing them in an elasticsearch index. The following snippet shows how to ship metrics to elasticsearch over SSL with authentication:
output.elasticsearch:
  hosts: ["https://elastic.my.domain:9200"]
  username: "metricdata"
  password: "WOULDNTYOULIKETOKNOW"
  index: "metricbeat"
  ssl.certificate_authorities: ["/elk/certs/ca.pem"]
  ssl.certificate: "/elk/certs/cert.pem"
  ssl.key: "/elk/certs/cert.key"
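Before wiring the daemon into systemd it can be useful to run metricbeat in the foreground and watch for configuration or output errors. A quick sketch (-e sends the logs to stderr and -c points at the configuration file):
$ metricbeat -e -c /etc/metricbeat/metricbeat.yml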
Once the configuration is in place you can start metricbeat with systemctl:
$ systemctl enable metricbeat && systemctl start metricbeat
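Once the daemon is running you can confirm documents are landing in elasticsearch by querying the index directly. Here is a rough check that reuses the credentials and CA certificate from the output section above (the _cat/indices API lists matching indices and their document counts):
$ curl --cacert /elk/certs/ca.pem -u metricdata 'https://elastic.my.domain:9200/_cat/indices/metricbeat*?v'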
If the daemon starts up you will see metric data in the metricbeat index (assuming that is the index you are using for beat data) in Kibana. In the next couple of posts I’ll show some of the visualizations I’ve used to track down some really weird problems. It’s AMAZING how easy it is to find issues once all of your metric data is in a single location.
Last month I started playing with elastic’s metricbeat and you can say I fell in love at first beat. I’ve created some amazing visualizations with the metrics it produces and am blown away by how much visibility I can get from correlating disparate event streams. A good example of this is being able to see VMware hypervisor utilization, system utilization and HTTP endpoint latency stacked on top of each other. Elastic hosts a yum metricbeat repository and it’s easy to deploy it to Fedora-derived servers with ansible’s templating capabilities and the yum_repository module.
The following tasks from my metricbeat role will create a metricbeat yum repository configuration file, install the metricbeat package, deploy a templated metricbeat configuration file then enable and start the service:
---
# tasks file for metricbeat
- name: Add metricbeat repository
  yum_repository:
    name: metricbeat
    description: Beats Repo
    baseurl: https://artifacts.elastic.co/packages/5.x/yum
    gpgkey: https://packages.elastic.co/GPG-KEY-elasticsearch
    gpgcheck: yes
    enabled: yes
    owner: root
    group: root
    state: present
    mode: 0600
- name: Install metricbeat package
  package:
    name: "{{ item }}"
    state: present
  with_items:
    - metricbeat
- name: Create metricbeat configuration file
  template:
    src: metricbeat.yml.j2
    dest: /etc/metricbeat/metricbeat.yml
    owner: root
    group: root
    mode: 0644
- name: Enable the metricbeat systemd service
  systemd:
    name: metricbeat
    enabled: yes
    state: started
    daemon_reload: yes
I’ve simplified the example to illustrate how easy it is to get up and running with metricbeat. Error handling and conditional restart logic are missing from the example above.
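When iterating on the role it can also be handy to preview what would change before touching a host. A minimal sketch, assuming the role is wired into a hypothetical playbook named metricbeat.yml (--check runs the play in dry-run mode and --diff shows the templated file changes):
$ ansible-playbook -i inventory metricbeat.yml --check --diff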
I’ve become a huge fan of ansible’s templating capabilities over the past few months. If you haven’t used them, they allow you to control the content of a file that is delivered to a system. Templates can contain variable names which get filled in with well-known values, you can use math operations and various filters to derive values, and all of this can be wrapped in logic statements to control when and where it occurs.
To illustrate this, let’s say we are looking to stand up a fault tolerant haproxy cluster and want to use keepalived to control the virtual IPs that float between servers. You could create one configuration file per server and then push these to the appropriate server through the copy module. This duplicates content (a classic anti-pattern) and adds more maintenance over the long term. A better approach would be to create one configuration file and fill it in with variables that are unique to each server. These unique variables could be the name of the host, its primary interface, the EC2 auto scaling group, etc. Ansible makes this crazy easy with templates.
Ansible templates are built on top of the amazing Jinja2 templating language. The language allows you to do things like format data, perform math and set operations, calculate random values, fill in variables if a logic operation succeeds etc. I won’t go into any additional detail on the language since the official ansible and Jinja2 documentation are solid!
Now back to the keepalived example. Let’s say I want to create a unique keepalived.conf on each server. The ansible template module can take a template file we created, process it with Jinja2 and then spit out a unique configuration file on each server. A template module task takes the following basic form:
- name: Create keepalived configuration file
  template:
    src: keepalived.conf.j2
    dest: /etc/keepalived/keepalived.conf
    owner: root
    group: root
    mode: 0600
In this example the template file named keepalived.conf.j2 will be processed and a file named /etc/keepalived/keepalived.conf will be created on the server or servers this play was run against. The keepalived.conf.j2 file is a standard configuration file which contains variables (enclosed in mustaches) and logic (e.g., for foo in bar …). To fill in the keepalived.conf global_defs section we can create a couple of variables to define well known values (this gets powerful when you define variables in one place and use them throughout your roles and playbooks):
keepalived_email_to: root
keepalived_email_from: root
keepalived_smtp_server: localhost
These can be combined with the well-known ansible_fqdn variable to give something similar to this:
global_defs {
    # Send an e-mail to each of the following
    # addresses when a failure occurs
    notification_email {
        {{ keepalived_email_to }}
    }
    # The address to use in the From: header
    notification_email_from {{ keepalived_email_from }}
    # The SMTP server to route mail through
    smtp_server {{ keepalived_smtp_server }}
    # How long to wait for the mail server to respond
    smtp_connect_timeout 30
    # A descriptive name describing the router
    router_id vrrp-director-{{ ansible_fqdn }}
}
If this play was run against a server named haproxy01 we would get the following global configuration:
global_defs {
    # Send an e-mail to each of the following
    # addresses when a failure occurs
    notification_email {
        root
    }
    # The address to use in the From: header
    notification_email_from root
    # The SMTP server to route mail through
    smtp_server localhost
    # How long to wait for the mail server to respond
    smtp_connect_timeout 30
    # A descriptive name describing the router
    router_id vrrp-director-haproxy01
}
That’s handy, and it allows you to create a single configuration file with unique values for each system. To continue with our HA keepalived setup, let’s say it needs to manage two virtual IP addresses and we want each server to master one IP address. Once again we could hard code the values in multiple configuration files, or we can use a bit of logic to create unique vrrp_instances for each server. The following snippet shows an example of this:
{% for ip_address in vars['keepalived_virtual_ipaddresses'] %}
# Create a VRRP instance
vrrp_instance vrrp-director-{{ ansible_fqdn }} {
    # The initial state to transition to. This option isn't
    # really all that valuable, since an election will occur
    # and the host with the highest priority will become
    # the master. The priority is controlled with the priority
    # configuration directive.
    state MASTER
    # The interface keepalived will manage
    interface {{ ansible_default_ipv4.interface }}
{% set router_id = keepalived_initial_router_id + loop.index %}
    # The virtual router id number to assign the routers to
    virtual_router_id {{ router_id }}
{% set node1 = groups["haproxyservers"][0:1] | join(" ") %}
{% set node2 = groups["haproxyservers"][1:2] | join(" ") %}
{% if loop.index % 2 %}
{% if inventory_hostname == node1 %}
{% set priority = 1 %}
{% else %}
{% set priority = 2 %}
{% endif %}
{% else %}
{% if inventory_hostname == node2 %}
{% set priority = 1 %}
{% else %}
{% set priority = 2 %}
{% endif %}
{% endif %}
    # The priority to assign to this device. This controls
    # who will become the MASTER and BACKUP for a given
    # VRRP instance.
    priority {{ priority }}
    # How many seconds to wait until a gratuitous arp is sent
    garp_master_delay 10
    # How often to send out VRRP advertisements
    advert_int 1
    # Execute a notification script when a host transitions to
    # MASTER or BACKUP, or when a fault occurs. The arguments
    # passed to the script are:
    #   $1 - "GROUP"|"INSTANCE"
    #   $2 = name of group or instance
    #   $3 = target state of transition
    # Sample: VRRP-notification.sh VRRP_ROUTER1 BACKUP 100
    # notify "/usr/local/bin/VRRP-notification.sh"
    # Send an SMTP alert during a state transition
    smtp_alert
    # Authenticate the remote endpoints via a simple
    # username/password combination
    authentication {
        auth_type AH
        auth_pass {{ keepalived_auth_key }}
    }
    # The virtual IP addresses to float between nodes. The
    # label statement can be used to bring an interface
    # online to represent the virtual IP.
    virtual_ipaddress {
        {{ ip_address }}/32 dev {{ ansible_default_ipv4.interface }}
    }
}
{% endfor %}
In the template above I am iterating over one or more IPs defined in the keepalived_virtual_ipaddresses variable and building a vrrp_instance stanza for each one. The physical interface is assigned based on the value of the well-known ansible_default_ipv4.interface variable, the virtual_router_id is assigned dynamically for each stanza, and the priority value (this controls who owns the IP initially) is generated on the fly based on a modulus operation. I’m still learning everything there is to know about Jinja2 and I’m sure I will refactor this in a couple of months once I come across a better way to do this. This blog post is more of a reference to myself than anything else.
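For reference, the keepalived_virtual_ipaddresses variable used above is just a list that can live in group_vars, host_vars or be passed in as extra vars. As a quick sketch with made-up addresses and a hypothetical playbook name, you could supply it on the command line like this:
$ ansible-playbook -i inventory keepalived.yml -e '{"keepalived_virtual_ipaddresses": ["192.168.1.100", "192.168.1.101"]}'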