Using Elastic's Metricbeat to collect system utilization data

Over the past month I've been evaluating Metricbeat. Metricbeat, along with the rest of the ELK stack, is an incredibly powerful tool for deriving meaning from metrics and unstructured log data. Metricbeat lets you funnel system and application metrics (e.g., CPU utilization, number of HTTP GET requests, number of SQL queries, HTTP endpoint response times, etc.) into Elasticsearch, where the powerful Kibana visualization tool can then be used to make sense of them.

To get up and running with Metricbeat you will first need to configure your Logstash infrastructure to accept incoming beats. Once Logstash is accepting beats you will need to install the Metricbeat daemon on each system you want to collect metrics from. Metricbeat is configured through a single YAML file, /etc/metricbeat/metricbeat.yml.
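
If your Logstash servers aren't accepting beats yet, the beats input plugin is what does the listening. Here is a minimal sketch of what that input might look like; the port number and certificate paths are purely illustrative, so adjust them for your environment:

input {
  beats {
    # Listen for incoming beats connections on the conventional beats port
    port => 5044
    # Encrypt traffic between the beat and Logstash
    ssl => true
    ssl_certificate => "/elk/certs/cert.pem"
    ssl_key => "/elk/certs/cert.key"
  }
}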

The first section in metricbeat.yml tells Metricbeat which metricsets to collect. Currently there are metricsets for load average, CPU, disk, memory, network and process utilization. To enable a metricset you need to make sure it isn't commented out. The following snippet tells Metricbeat to collect CPU, memory and network metrics every 10 seconds:

metricbeat.modules:
- module: system
  metricsets:
    # CPU stats
    - cpu
    # Memory stats
    - memory
    # Network stats
    - network
  enabled: true
  period: 10s
  processes: ['.*']

Getting the collection period right is definitely an art. Collecting metrics too frequently will increase system load and can skew the meaning of the metrics you are collecting, while sampling too infrequently can hide short-lived problems. You will need to experiment to find the collection interval that works best for your environment.

The next section in the file contains one or more outputs, which control where metrics are sent. Metrics can be shipped directly to Elasticsearch if you don't need to do any processing, or routed through Logstash so that one or more filters can be applied before the metrics are placed in an Elasticsearch index. The following snippet shows how to ship metrics to Elasticsearch over SSL with authentication:

output.elasticsearch:
  hosts: ["https://elastic.my.domain:9200"]
  username: "metricdata"
  password: "WOULDNTYOULIKETOKNOW"
  index: "metricbeat"
  ssl.certificate_authorities: ["/elk/certs/ca.pem"]
  ssl.certificate: "/elk/certs/cert.pem"
  ssl.key: "/elk/certs/cert.key"
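
If you want to run the metrics through Logstash filters first, you would point the logstash output at your beats listener instead. A minimal sketch, assuming a beats input like the one shown earlier is listening on port 5044 (the hostname here is made up for illustration):

output.logstash:
  # Logstash server(s) running the beats input plugin
  hosts: ["logstash.my.domain:5044"]
  # Trust the CA that signed the Logstash certificate
  ssl.certificate_authorities: ["/elk/certs/ca.pem"]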

Once the configuration is in place you can enable and start the metricbeat service with systemctl:

$ systemctl enable metricbeat && systemctl start metricbeat
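
Once the service is running, a quick way to confirm that documents are actually being indexed is the _cat indices API (the host, user and CA path below are the ones from the example output configuration above):

$ curl -u metricdata --cacert /elk/certs/ca.pem "https://elastic.my.domain:9200/_cat/indices/metricbeat*?v"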

If the daemon starts up cleanly you will see metric data in the metricbeat index (assuming this is the index you are using for beat data) when you open Kibana. In the next couple of posts I'll show some of the visualizations I've used to track down some really weird problems. It's AMAZING how easy it is to find issues once all of your metric data is in a single location.

Install Metricbeat with Ansible and the Elastic yum repository

Last month I started playing with Elastic's Metricbeat and you could say I fell in love at first beat. I've created some amazing visualizations with the metrics it produces and am blown away by how much visibility I can get from correlating disparate event streams. A good example of this is being able to see VMware hypervisor utilization, system utilization and HTTP endpoint latency stacked on top of each other. Elastic hosts a Metricbeat yum repository, and it's easy to deploy the package to Fedora-derived servers with Ansible's templating capabilities and the yum_repository module.

The following tasks from my metricbeat role will create a metricbeat yum repository configuration file, install the metricbeat package, deploy a templated metricbeat configuration file, and then enable and start the service:

---
# tasks file for metricbeat
- name: Add metricbeat repository
  yum_repository:
    name: metricbeat
    description: Beats Repo
    baseurl: https://artifacts.elastic.co/packages/5.x/yum
    gpgkey: https://packages.elastic.co/GPG-KEY-elasticsearch
    gpgcheck: yes
    enabled: yes
    owner: root
    group: root
    state: present
    mode: 0600

- name: Install metricbeat package
  package:
    name: "{{ item }}"
    state: present
  with_items:
    - metricbeat

- name: Create metricbeat configuration file
  template:
    src: metricbeat.yml.j2
    dest: /etc/metricbeat/metricbeat.yml
    owner: root
    group: root
    mode: 0644

- name: Enable the metricbeat systemd service
  systemd:
    name: metricbeat
    enabled: yes
    state: started
    daemon_reload: yes

I’ve simplified the example to illustrate how easy it is to get up and running with metricbeat. Error handling and conditional restart logic are missing from the example above.
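
As a sketch of what the conditional restart piece could look like, the template task can notify a handler so the service is only bounced when the configuration actually changes (the handler name below is arbitrary):

---
# handlers file for metricbeat
- name: restart metricbeat
  systemd:
    name: metricbeat
    state: restarted

Wiring it up is then just a matter of adding "notify: restart metricbeat" to the template task above.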

Troubleshooting a bizarre Logstash CPU problem

This week I was converting some legacy shell scripts to Ansible roles and wandered into a bizarre issue with one of my Elasticsearch servers. After committing a couple of changes, my CI system rejected the commit due to a system resource issue. When I logged into the system to troubleshoot, I noticed the CPU was pegged:

$ mpstat 1

Linux 3.10.0-514.26.2.el7.x86_64 (elastic02) 	09/02/2017 	_x86_64_	(2 CPU)

11:32:59 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:33:00 AM  all   94.03    0.00    3.98    0.00    0.00    0.00    0.00    0.00    0.00    1.99
11:33:01 AM  all   94.00    0.00    3.50    0.00    0.00    0.00    0.00    0.00    0.00    2.50
11:33:02 AM  all   93.50    0.00    5.50    0.00    0.00    0.00    0.00    0.00    0.00    1.00

This system is used solely to test changes so it should have been 100% idle. Htop showed the Logstash java process as the top CPU consumer so I ventured off to the log directory to see if anything was awry. Sure enough, Logstash was complaining about a missing file:

[2017-09-02T11:37:43,884][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Something is wrong with your configuration."}
[2017-09-02T11:37:52,529][ERROR][logstash.filters.geoip   ] Invalid setting for geoip filter plugin:

  filter {
    geoip {
      # This setting must be a path
      # File does not exist or cannot be opened /elk/logstash/geoip/GeoLite2-City.mmdb
      database => "/elk/logstash/geoip/GeoLite2-City.mmdb"
      ...
    }
  }

One of my commits automated the GeoIP database download process. This subtle change contained a bug (more on that in a future post) that was preventing the file from being downloaded and extracted. By default, Logstash should exit if it detects a configuration issue; it shouldn't spin on a CPU. To see what was going on I fired up my good friend strace and attempted to attach to the PID I retrieved from ps:

$ strace -f -p 4288

strace: attach: ptrace(PTRACE_ATTACH, ...): No such process

Now that is odd. When I ran ps again and looked closer at the PID column I noticed that the java process had a new PID. I ran it a second time and the PID changed again, so it appeared that Logstash was dying and then being restarted over and over. The Logstash service on this machine is managed by systemd, and it immediately dawned on me that the "Restart=on-failure" unit file directive (which restarts a failed process) was the cause of my troubles! After a bit of troubleshooting I located the problem with my Ansible role, corrected it and, lo and behold, the system load returned to normal. Bugs and cascading failures happen, and this problem never made its way to production because my automated test suite let me know that something was wrong. Thank you, Jenkins!
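
One closing note: if you ever need to check whether a unit is configured to restart itself like this, systemd will tell you directly:

$ systemctl show logstash --property=Restart
Restart=on-failure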

How Elasticsearch bootstrap checks affect development and production mode

One of my friends reached out to me earlier this week for help with an Elasticsearch issue. He was trying to bring up a new cluster to see how Elasticsearch compares to Splunk and was getting a "bootstrap checks failed" error at startup. This was causing his Elasticsearch java processes to bind to localhost instead of the hostname he assigned to the network.host value. Here is a snippet of what I saw when I reviewed the logs:

[2017-08-19T11:31:25,457][ERROR][o.e.b.Bootstrap          ] [elastic01] node validation exception
[2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Elasticsearch has two modes of operation: development and production. In development mode Elasticsearch binds to localhost, allowing you to tinker with settings, test features and break things without impacting other nodes on your network. In production mode Elasticsearch binds to an external interface, allowing it to communicate with other nodes and form clusters. Elasticsearch runs a number of bootstrap checks at startup, and the mode it is operating in determines whether a failed check is logged as a warning or treated as a fatal error. These checks are put in place to protect your server from the data corruption and network partition issues the developers have seen more than once:

“Collectively, we have a lot of experience with users suffering unexpected issues because they have not configured important settings. In previous versions of Elasticsearch, misconfiguration of some of these settings were logged as warnings. Understandably, users sometimes miss these log messages. To ensure that these settings receive the attention that they deserve, Elasticsearch has bootstrap checks upon startup.”

The settings the documentation is referring to are described in the important settings and system settings documentation. In my friend's case he hadn't increased vm.max_map_count or the number of file descriptors available to the elasticsearch Java process. Once he got these fixed up his test cluster fired right up.
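
For reference, both failures map directly to documented settings, and the values come straight from the check output above. A sketch of the fixes (how you raise the file descriptor limit depends on whether Elasticsearch is started by systemd or from a shell):

# Raise the mmap count limit now and persist it across reboots
$ sysctl -w vm.max_map_count=262144
$ echo "vm.max_map_count = 262144" > /etc/sysctl.d/99-elasticsearch.conf

# For the file descriptor limit, either set LimitNOFILE=65536 in the
# systemd unit (or a drop-in), or add a limits.conf entry when starting
# elasticsearch from a shell:
#   elasticsearch - nofile 65536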