Troubleshooting a bizarre logstash CPU problem


This week I was converting some legacy shell scripts to ansible roles and wandered into a bizarre issue with one of my elasticsearch servers. After committing a couple of changes my CI system rejected the commit due to a system resource issue. When I logged into the system to troubleshoot the issue I noticed the CPU was pegged:

$ mpstat 1

Linux 3.10.0-514.26.2.el7.x86_64 (elastic02) 09/02/2017 _x86_64_ (2 CPU)

11:32:59 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:33:00 AM all 94.03 0.00 3.98 0.00 0.00 0.00 0.00 0.00 0.00 1.99
11:33:01 AM all 94.00 0.00 3.50 0.00 0.00 0.00 0.00 0.00 0.00 2.50
11:33:02 AM all 93.50 0.00 5.50 0.00 0.00 0.00 0.00 0.00 0.00 1.00

This system is used solely to test changes so it should have been 100% idle. Htop showed the Logstash java process as the top CPU consumer so I ventured off to the log directory to see if anything was awry. Sure enough, Logstash was complaining about a missing file:

[2017-09-02T11:37:43,884][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Something is wrong with your configuration."}
[2017-09-02T11:37:52,529][ERROR][logstash.filters.geoip ] Invalid setting for geoip filter plugin:

filter {
geoip {
# This setting must be a path
# File does not exist or cannot be opened /elk/logstash/geoip/GeoLite2-City.mmdb
database => "/elk/logstash/geoip/GeoLite2-City.mmdb"
...
}
}

One of my commits automated the GeoIP database download process. This subtle change contained a bug (more on that in a future post) which was preventing the file from being downloaded and extracted. By default, logstash should exit if it detects a configuration issue. It shouldn’t spin on a CPU. To see what was going on I fired up my good friend strace and attmpted to attch to the PID I retrieved from ps:

$ strace -f -p 4288

strace: attach: ptrace(PTRACE_ATTACH, ...): No such process

Now that is odd. When I ran ps again and looked closer at the pid column I noticed that the java process had a new PID. I ran it a second time and the pid changed again. So it appeared that logstash was dying and then being restarted over and overr. The Logstash service on this machine is managed by systemd and it immediately dawned on me that the “Restart=on-failure” unit file configuration directive (this will restart a failed process) was the cause of my troubles! After a bit of troubleshooting I located the problem with my ansible role, corrected it and low and behold the system load returned to normal. Bugs and cascading failures happen and this problem never made its way to production because my automated test suite let me know that something was wrong. Thank you jenkins!

This article was posted by Matty on 2017-09-02 12:03:00 -0400 -0400