A year and a half ago I started using prometheus and grafana to graph metric data. This combination is incredibly powerful and I have been absolutely blown away by the amount of actionable intelligence I’ve been able to convey in our dashboards. Prometheus has a number of exporters which can be used to retrieve metric data from remote endpoints and stash it in its highly optimized time series database. There are exporters for MongoDB, Postgres, vSphere, Nginx, HAProxy, JMX as well as numerous other applications. If it’s a popular application there is most likely an exporter for it. If an exporter isn’t available the Prometheus developers have made it crazy easy to develop new ones with Python and Go.
I recently started playing with pyvmomi which is a Python SDK for the VMware vSphere API. The SDK allows you to view and manipulate all aspects of vSphere programmatically. This allows you to automate things like collecting and graphing host and VM performance data, adding disks, putting hosts into maintenance mode or generating common reports. It’s powerful stuff! Being super curious I decided to write a Prometheus VM performance metrics exporter so I could overlay business metrics on top of HTTP responses which could then be laid on top of VM and host performance metrics. After a couple hours of hacking my Flask-based exporter was spitting out VM performance metrics. Or at least that’s what I thought. Periodically the exporter would die and the following stack trace would be written to the console:
1.2.3.4 - - [18/Mar/2018 01:10:48] "GET /metrics HTTP/1.1" 200 -
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/lib64/python2.7/SocketServer.py", line 710, in finish
    self.wfile.close()
  File "/usr/lib64/python2.7/socket.py", line 279, in close
    self.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
Leading up to the exception everything looked good. There were 1020 GET /metrics HTTP requests with status code 200s in the access logs and my metrics were flowing into my Prometheus development environment. But right around 1020 requests my exporter fell on its face. Splat! Based on previous dealings with file descriptor exhaustion I started to wonder if my code was leaking file descriptors. To verify my hypothesis I fired up the program and changed into the processes fd (file descriptor) directory:
$ cd /proc/5799/fd
To get a baseline I ran ls:
$ ls -la
lrwx------. 1 matty matty 64 Mar 20 09:33 0 -> /dev/pts/2
lrwx------. 1 matty matty 64 Mar 20 09:33 1 -> /dev/pts/2
lr-x------. 1 matty matty 64 Mar 20 09:33 10 -> /dev/urandom
lrwx------. 1 matty matty 64 Mar 20 09:33 2 -> /dev/pts/2
lrwx------. 1 matty matty 64 Mar 20 09:33 3 -> socket:[99227]
lrwx------. 1 matty matty 64 Mar 20 09:33 4 -> socket:[99228]
Next I ran a simple shell loop to request the /metrics URI 10 times:
for i in $(seq 1 10); do
   echo "GET /metrics HTTP/1.0" | nc localhost 8000
done
Then I ran ls again to compare the current file descriptor count to the one I generated previously:
$ ls -l
lrwx------. 1 matty matty 64 Mar 20 09:33 0 -> /dev/pts/2
lrwx------. 1 matty matty 64 Mar 20 09:33 1 -> /dev/pts/2
lr-x------. 1 matty matty 64 Mar 20 09:33 10 -> /dev/urandom
lrwx------. 1 matty matty 64 Mar 20 09:35 11 -> socket:[102747]
lrwx------. 1 matty matty 64 Mar 20 09:35 12 -> socket:[102765]
lrwx------. 1 matty matty 64 Mar 20 09:35 13 -> socket:[102773]
lrwx------. 1 matty matty 64 Mar 20 09:36 14 -> socket:[102778]
lrwx------. 1 matty matty 64 Mar 20 09:36 15 -> socket:[102785]
lrwx------. 1 matty matty 64 Mar 20 09:36 16 -> socket:[102792]
lrwx------. 1 matty matty 64 Mar 20 09:33 2 -> /dev/pts/2
lrwx------. 1 matty matty 64 Mar 20 09:33 3 -> socket:[99227]
lrwx------. 1 matty matty 64 Mar 20 09:33 4 -> socket:[99228]
lrwx------. 1 matty matty 64 Mar 20 09:35 6 -> socket:[103556]
lrwx------. 1 matty matty 64 Mar 20 09:35 7 -> socket:[102671]
lrwx------. 1 matty matty 64 Mar 20 09:35 8 -> socket:[103572]
lrwx------. 1 matty matty 64 Mar 20 09:35 9 -> socket:[102722]
Sure enough, my code was leaking file descriptors. But where exactly was this occurring? The code created a connection to vCenter when the /metrics end point was scraped and closed it when it was complete (the final code uses persistent connections to avoid the set up and tear down costs). I also verified this by running strace against the server process:
$ strace -e trace=open,socket,connect,close stats.py
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
connect(6, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("1.2.3.4")}, 16) = 0
*** where is close(6)?!?! ***
The socket was set up, the connect completed but a close() wasn’t issued to the file desriptor returned from socket(). When I ran the code under pdb I saw it call the Disconnect() method to close the connection. This method is defined in connect.py:
def Disconnect(si):
   """
   Disconnect (logout) service instance
   @param si: Service instance (returned from Connect)
   """
   # Logout
   __Logout(si)
   SetSi(None)
def SetSi(si):
   """ Set the saved service instance. """
   global _si
   _si = si
So the serviceInstance object should get set to None causing the file descriptor to be closed once the last reference to the object is gone. But for some reason this wasn’t happening. Reviewing the code paths my program was taking revealed this gem:
atexit.register(Disconnect, serviceInstance)
This set off a light in my head. The atexit() method takes an object and a function to call on the object when the program exits. We can see this first hand in atexit.py:
def register(func, *targs, **kargs):
    """register a function to be executed upon normal program termination
    func - function to be called at exit
    targs - optional arguments to pass to func
    kargs - optional keyword arguments to pass to func
    func is returned to facilitate usage as a decorator.
    """
    _exithandlers.append((func, targs, kargs))
    return func
I was calling atexit() each time I created a connection to vCenter which caused the current serviceinstance object to get added to the _exithandlers list. When I called Disconnect it set the object to None but it wasn’t deleted (and the file descriptor wasn’t close()‘ed) since there was a still a reference to it in the _exithandlers list). To verify this I fired up pdb, set a breakpoint on atexit.register() and then dumped _exithandlers after running a couple of tests:
$ python -m pdb ./stats.py
(Pdb) break atexit.register
Breakpoint 2 at /usr/lib64/python2.7/atexit.py:37
(Pdb) c
> /usr/lib64/python2.7/atexit.py(46)register()
-> _exithandlers.append((func, targs, kargs))
(Pdb) s
> /usr/lib64/python2.7/atexit.py(47)register()
-> return func
(Pdb) print _exithandlers
[(<function shutdown at 0x7f50c0e69b90>, (), {}), (<function Disconnect at 0x7f50c03fe7d0>, ('vim.ServiceInstance:ServiceInstance',), {}), (<function Disconnect at 0x7f50c03fe7d0>, ('vim.ServiceInstance:ServiceInstance',), {}), (<function Disconnect at 0x7f50c03fe7d0>, ('vim.ServiceInstance:ServiceInstance',), {}), (<function Disconnect at 0x7f50c03fe7d0>, ('vim.ServiceInstance:ServiceInstance',), {})]
Bingo! My code was leaking file descriptors because atexit() was leaving around references to the object that encapsulated the file descriptor associated with the socket. Commenting out atexit() fixed my issue and the code is now working splendidly. These types of issues pop up during development and it’s incredibly fun debugging them! I learned a ton about the Python debugger during this debugging session and became much more familiar with the code for various Python modules. I also got to read the vast majority of the PyVim and pyVmomi source code which made developing this a snap! atexit(), stage right.