Good write-up on Linux consistent network device naming

In RHEL 6.1 the default names assigned to Dell server network interfaces changed from ethX to emX and pXpY. The new names describe where a network interface physically resides in the system, and have the following format:

emX – the Xth (first, second, etc.) onboard interface
pXpY – PCI device X port Y
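
As a rough illustration of how the two formats above can be told apart in a provisioning script, here is a minimal sketch. The describe_iface helper is a made-up name (it is not part of biosdevname), and the patterns are deliberately loose, so treat it as a starting point rather than a complete parser:

```shell
#!/bin/sh
# describe_iface NAME
# Hypothetical helper: classifies an interface name using the
# emX / pXpY convention described above.
describe_iface() {
    name=$1
    case "$name" in
        em[0-9]*)
            # Strip the "em" prefix to get the onboard index.
            echo "onboard interface ${name#em}"
            ;;
        p[0-9]*p[0-9]*)
            dev=${name#p}       # "p3p1" -> "3p1"
            dev=${dev%%p*}      # "3p1"  -> "3"
            port=${name##*p}    # "p3p1" -> "1"
            echo "PCI device $dev port $port"
            ;;
        eth[0-9]*)
            echo "legacy kernel name"
            ;;
        *)
            echo "unknown naming scheme"
            ;;
    esac
}

describe_iface em1
describe_iface p3p1
describe_iface eth0
```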

Dell wrote a really good white paper on this, and the following text from the document summarizes how the pieces fit together:

“A naming mechanism that can impart meaning to the network interface’s name based on the physical location of a network port in concordance to the intended system design is necessary. To achieve that, the system firmware has the ability to communicate the intended order for network devices on the mother board to the Operating System via standard mechanisms such as SMBIOS and ACPI.

The new naming scheme uses ‘biosdevname’ udev helper utility, developed by Dell and released under GPL, suggests new names based on the location of the network adapters on the system as suggested by system BIOS.”

I like the new format, and this will definitely be a nice addition to hardware provisioning systems. Hopefully in the near future we won’t have to poke around with lspci to see which interface is which. :)

Speeding up SSH (SCP) data transfers

I’ll be the first to admit that I’m an SCP addict. It doesn’t matter what kind of data I’m working with: if it can be turned into an object that I can move around with scp, I’m in! One thing I’ve always noticed with scp is the dismal out-of-the-box performance. I read quite some time back on the OpenSSH mailing list that there were some fixed-size buffers inside OpenSSH that prevented copy operations from fully utilizing high-speed network links.

This past week my fellow blogging partner Mike sent me a link to the High Performance SSH (HPN-SSH) project. This nifty project created a patch that replaces the fixed-length buffers with values determined at runtime. This is pretty fricking cool, and I hope they get their changes merged into the OpenSSH source code! The site that currently hosts the HPN-SSH code also has a good article on tuning your TCP/IP network stack. Good stuff!

Using netstat and dropwatch to observe packet loss on Linux servers

Anyone who is running a modern operating system is most likely using TCP/IP to send and receive data. Modern TCP/IP stacks are somewhat complex and have a slew of tunables to control their behavior. The choice of when and when not to tune is not always clear-cut, since the documentation and the advice of various network “experts” don’t always jibe.

When I’m looking into performance problems that are network related one of the first things I review is the netstat “-s” output:

$ netstat -s

Ip:
    25030820 total packets received
    269 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    21629075 incoming packets delivered
    21110503 requests sent out
Icmp:
    12814 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2
        echo requests: 12809
        echo replies: 3
    12834 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 22
        echo request: 3
        echo replies: 12809
IcmpMsg:
        InType0: 3
        InType3: 2
        InType8: 12809
        OutType0: 12809
        OutType3: 22
        OutType8: 3
Tcp:
    138062 active connections openings
    1440632 passive connection openings
    7 failed connection attempts
    2262 connection resets received
    8 connections established
    12225207 segments received
    10785279 segments send out
    10269 segments retransmited
    0 bad segments received.
    69115 resets sent
Udp:
    553643 packets received
    22 packets to unknown port received.
    0 packet receive errors
    6911684 packets sent
UdpLite:
TcpExt:
    33773 invalid SYN cookies received
    154132 TCP sockets finished time wait in fast timer
    6 time wait sockets recycled by time stamp
    72284 delayed acks sent
    3 delayed acks further delayed because of locked socket
    Quick ack mode was activated 269 times
    3359 packets directly queued to recvmsg prequeue.
    2592713 packets directly received from backlog
    4021 packets directly received from prequeue
    3557638 packets header predicted
    1732 packets header predicted and directly queued to user
    1939991 acknowledgments not containing data received
    3179859 predicted acknowledgments
    1631 times recovered from packet loss due to SACK data
    Detected reordering 1034 times using FACK
    Detected reordering 1007 times using SACK
    Detected reordering 622 times using time stamp
    1557 congestion windows fully recovered
    4236 congestion windows partially recovered using Hoe heuristic
    299 congestion windows recovered after partial ack
    2 TCP data loss events
    5 timeouts after SACK recovery
    5 timeouts in loss state
    2511 fast retransmits
    2025 forward retransmits
    88 retransmits in slow start
    5518 other TCP timeouts
    295 DSACKs sent for old packets
    35 DSACKs sent for out of order packets
    251 DSACKs received
    25247 connections reset due to unexpected data
    2248 connections reset due to early user close
    6 connections aborted due to timeout
    TCPSACKDiscard: 2707
    TCPDSACKIgnoredOld: 65
    TCPDSACKIgnoredNoUndo: 12
    TCPSackShifted: 4176
    TCPSackMerged: 2301
    TCPSackShiftFallback: 98834
IpExt:
    InMcastPkts: 2
    OutMcastPkts: 3390453
    InBcastPkts: 8837402
    InOctets: 5156017179
    OutOctets: 2509510134
    InMcastOctets: 80
    OutMcastOctets: 135618120
    InBcastOctets: 2127986990

The netstat output contains a slew of data that can be used to see how much data your host is processing, whether it’s accepting and processing data efficiently, and whether the buffers that link the various layers (Ethernet -> IP -> TCP -> APP) are working optimally.
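
One number that is easy to pull out of this output is the TCP retransmit rate. The sketch below is only an illustration: retrans_pct is a hypothetical helper name, and it assumes the counter wording shown above (it also tolerates the “sent out” / “retransmitted” spelling that newer netstat builds print):

```shell
#!/bin/sh
# retrans_pct
# Hypothetical helper: reads `netstat -s` text on stdin and prints
# retransmitted segments as a percentage of segments sent.
retrans_pct() {
    awk '
        /segments sen[dt] out/   { sent = $1 }
        /segments retransmit/    { retrans = $1 }
        END {
            if (sent > 0) printf "%.3f\n", 100 * retrans / sent
            else          print "n/a"
        }'
}

# Demo using the two Tcp counters from the output above.
retrans_pct <<'EOF'
Tcp:
    10785279 segments send out
    10269 segments retransmited
EOF
```

On a live host the same function can be fed directly with `netstat -s | retrans_pct`. For the counters above it prints 0.095, i.e. roughly one segment in a thousand was retransmitted.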

When I build new Linux machines via kickstart, I make sure my profile contains the ktune package. That is all the tuning I do to start, unless an application or database requires a specific setting (think large pages and SysV IPC settings for Oracle).

Once I’ve met with an application resource and a business analyst, I like to pound the application with a representative benchmark and compare the system performance before and after the stress test was run. By comparing the before and after results I can see where exactly the system is choking (this is very rare), or if the application needs to be modified to accommodate additional load. If the application is a standard TCP/IP based application that utilizes HTTP, I’ll typically turn to siege and iPerf to stress my applications and systems.
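
The before and after comparison can be as simple as saving `netstat -s` to a file on each side of the benchmark and printing the counters that moved. Here is a minimal sketch of that idea; netstat_delta is a made-up helper, the snapshot contents in the demo are invented, and only lines of the form “<count> <description>” are compared (the “Name: value” style counters are ignored):

```shell
#!/bin/sh
# netstat_delta BEFORE AFTER
# Hypothetical helper: prints counters that changed between two saved
# `netstat -s` snapshots, prefixed with the section they belong to.
netstat_delta() {
    awk '
        /^[A-Za-z]+:$/ { section = $1; next }    # e.g. "Tcp:" starts a section
        $1 ~ /^[0-9]+$/ {
            count = $1 + 0
            $1 = ""; sub(/^ +/, "")              # leave only the description in $0
            key = section " " $0
            if (FNR == NR) { before[key] = count; next }   # first file: remember
            if ((key in before) && count != before[key])   # second file: compare
                printf "%+d  %s\n", count - before[key], key
        }
    ' "$1" "$2"
}

# Demo with two tiny, made-up snapshots.
tmp=$(mktemp -d)
printf 'Tcp:\n    10785279 segments send out\n    10269 segments retransmited\n    0 bad segments received.\n' > "$tmp/before"
printf 'Tcp:\n    10800000 segments send out\n    10300 segments retransmited\n    0 bad segments received.\n' > "$tmp/after"
netstat_delta "$tmp/before" "$tmp/after"
rm -rf "$tmp"
```

In practice you would run `netstat -s > before`, kick off the load test, run `netstat -s > after`, and then pass both files to the function.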

If during load-testing I notice that data is being dropped in one or more queues, I’ll fire up dropwatch to observe where in the TCP/IP stack data is being dropped:

$ dropwatch -l kas

Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at netlink_sendskb+14d (0xffffffff813df30e)
1 drops at ip_rcv_finish+32e (0xffffffff813f0c93)
4 drops at ip_local_deliver+291 (0xffffffff813f12d7)
64 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
32 drops at ip_local_deliver+291 (0xffffffff813f12d7)
23 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
1 drops at ip_rcv_finish+32e (0xffffffff813f0c93)
4 drops at .brk.dmi_alloc+1e60bd47 (0xffffffffa045fd47)
2 drops at skb_queue_purge+60 (0xffffffff813b6542)
64 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)

This allows you to see if data is being dropped at the link layer, the IP layer, the UDP/TCP layer or the application layer. If the drops are occurring somewhere in TCP/IP (i.e. inside the kernel) I will review the kernel documentation and source code to see what occurs at the specific areas of the kernel listed in the dropwatch output, and find the sysctl values that control the sizes of the buffers at that layer (some are dynamic, some are fixed).
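
When dropwatch has been running for a while the output gets long, so it helps to total the drops per kernel function before reading the source. A minimal sketch, assuming the “N drops at symbol+offset (address)” line format shown above; summarize_drops is a hypothetical helper name:

```shell
#!/bin/sh
# summarize_drops
# Hypothetical helper: reads captured dropwatch output on stdin and prints
# the total number of drops per kernel symbol, largest first.
summarize_drops() {
    awk '
        # Match lines such as "4 drops at ip_local_deliver+291 (0xffffffff813f12d7)".
        $2 == "drops" && $3 == "at" {
            sym = $4
            sub(/\+.*/, "", sym)     # strip the +offset so call sites group by function
            total[sym] += $1
        }
        END { for (s in total) printf "%d %s\n", total[s], s }
    ' | sort -rn
}

# Demo using a few lines from the session above.
summarize_drops <<'EOF'
1 drops at ip_rcv_finish+32e (0xffffffff813f0c93)
4 drops at ip_local_deliver+291 (0xffffffff813f12d7)
64 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
32 drops at ip_local_deliver+291 (0xffffffff813f12d7)
23 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
EOF
```

The functions at the top of the list are the ones worth looking up first.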

Tuning applications to perform optimally has filled dozens and dozens of books, and it’s a fine art that you learn from seeing problems erupt in the field. It also helps to know how to interpret all the values in the netstat output, and I cannot recommend TCP/IP volume I, TCP/IP volume II and TCP/IP volume III enough! Everyone who runs an IP-connected system should be required to read these before they are allowed access to the system. :)

Using the Linux arping utility to send out gratuitous ARPs

I manage a number of Red Hat and Heartbeat clusters. On a couple of occasions the services that manage the virtual IPs have misbehaved, and the storage has ended up on one node and the virtual IP on another. To fix this I need to manually move the virtual IP to the host it belongs on, and then issue a gratuitous ARP so other hosts on the network clear their ARP cache and use the MAC address associated with the device the virtual IP now resides on.

The Linux arping utility can be used to send out a gratuitous ARP (an “ARP Request” or “ARP Response” is the actual item sent) to update hosts on your network. To use arping to update the ARP cache on all of the devices on the local layer-2 network, you can run arping with the “-U” option (unsolicited ARP mode), the “-I” option (interface to send the gratuitous ARP out on) and the IP that is assigned to the interface you want the ARP cache entry updated for:

$ arping -U -I eth0 186.168.86.100

Once executed, the arping command will cause a layer-2 broadcast message to be sent, which should cause the ARP caches to be updated on all of the hosts on the local network (gratuitous ARPs are sent to the layer-2 broadcast address, so all hosts on your network segment should receive them). This assumes that your hosts aren’t configured to ignore unsolicited ARPs. ;) Arping can also be used to detect duplicate IPs when run with the “-D” (duplicate address detection mode) option. This is a handy tool that everyone should be aware of.
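
Putting the manual recovery steps together, a sketch might look like the following. Everything here is an assumption layered on top of the post: announce_vip and run are made-up helper names, the address 192.0.2.100/24 is a placeholder, the options are those of the iputils version of arping, and the script only prints the commands unless RUN=1 is set:

```shell
#!/bin/sh
# run prints a command by default and only executes it when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# announce_vip IFACE ADDR PREFIX
# Hypothetical helper: checks that the address is free, adds it to the
# interface, then sends gratuitous ARPs for it.
announce_vip() {
    iface=$1 addr=$2 prefix=$3
    # -D: duplicate address detection; exits non-zero if another host answers.
    run arping -D -c 2 -I "$iface" "$addr" || {
        echo "$addr is still in use elsewhere" >&2
        return 1
    }
    run ip addr add "$addr/$prefix" dev "$iface"
    # -U: unsolicited (gratuitous) ARP so neighbours refresh their caches.
    run arping -U -c 3 -I "$iface" "$addr"
}

announce_vip eth0 192.0.2.100 24
```

The duplicate check is there so the address is not brought up while the other node is still holding it. Running with RUN=1 requires root.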

Stopping your RHEL virtual interfaces from starting at boot. ONPARENT you say?

I recently debugged a pretty interesting problem with one of my clusters. When I rebooted one of the nodes, I noticed that a virtual interface that had ONBOOT set to no was started when the network interfaces were initialized. For those not familiar with RHEL systems, the ONBOOT directive tells the network initialization scripts whether or not to start a given interface at boot. This was rather confusing, and after some experimenting with a virtual machine I saw the EXACT same behavior. Something had to be awry here!

After reading through the ifup-aliases script, I saw a reference to the ONPARENT directive. This directive has similar properties to ONBOOT, but only applies to virtual interfaces. A quick Google search revealed that this is indeed the purpose of the directive, though I haven’t seen a whole lot of documentation that refers to it. :(

So if you need to stop a virtual interface from starting when the network interfaces are initialized, you need to set ONPARENT instead of ONBOOT to no. Here is a sample ifcfg file that shows how to use it:

$ cat /etc/sysconfig/network-scripts/ifcfg-bond0:1

DEVICE=bond0:1
BOOTPROTO=static
ONPARENT=no
IPADDR=192.168.1.21
NETMASK=255.255.255.0
NETWORK=192.168.1.0
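
To audit a host for alias interfaces that will still come up with their parent, something like the sketch below can help. It assumes, as the behavior described above suggests, that an alias is started unless ONPARENT is explicitly set to no; list_autostart_aliases is a hypothetical helper name and the demo files are made up:

```shell
#!/bin/sh
# list_autostart_aliases [DIR]
# Hypothetical helper: prints the alias ifcfg files (ifcfg-<dev>:<n>) in DIR
# that do not contain ONPARENT=no.
list_autostart_aliases() {
    dir=${1:-/etc/sysconfig/network-scripts}
    for f in "$dir"/ifcfg-*:*; do
        [ -f "$f" ] || continue                  # the glob matched nothing
        if ! grep -Eiq '^ONPARENT="?no"?[[:space:]]*$' "$f"; then
            echo "${f##*/}"
        fi
    done
}

# Demo against a throwaway directory.
tmp=$(mktemp -d)
printf 'DEVICE=bond0:1\nONPARENT=no\n' > "$tmp/ifcfg-bond0:1"
printf 'DEVICE=bond0:2\nONBOOT=no\n'   > "$tmp/ifcfg-bond0:2"
list_autostart_aliases "$tmp"      # prints: ifcfg-bond0:2
rm -rf "$tmp"
```

Note that the second demo file only sets ONBOOT=no, which is exactly the case that bit me, so it is reported.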

I have no idea why Red Hat couldn’t use ONBOOT for both, but then again I don’t understand a lot of things that come out of Raleigh. Food for thought!

How to learn everything you ever wanted to know about Linux sockets

Viewing network socket data is something SysAdmins do often. We could be called on to see if a connection is established to a host, if an application is listening on a given port, or we may need to review the network connection table as a whole to see what a server is doing (this is especially valuable when DDoS attacks occur). The netstat and lsof tools provide quite a bit of visibility into this area, but I’ve recently started firing up the ss (socket stat) tool when I need to view socket information. Socket stat can display pretty much everything you ever wanted to know about the connections on your server. To get a basic breakdown of ports that applications are listening on, you can run ss with the “-l” option:

$ ss -l

Recv-Q Send-Q                         Local Address:Port                             Peer Address:Port   
0      128                                       :::ssh                                        :::*       
0      128                                        *:ssh                                         *:*       
0      128                                127.0.0.1:ipp                                         *:*       
0      128                                      ::1:ipp                                        :::*       

To view the processes that are using each socket, you can run ss with the “-p” option:

$ ss -p

State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
CLOSE-WAIT 1      0          192.168.1.1:57666         192.168.1.2:http     users:(("gvfsd-http",16992,14))

To display the amount of memory being consumed by the socket buffers, you can use the ss “-m” option, shown here together with “-e” for extended socket details (this is quite handy!):

$ ss -e -m

State       Recv-Q Send-Q                    Local Address:Port                        Peer Address:Port   
CLOSE-WAIT  1      0                        192.168.1.1:57666                       192.168.1.2:http     uid:500 ino:40834026 sk:ffff88022d3b2080
	 mem:(r360,w0,f3736,t0)

Additionally, you can use the ss “-s” option to summarize all of the socket states:

$ ss -s

Total: 571 (kernel 589)
TCP:   17 (estab 10, closed 0, orphaned 0, synrecv 0, timewait 0/0), ports 0

Transport Total     IP        IPv6
*	  589       -         -        
RAW	  0         0         0        
UDP	  10        6         4        
TCP	  17        14        3        
INET	  27        20        7        
FRAG	  0         0         0       

There are also options to display information about specific socket types (UNIX domain, UDP, TCP, etc.), and to dig deep into the connection table information (see the “-i” option for further details). If you have a current release of CentOS, RHEL or Fedora, this awesome tool should be on your system. It’s part of the iproute package.
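
When reviewing the connection table as a whole, a per-state count is often the quickest way to spot trouble. Here is a minimal sketch that tallies the first column of `ss -tan` output (TCP, all states, numeric addresses). The count_states helper is a made-up name and the demo input is invented:

```shell
#!/bin/sh
# count_states
# Hypothetical helper: reads `ss -tan` output on stdin and prints how many
# sockets are in each state, most common first.
count_states() {
    awk 'NR > 1 { n[$1]++ }                      # skip the header line
         END    { for (s in n) print n[s], s }' | sort -rn
}

# Demo with a small, made-up connection table.
count_states <<'EOF'
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    *:22                *:*
ESTAB      0      0      192.168.1.1:22      192.168.1.2:51234
ESTAB      0      0      192.168.1.1:22      192.168.1.3:40022
CLOSE-WAIT 1      0      192.168.1.1:57666   192.168.1.2:80
EOF
```

On a live system the equivalent is `ss -tan | count_states`. A large pile of SYN-RECV or CLOSE-WAIT entries is usually worth a closer look.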