How RHEL 5 and CentOS 5 detect hardware at boot

With the introduction of RHEL 6, our beloved kudzu was removed from Red Hat Enterprise Linux (it’s been gone from Fedora for quite some time). If you’re not familiar with kudzu, RHEL 5 and below use it to detect new hardware when a system is bootstrapped. All of the functionality that was part of kudzu is now handled by the kernel and udev, though a lot of sites will need to support RHEL 5 systems for years to come.

I was curious how kudzu detected new hardware, so I started reading through the kudzu man page and source code. I learned that hardware discovery and removal is actually pretty straightforward. When a box boots, kudzu probes the hardware and compares it against the contents of /etc/sysconfig/hwconf to see if anything was added or removed. Here is the relevant blurb from kudzu(8):

kudzu detects and configures new and/or changed hardware on a system. When started, kudzu detects the current hardware, and checks it against a database stored in /etc/sysconfig/hwconf, if one exists. It then determines if any hardware has been added or removed from the system. If so, it gives the users the opportunity to configure any added hardware, and unconfigure any removed hardware. It then updates the database in /etc/sysconfig/hwconf.

So when you get a prompt to add or remove hardware at startup, you are seeing kudzu in action. To view the list of hardware that kudzu is able to detect, you can run the kudzu command line utility with the “-p” option:

$ kudzu -p

class: OTHER
bus: PCI
detached: 0
driver: shpchp
desc: "VMware PCI Express Root Port"
vendorId: 15ad
deviceId: 07a0
subVendorId: 0000
subDeviceId: 0000
pciType: 1
pcidom:    0
pcibus:  0
pcidev: 18
pcifn:  7
   .....
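
Kudzu also supports a few flags beyond “-p” that are worth knowing about. The examples below are based on my read of kudzu(8), so double check the option names against the man page on your release before scripting around them:

# Probe only the PCI bus, and only for network class devices
$ kudzu -p -b pci -c NETWORK

# Run non-interactively and give up after 30 seconds instead of sitting
# at the add/remove hardware prompt forever (handy for unattended boots)
$ kudzu -q -t 30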

I’m not exactly sure what led me to dig into this, but it was rather fun. I’m one of those guys who enjoys going to lunch with 10 pages of C source code and a highlighter. Not sure if this makes me a geek, but it’s kinda cool learning while you consume your lunch calories. ;)

Kudzu removed from RHEL 6

With the introduction of RHEL 6 the kudzu hardware manager was removed. All of the functionality that was once a part of kudzu has been integrated into the kernel and udev, as evidenced by this e-mail correspondence with one of Red Hat’s support engineers:

“Kudzu is removed from rhel6. The kernel should be taking care of module loading from this point onwards. When it enumerates the device through its own methods or udev. It should automatically load the drivers based on device enumeration/identification based on their BUS such as PCI id’s USB ids. Other non hardware modules such as networking protocols will/should be loaded automatically on use ( unsupported/tech preview modules being the exception to this rule).”
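
In practice this means each device the kernel enumerates exports a modalias string in sysfs (its bus plus vendor/device IDs), and udev hands that string to modprobe, which maps it to a driver through the module alias tables. You can walk the same path by hand; the PCI address below is just an example, so substitute a device from your own system:

# Print the modalias string the kernel generated for a PCI device
$ cat /sys/bus/pci/devices/0000:02:00.0/modalias

# Feed that string back to modprobe to load the matching driver, the same
# way udev does it
$ modprobe $(cat /sys/bus/pci/devices/0000:02:00.0/modalias)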

Since kudzu has been with us for quite some time, I thought it only fitting that we give it a proper farewell. Now if only I had a goat to eat the leftovers. ;)

Measuring hard drive seek times

I’ve been fascinated with I/O and file system performance for years, and chose prefetch.net as my domain name after reading about pre-fetching algorithms in the book UNIX Filesystems (a great book that I need to read again). Since most applications access data that is not laid out sequentially on a hard drive platter, seek times come into play when you start looking at getting the most bang for your buck on random I/O workloads.

This past week I saw this firsthand. I was playing around with some disk drives and started to wonder how many seeks I could get per second. I was also curious what the access time of each seek was, so I started poking around for a tool that displays this information. After a quick search, I came across seeker.

When seeker is run with the path to a block device, it initiates reads to the device and measures the number of seeks per second, the average time of each seek, and the starting and ending offsets that were used to randomly access blocks on the disk. Here is a sample run:

$ seeker /dev/sda

Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [156250000 blocks, 80000000000 bytes, 74 GB, 76293 MB, 80 GiB, 80000 MiB]
[512 logical sector size, 512 physical sector size]
[1 threads]
Wait 30 seconds..............................
Results: 50 seeks/second, 19.881 ms random access time (128677785 < offsets < 79965549334)

If you want to see the effects of multiple threads accessing a disk, you can pass a thread count to seeker:

$ seeker /dev/sda 32

Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [156250000 blocks, 80000000000 bytes, 74 GB, 76293 MB, 80 GiB, 80000 MiB]
[512 logical sector size, 512 physical sector size]
[32 threads]
Wait 30 seconds..............................
Results: 53 seeks/second, 18.587 ms random access time (36588683 < offsets < 79946851693)

In the example above, 32 threads access data at random locations, and the results from these disk accesses are measured and printed. Seeker is a super handy utility, and I would put it right up there with blktrace.
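
If you’re curious what a benchmark like this is doing under the hood, the general idea is simple: read single sectors from random offsets on the raw device for a fixed window of time and count how many reads complete. The script below is a crude single-threaded sketch of that idea (it is not how seeker itself is implemented), assuming GNU dd with O_DIRECT support and root access to a disk you aren’t using for anything important:

#!/bin/bash
# Crude random-seek sketch; point DEV at the disk you want to beat on
DEV=/dev/sda
SECTORS=$(blockdev --getsz $DEV)      # device size in 512-byte sectors
END=$((SECONDS + 30))
SEEKS=0
while [ $SECONDS -lt $END ]; do
    # Pick a random sector and read it directly from the device,
    # bypassing the page cache so every read forces a seek
    OFFSET=$(( (RANDOM * 32768 + RANDOM) % SECTORS ))
    dd if=$DEV of=/dev/null bs=512 count=1 skip=$OFFSET iflag=direct 2>/dev/null
    SEEKS=$((SEEKS + 1))
done
echo "$SEEKS seeks in 30 seconds (~$((SEEKS / 30)) seeks/second)"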

Speeding up SSH (SCP) data transfers

I’ll be the first to admit that I’m an SCP addict. It doesn’t matter what kind of data I’m working with; if it can be turned into an object that I can move around with scp, I’m in! One thing I’ve always noticed with scp is the dismal out-of-the-box performance. I read quite some time back on the openssh mailing list that there were some fixed buffers inside openssh that prevented copy operations from fully utilizing high-speed network links.

This past week my fellow blogging partner Mike sent me a link to the high performance ssh project. This nifty project created a patch that replaces the fixed-length buffers with values determined at runtime. This is pretty fricking cool, and I hope they get their changes merged into the openssh source code! The site that currently hosts the HPN-SSH code also has a good article on tuning your TCP/IP network stack. Good stuff!
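
Until those patches land in stock openssh, one stop-gap I’ve seen used (it only helps when the bottleneck is cipher overhead on the CPU, not the fixed buffers themselves) is to pick a cheaper cipher and optionally enable compression. The file name and host below are placeholders:

# arcfour is one of the lighter-weight ciphers in openssh releases of this
# era; -C enables compression, which helps with compressible data and hurts
# with data that is already compressed
$ scp -c arcfour -C bigfile.tar user@remotehost:/var/tmp/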

Gathering statistics from your postfix mail logs

I have been supporting postfix mail relays for several years. When you are tasked with ensuring that your SMTP relays work without issue, you need to stay on top of the logs and the current operational state of each server to prevent major problems from occurring. Problems can take the form of blacklisted relays, mail routing loops, spam attacks or people trying to exploit a known issue in the SMTP software. All of these items suck for a site that relies on mail to function.

To help me keep on top of the relays I support, I use a multi-faceted approach. I summarize the log files daily, create scripts to notify me of problems before they cause issues, and try to be proactive about my relays (as much as you can be). To summarize my logs I use a combination of custom-developed scripts and the incredibly awesome pflogsumm Perl script. The output of the scripts I developed provides me with specific data, and pflogsumm produces a beautiful report that summarizes the overall operation of my Postfix SMTP relays:

$ zcat /var/log/maillog-20110710.gz | pflogsumm.pl

Grand Totals
------------
messages

  19004   received
  20824   delivered
    696   forwarded
      3   deferred  (3  deferrals)
      0   bounced
      0   rejected (0%)
      0   reject warnings
      0   held
      0   discarded (0%)

  57945k  bytes received
 139101k  bytes delivered
     66   senders
     28   sending hosts/domains
    175   recipients
     31   recipient hosts/domains


Per-Day Traffic Summary
    date          received  delivered   deferred    bounced     rejected
    --------------------------------------------------------------------
    Jul  3 2011      2006       1875 
    Jul  4 2011      2212       1992          2 
    Jul  5 2011      2855       3568 
    Jul  6 2011      3212       3823 
    Jul  7 2011      3212       3962          1 
    Jul  8 2011      2581       2747 
    Jul  9 2011      2665       2654 
    Jul 10 2011       261        203 

Per-Hour Traffic Daily Average
    time          received  delivered   deferred    bounced     rejected
    --------------------------------------------------------------------
    0000-0100          78         51          0          0          0 
    0100-0200          74         51          0          0          0 
    0200-0300          99         87          0          0          0 
    0300-0400          73         52          0          0          0 
    0400-0500          73         53          0          0          0 
    0500-0600          76         68          0          0          0 
    0600-0700         110        129          0          0          0 
    0700-0800         106        134          0          0          0 
    0800-0900          84         70          0          0          0 
    0900-1000          92        103          0          0          0 
    1000-1100         103        106          0          0          0 
    1100-1200         102        103          0          0          0 
    1200-1300          95         93          0          0          0 
    1300-1400         103        116          0          0          0 
    1400-1500         108        124          0          0          0 
    1500-1600         112        142          0          0          0 
    1600-1700         114        129          0          0          0 
    1700-1800         111        125          0          0          0 
    1800-1900         135        176          0          0          0 
    1900-2000         144        194          0          0          0 
    2000-2100         134        224          0          0          0 
    2100-2200          88        101          0          0          0 
    2200-2300          82         85          0          0          0 
    2300-2400          82         88          0          0          0 

Host/Domain Summary: Message Delivery 
 sent cnt  bytes   defers   avg dly max dly host/domain
 -------- -------  -------  ------- ------- -----------
  10247    81716k       1     0.5 s    5.5 m  prefetch.net
    .....

Host/Domain Summary: Messages Received 
 msg cnt   bytes   host/domain
 -------- -------  -----------
   3094    22088k  server1.prefetch.net
     .....

Senders by message count
------------------------
   2288   badperson1@prefetch.net
   1630   badperson2@prefetch.net
     .....

Recipients by message count
---------------------------
   4415   donotspam1@prefetch.net
   3046   donotspam2@prefetch.net
     .....

Senders by message size
-----------------------
  15786k  badperson1@prefetch.net
  13414k  badperson2@prefetch.net
     .....

Recipients by message size
--------------------------
  29981k  donotspam1@prefetch.net
  25492k  donotspam2@prefetch.net
     .....

Messages with no size data
--------------------------
 D07FA650  fubar@prefetch.net
     .....

message deferral detail
-----------------------
  smtp (total: 3)
         2   25: Connection timed out
         1   lost connection with prefetch.net[1.1.1.1...

message bounce detail (by relay): none

message reject detail: none

message reject warning detail: none

message hold detail: none

message discard detail: none

smtp delivery failures: none

Warnings
--------
  smtpd (total: 14)
        14   non-SMTP command from unknown[1.1.1.1]: To: 
  trivial-rewrite (total: 430)
       430   database /etc/postfix/transport.db is older than source file /e...

Fatal Errors: none

Panics: none

Master daemon messages: none

I can’t tell you how many times I’ve run this script and seen something in the output that left me wondering WTF?!? Several times I’ve been able to use the output from this script to nip a problem in the bud before it became a major issue. The output is also handy for developing some additional monitoring for your SMTP services. If you are using postfix you should definitely check out pflogsumm. You will be glad you did!!!
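
If you want this report waiting in your inbox every morning, pflogsumm is easy to wire into cron. The script path, log location, and recipient below are placeholders for whatever your environment uses, and the “-d yesterday” option tells pflogsumm to only summarize the previous day’s entries:

# Mail yesterday's postfix summary to the postmaster at 4am each day
0 4 * * * /usr/sbin/pflogsumm.pl -d yesterday /var/log/maillog 2>&1 | mail -s "postfix summary" postmaster@prefetch.net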

Using netstat and dropwatch to observe packet loss on Linux servers

Anyone who is running a modern operating system is most likely utilizing TCP/IP to send and receive data. Modern TCP/IP stacks are somewhat complex and have a slew of tunables to control their behavior. The choice of when and when not to tune is not always super clear cut, since the documentation and the advice of various network “experts” don’t always jibe.

When I’m looking into performance problems that are network related, one of the first things I review is the netstat “-s” output:

$ netstat -s

Ip:
    25030820 total packets received
    269 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    21629075 incoming packets delivered
    21110503 requests sent out
Icmp:
    12814 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2
        echo requests: 12809
        echo replies: 3
    12834 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 22
        echo request: 3
        echo replies: 12809
IcmpMsg:
        InType0: 3
        InType3: 2
        InType8: 12809
        OutType0: 12809
        OutType3: 22
        OutType8: 3
Tcp:
    138062 active connections openings
    1440632 passive connection openings
    7 failed connection attempts
    2262 connection resets received
    8 connections established
    12225207 segments received
    10785279 segments send out
    10269 segments retransmited
    0 bad segments received.
    69115 resets sent
Udp:
    553643 packets received
    22 packets to unknown port received.
    0 packet receive errors
    6911684 packets sent
UdpLite:
TcpExt:
    33773 invalid SYN cookies received
    154132 TCP sockets finished time wait in fast timer
    6 time wait sockets recycled by time stamp
    72284 delayed acks sent
    3 delayed acks further delayed because of locked socket
    Quick ack mode was activated 269 times
    3359 packets directly queued to recvmsg prequeue.
    2592713 packets directly received from backlog
    4021 packets directly received from prequeue
    3557638 packets header predicted
    1732 packets header predicted and directly queued to user
    1939991 acknowledgments not containing data received
    3179859 predicted acknowledgments
    1631 times recovered from packet loss due to SACK data
    Detected reordering 1034 times using FACK
    Detected reordering 1007 times using SACK
    Detected reordering 622 times using time stamp
    1557 congestion windows fully recovered
    4236 congestion windows partially recovered using Hoe heuristic
    299 congestion windows recovered after partial ack
    2 TCP data loss events
    5 timeouts after SACK recovery
    5 timeouts in loss state
    2511 fast retransmits
    2025 forward retransmits
    88 retransmits in slow start
    5518 other TCP timeouts
    295 DSACKs sent for old packets
    35 DSACKs sent for out of order packets
    251 DSACKs received
    25247 connections reset due to unexpected data
    2248 connections reset due to early user close
    6 connections aborted due to timeout
    TCPSACKDiscard: 2707
    TCPDSACKIgnoredOld: 65
    TCPDSACKIgnoredNoUndo: 12
    TCPSackShifted: 4176
    TCPSackMerged: 2301
    TCPSackShiftFallback: 98834
IpExt:
    InMcastPkts: 2
    OutMcastPkts: 3390453
    InBcastPkts: 8837402
    InOctets: 5156017179
    OutOctets: 2509510134
    InMcastOctets: 80
    OutMcastOctets: 135618120
    InBcastOctets: 2127986990

The netstat output contains a slew of data you can use to see how much data your host is processing, whether it’s accepting and processing data efficiently, and whether the buffers that link the various layers (Ethernet -> IP -> TCP -> APP) are working optimally.
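
Since these counters are cumulative from boot, the absolute values only tell you so much; what you usually care about is how quickly a given counter is climbing while the system is under load. A quick and dirty way to watch one (TCP retransmits in this case) over time:

# Sample the retransmit counters every 10 seconds; a steadily climbing
# value under load is a cue to dig deeper
$ while true; do date; netstat -s | grep -i retrans; sleep 10; done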

When I build new Linux machines via kickstart, I make sure my profile contains the ktune package. That is all the tuning I do to start, unless an application or database requires a specific setting (think large pages and SysV IPC settings for Oracle).
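
For reference, wiring ktune into a kickstart profile is nothing more than adding the package and turning the service on. The snippet below is what I would expect that to look like on RHEL 5/6, so verify the package and service names on your release:

# In the %packages section of the kickstart profile
ktune

# In the %post section, enable the tuning service at boot
chkconfig ktune on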

Once I’ve met with an application resource and a business analyst, I like to pound the application with a representative benchmark and compare the system performance before and after the stress test. By comparing the before and after results I can see exactly where the system is choking (this is very rare), or if the application needs to be modified to accommodate additional load. If the application is a standard TCP/IP-based application that utilizes HTTP, I’ll typically turn to siege and iperf to stress my applications and systems.
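
My typical invocations look something like the ones below; the URL and host names are obviously placeholders:

# Hammer a URL with 50 concurrent simulated users for 5 minutes
$ siege -c 50 -t 5M http://app.prefetch.net/some/page

# Measure raw TCP throughput between two hosts: start a server on one end,
# then push data at it with 4 parallel streams for 30 seconds from the other
$ iperf -s
$ iperf -c app.prefetch.net -P 4 -t 30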

If during load-testing I notice that data is being dropped in one or more queues, I’ll fire up dropwatch to observe where in the TCP/IP stack data is being dropped:

$ dropwatch -l kas

Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at netlink_sendskb+14d (0xffffffff813df30e)
1 drops at ip_rcv_finish+32e (0xffffffff813f0c93)
4 drops at ip_local_deliver+291 (0xffffffff813f12d7)
64 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
32 drops at ip_local_deliver+291 (0xffffffff813f12d7)
23 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)
1 drops at ip_rcv_finish+32e (0xffffffff813f0c93)
4 drops at .brk.dmi_alloc+1e60bd47 (0xffffffffa045fd47)
2 drops at skb_queue_purge+60 (0xffffffff813b6542)
64 drops at unix_stream_recvmsg+44a (0xffffffff81440fb9)

This allows you to see if data is being dropped at the link layer, the IP layer, the UDP/TCP layer or the application layer. If the drops are occurring somewhere in TCP/IP (i.e. inside the kernel) I will review the kernel documentation and source code to see what occurs at the specific areas of the kernel listed in the dropwatch output, and find the sysctl values that control the sizes of the buffers at that layer (some are dynamic, some are fixed).
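
The exact sysctls depend on where the drops show up, but a few of the usual suspects I end up reviewing are listed below (read the kernel’s Documentation/networking/ip-sysctl.txt before changing any of them):

# Per-socket receive and send buffer ceilings
$ sysctl net.core.rmem_max net.core.wmem_max

# TCP autotuning min/default/max buffer sizes
$ sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# Backlog between the NIC driver and the IP layer
$ sysctl net.core.netdev_max_backlog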

Tuning applications to perform optimally has filled dozens and dozens of books, and it’s a fine art that you learn from seeing problems erupt in the field. It also helps to know how to interpret all the values in the netstat output, and I cannot recommend TCP/IP volume I, TCP/IP volume II and TCP/IP volume III enough! Everyone who runs an IP-connected system should be required to read these before they are allowed access to the system. :)