Testing SSL services

If you manage web applications and servers, you may have encountered a poorly written application or a web server that periodically hangs for no apparent reason. These issues usually pop up out of the blue, and most people rely on their user community to notify them when problems are detected. To ensure timely notifications when these problems occur, I developed ssl-service-check. ssl-service-check is written in Bourne shell, and uses the OpenSSL toolkit to connect to a service and issue a “GET /” request. If the service fails to respond, ssl-service-check will log an error to syslog and send an e-mail to the address defined in the global ADMINS variable. To test whether the mail.prefetch.net web server is handling requests on TCP port 444, we can execute ssl-service-check with the “-s” (server to connect to) and “-p” (port number to connect to) options:

$ ssl-service-check.sh -s mail.prefetch.net -p 444

$ tail -1 /var/adm/messages
Nov 3 18:23:28 tigger matty: [ID 702911 daemon.notice] Failed to connect to mail.prefetch.net on Port 444

ssl-service-check was written to work with cron, and can easily be integrated with a network monitoring solution.
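
For readers who want a feel for how such a check can work, here is a minimal sketch of the core idea. It is not the actual ssl-service-check source: the function names, the OPENSSL override variable, and the default ADMINS value are assumptions made for illustration, and the script only relies on the standard openssl s_client, logger, and mailx commands.

```shell
#!/bin/sh
# Hypothetical sketch of the core check; the real ssl-service-check.sh may differ.
OPENSSL=${OPENSSL:-openssl}   # command used to open the SSL connection (assumed name)
ADMINS=${ADMINS:-root}        # who receives the e-mail (placeholder address)

# Succeed only if the service answers "GET /" with an HTTP status line.
check_ssl_service() {
    host=$1
    port=$2
    reply=$(printf 'GET / HTTP/1.0\r\n\r\n' |
        "$OPENSSL" s_client -connect "$host:$port" -quiet 2>/dev/null)
    case $reply in
        *HTTP/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Log to syslog and mail the admins, using the message format shown above.
report_failure() {
    msg="Failed to connect to $1 on Port $2"
    logger -p daemon.notice "$msg"
    echo "$msg" | mailx -s "$msg" "$ADMINS"
}
```

A cron job could then run something like `check_ssl_service mail.prefetch.net 444 || report_failure mail.prefetch.net 444`. Note that s_client with -quiet waits for the server to close the connection, so this sketch would block on a truly hung server; a production script needs some form of timeout around the connection.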

Adding mirrors to Veritas Volume Manager volumes

One of the cool features of Veritas Volume Manager (VxVM) is its ability to change the layout of a volume on the fly with vxassist(1m). This capability has helped me numerous times, especially when I needed to mirror volumes that weren’t mirrored. Given the following unmirrored striped volume:

$ vxprint -hft

Disk group: oradg

DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         NVOLUME      KSTATE   STATE
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO

dg oradg        default      default  10000    1127240283.19.winnie

dm c1t1d0       c1t1d0s2     auto     2048     35521408 -
dm c1t2d0       c1t2d0s2     auto     2048     35521408 -
dm c1t3d0       c1t3d0s2     auto     2048     35521408 -
dm c1t4d0       c1t4d0s2     auto     2048     35365968 -
dm c1t5d0       c1t5d0s2     auto     2048     35521408 -
dm c1t6d0       c1t6d0s2     auto     2048     35521408 -

v  oravol01     -            ENABLED  ACTIVE   20971520 SELECT    oravol01-01 fsgen
pl oravol01-01  oravol01     ENABLED  ACTIVE   20971776 STRIPE    3/128    RW
sd c1t1d0-01    oravol01-01  c1t1d0   0        6990592  0/0       c1t1d0   ENA
sd c1t2d0-01    oravol01-01  c1t2d0   0        6990592  1/0       c1t2d0   ENA
sd c1t3d0-01    oravol01-01  c1t3d0   0        6990592  2/0       c1t3d0   ENA

We can easily add a mirror by invoking vxassist(1m) with the “mirror” option:

$ vxassist mirror oravol01 layout=stripe ncol=3 &

The mirror option accepts a layout option and several keywords to control the layout of the new mirror. In this example we used a 3-column striped plex to match the layout of the existing plex. After the mirror operation completes, the volume will contain a second plex (the mirror) that matches the original:

$ vxprint -hft

Disk group: oradg

DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         NVOLUME      KSTATE   STATE
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO

dg oradg        default      default  10000    1127240283.19.winnie

dm c1t1d0       c1t1d0s2     auto     2048     35521408 -
dm c1t2d0       c1t2d0s2     auto     2048     35521408 -
dm c1t3d0       c1t3d0s2     auto     2048     35521408 -
dm c1t4d0       c1t4d0s2     auto     2048     35365968 -
dm c1t5d0       c1t5d0s2     auto     2048     35521408 -
dm c1t6d0       c1t6d0s2     auto     2048     35521408 -

v  oravol01     -            ENABLED  ACTIVE   20971520 SELECT    -        fsgen
pl oravol01-01  oravol01     ENABLED  ACTIVE   20971776 STRIPE    3/128    RW
sd c1t1d0-01    oravol01-01  c1t1d0   0        6990592  0/0       c1t1d0   ENA
sd c1t2d0-01    oravol01-01  c1t2d0   0        6990592  1/0       c1t2d0   ENA
sd c1t3d0-01    oravol01-01  c1t3d0   0        6990592  2/0       c1t3d0   ENA
pl oravol01-02  oravol01     ENABLED  ACTIVE   20971776 STRIPE    3/128    RW
sd c1t4d0-01    oravol01-02  c1t4d0   0        6990592  0/0       c1t4d0   ENA
sd c1t5d0-01    oravol01-02  c1t5d0   0        6990592  1/0       c1t5d0   ENA
sd c1t6d0-01    oravol01-02  c1t6d0   0        6990592  2/0       c1t6d0   ENA
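
If you would rather not compare the two plexes by eye, the vxprint output can be summarized with a few lines of awk. The following is a sketch, with summarize_plexes being a made-up helper name; it assumes the "vxprint -hft" column layout shown above (plex name in the third column of "sd" records, subdisk length in the sixth).

```shell
#!/bin/sh
# Hypothetical helper: reads "vxprint -hft" output on stdin and prints, for
# each plex, its name, the number of subdisks, and their combined length.
summarize_plexes() {
    awk '
        $1 == "pl" { order[++n] = $2 }                  # remember plexes in order
        $1 == "sd" { count[$3]++; total[$3] += $6 }     # tally subdisks per plex
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d %d\n", order[i], count[order[i]], total[order[i]]
        }
    '
}
```

Running `vxprint -hft | summarize_plexes` against the mirrored volume above should report three subdisks and 20971776 sectors (3 x 6990592) for both oravol01-01 and oravol01-02, which is a quick way to confirm that the new plex matches the original.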

Veritas makes managing storage a snap!

Determining the reason behind rx_overflow values (part 1).

While performing some routine checks on one of the servers I support, I noticed numerous input errors on Gigabit Ethernet interface zero:

$ netstat -i

Name  Mtu  Net/Dest      Address      Ipkts     Ierrs  Opkts     Oerrs Collis Queue
lo0   8232 loopback      localhost    959       0      959       0     0      0
ge0   1500 server1       server1      713548208 155599 680686711 0     0      0
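
To put the Ierrs count in perspective, it helps to express it as a fraction of the packets received. The sketch below does that with awk; ierr_rate is a hypothetical helper name, and it assumes the Solaris "netstat -i" column order shown above (Ipkts in column five, Ierrs in column six).

```shell
#!/bin/sh
# Hypothetical helper: reads "netstat -i" output on stdin and prints the
# input error rate (Ierrs as a percentage of Ipkts) for each interface.
ierr_rate() {
    awk 'NF >= 6 && $1 != "Name" && $5 > 0 {
        printf "%s %.4f%%\n", $1, 100 * $6 / $5
    }'
}
```

With the numbers above, `netstat -i | ierr_rate` works out to roughly 0.02% for ge0. That is a small fraction overall, but as the rest of this post shows, the errors were not spread evenly over time.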

Since this was a Sun server running Solaris 9, I fired up the kstat(1m) utility to find the cause of these errors:

$ kstat -m ge -i 0

module: ge                              instance: 0
name:   ge0                             class:    net

        align_errors                    0
        allocbfail                      0
        brdcstrcv                       72370571
        brdcstxmt                       2878
        carrier_errors                  0
        collisions                      0
        crtime                          8.264993338
        defer_timer_exp                 0
        defer_xmts                      0
        drop                            4239759
        ex_collisions                   0
        excessive_coll                  0
        fcs_errors                      0
        first_coll                      0
        ge_csumerr                      8
        ge_queue_cnt                    0
        ge_queue_full_cnt               0
        ierrors                         155599
        ifspeed                         1000000000
        inits                           28
        ipackets                        713548130
        ipackets64                      713548130
        jabber                          0
        late_coll                       0
        link_up                         1
        mac_mode                        2
        macrcv_errors                   0
        macxmt_errors                   0
        multircv                        14521873
        multixmt                        0
        no_free_rx_desc                 0
        no_tmds                         0
        nocanput                        3625
        nocarrier                       1
        norcvbuf                        0
        noxmtbuf                        0
        obytes                          978892756
        obytes64                        615159216084
        oerrors                         0
        opackets                        680686711
        opackets64                      680686711
        pause_off_cnt                   0
        pause_on_cnt                    0
        pause_rcv_cnt                   0
        pause_time_cnt                  0
        pci_badack                      0
        pci_bus_speed                   33
        pci_bus_width                   0
        pci_data_parity_err             0
        pci_det_parity_err              0
        pci_dtrto                       0
        pci_rcvd_master_abort           0
        pci_rcvd_target_abort           0
        pci_signal_system_err           0
        pci_signal_target_abort         0
        peak_attempt_cnt                0
        rbytes                          2933894033
        rbytes64                        535509838737
        rcv_dma_mode                    2
        rx_align_err                    0
        rx_code_viol_err                0
        rx_crc_err                      0
        rx_error_ack                    0
        rx_hang                         0
        rx_late_error                   0
        rx_length_err                   0
        rx_overflow                     155578
        rx_parity_error                 0
        rxinits                         0
        rxtag_error                     0
        slv_error_ack                   0
        slv_parity_error                0
        snaptime                        8369381.49486557
        sqe_errors                      0
        toolong_errors                  0
        tx_error_ack                    0
        tx_late_error                   0
        tx_parity_error                 0
        txinits                         0
        txmac_maxpkt_err                0
        txmac_urun                      0
        xmit_dma_mode                   6

After reviewing the kstat(1m) output I noticed that the rx_overflow value was well in excess of 150k. Since the word “overflow” is never a good sign, I started to research this issue by reading the manual page for gld(7D). This page contains descriptions for the generic LAN driver (gld) kstat values, but for some reason didn’t include a description for rx_overflow (the name is self-evident, but I wanted a definite answer). After a quick Google search I came across the following information in the Sun blueprint “Maximizing Performance of a Gigabit Ethernet NIC Interface”:

“When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If it is incrementing and no_free_rx_desc is not, this indicates that the PCI bus or SBus bus is presenting an issue to the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus bus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic. Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has similar bandwidth requirements to those of the ge card.”
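
The blueprint’s rule of thumb talks about counters that are incrementing, so it is the change between two samples that matters rather than the absolute values. Here is a rough sketch of that comparison. It assumes two snapshots saved with the parseable "kstat -p" output format (module:instance:name:statistic, a tab, then the value) and a POSIX shell for the arithmetic; stat_value and classify_rx_overflow are made-up names, and the messages are my paraphrase of the blueprint.

```shell
#!/bin/sh
# Print one statistic's value from saved "kstat -p" output.
# usage: stat_value FILE STATISTIC
stat_value() {
    awk -v want="$2" '{
        n = split($1, part, ":")          # last component is the statistic name
        if (part[n] == want) print $2
    }' "$1"
}

# Compare two snapshots and apply the blueprint rule of thumb.
# usage: classify_rx_overflow BEFORE_FILE AFTER_FILE
classify_rx_overflow() {
    before=$1
    after=$2
    d_over=$(( $(stat_value "$after" rx_overflow) - $(stat_value "$before" rx_overflow) ))
    d_desc=$(( $(stat_value "$after" no_free_rx_desc) - $(stat_value "$before" no_free_rx_desc) ))
    if [ "$d_over" -le 0 ]; then
        echo "rx_overflow is not incrementing"
    elif [ "$d_desc" -le 0 ]; then
        echo "rx_overflow only: suspect the I/O bus"
    else
        echo "rx_overflow and no_free_rx_desc: packet processing is not keeping up"
    fi
}
```

One way to use it would be to run `kstat -p -m ge -i 0 > before`, wait a minute, run `kstat -p -m ge -i 0 > after`, and then call `classify_rx_overflow before after`.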

After reading through the blueprint, I followed its advice and checked the no_free_rx_desc value. The no_free_rx_desc value was zero, so I again followed the blueprint’s advice and checked the hardware configuration. I first reviewed the prtdiag(1m) output to get the server identification string, and then turned to the SunSolve FE handbook. The handbook indicated that the PCI bus ran at a clock rate of 33 MHz, and the prtdiag(1m) output indicated that the disk controller and GE adaptor shared the PCI bus. To ensure that disk I/O bandwidth wasn’t a problem, I fired up iostat(1m) and monitored the number of bytes written per second. There was little I/O traffic, so it didn’t seem to be a bus congestion problem. Next I reviewed the recommended solutions in the blueprint:

1. Use DMA infinite burst capability mode by setting ge_dmaburst_mode in /etc/system. Since the machine uses an UltraSPARC IIi CPU, and DMA infinite burst mode is only applicable to UltraSPARC III or better CPUs, this solution won’t help us. Bummer!

2. Move the Gigabit Ethernet adaptor to a 66 MHz PCI slot. Since all of our slots are 33 MHz, this won’t help us either. Strike 2!

3. Move the Gigabit Ethernet adaptor to its own PCI bus. Since the machine we are using has a single PCI bus, we couldn’t use this option either. Youch!
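
Options 2 and 3 both come down to bus bandwidth, and a back-of-the-envelope calculation shows why the blueprint is wary of 33 MHz slots. The sketch below assumes a 32-bit wide PCI bus (the width isn’t stated anywhere in this post), uses nominal clock rates, and ignores all bus and protocol overhead, so the numbers are theoretical peaks rather than achievable throughput. pci_peak_mbits is a made-up helper name, and the arithmetic assumes a POSIX shell.

```shell
#!/bin/sh
# Theoretical peak PCI throughput in Mbit/s: clock (MHz) x bus width (bits).
pci_peak_mbits() {
    echo $(( $1 * $2 ))
}

GIGE_BIDIR=2000   # full bidirectional gigabit: 1000 Mbit/s in each direction

for clock in 33 66; do
    peak=$(pci_peak_mbits "$clock" 32)   # 32-bit width is an assumption
    if [ "$peak" -ge "$GIGE_BIDIR" ]; then
        verdict="above"
    else
        verdict="below"
    fi
    echo "$clock MHz x 32 bit = $peak Mbit/s peak, $verdict the $GIGE_BIDIR Mbit/s needed"
done
```

Under these assumptions a 33 MHz bus peaks at about half of what full bidirectional gigabit traffic needs, and that is before the disk controller on the same bus takes its share.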

Since these options weren’t applicable to our system, I started digging through our ORCA graphs to find the exact days and times when these errors occurred. After analyzing the graphs for all of about 60 seconds, I realized the errors were occurring at the exact same time each week (Monday afternoons). This was the time our weekly backups had been configured to run, and this would definitely saturate all of the available bandwidth. Since the backups were being performed during a busy time of the day, I speculated that the CPU and PCI bus weren’t sufficient to push all of the backup and production traffic. Since the system is not super critical, I plan to fire up busstat(1m) next Monday to prove my theory. I also plan to do some reading to see why layer-2 flow control isn’t implemented. That should theoretically be the “right fix” for this problem.

More to come …