Blog O' Matty


Testing SSL services

This article was posted by Matty on 2005-11-05 10:13:00 -0400

If you manage web applications and servers, you may have encountered a poorly written application or a web server that periodically hangs for no apparent reason. These problems usually pop up out of the blue, and most people rely on their user community to notify them when something breaks. To get timely notification when these problems occur, I developed ssl-service-check. ssl-service-check is written in Bourne shell and uses the OpenSSL toolkit to connect to a service and issue a “GET /”. If the service fails to respond, ssl-service-check logs an error to syslog and sends an e-mail to the address defined in the global ADMINS variable. To test whether the prefetch.net server mail.prefetch.net is handling requests on TCP port 444, we can execute ssl-service-check with the “-s” (server to connect to) and “-p” (port number to connect to) options:

$ ssl-service-check.sh -s mail.prefetch.net -p 444

$ tail -1 /var/adm/messages

Nov 3 18:23:28 tigger matty: [ID 702911 daemon.notice] Failed to connect to mail.prefetch.net on Port 444

ssl-service-check was written to run from cron and can easily be integrated with a network monitoring solution.
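For readers who want to build something similar, here is a minimal sketch of the core check. It is not the ssl-service-check source. It sends "GET /" through openssl s_client and treats any HTTP status line in the reply as a sign of life. The following are assumptions: the ADMINS value, the 10-second limit, and the use of timeout(1) (GNU coreutils, not stock Solaris), logger(1) and mailx(1). The function names are hypothetical.

```shell
# Minimal sketch in the spirit of ssl-service-check (NOT the original script).
# Assumptions: timeout(1) from GNU coreutils, logger(1) and mailx(1) present.

ADMINS="root@localhost"    # hypothetical alert address
TIMEOUT=10                 # hypothetical limit in seconds

# Succeeds if the first line on stdin is an HTTP status line.
looks_like_http() {
    head -n 1 | grep '^HTTP/1\.[01] [0-9][0-9][0-9]' >/dev/null
}

# Sends "GET /" over SSL to host $1, port $2 and checks the reply.
# -quiet also implies -ign_eof, so openssl waits for the server's answer;
# a hung server is killed by timeout(1) and the check fails.
probe() {
    printf 'GET / HTTP/1.0\r\n\r\n' |
        timeout "$TIMEOUT" openssl s_client -quiet -connect "$1:$2" 2>/dev/null |
        looks_like_http
}

# Logs the failure to syslog (daemon.notice) and mails the administrators.
report_failure() {
    msg="Failed to connect to $1 on Port $2"
    logger -p daemon.notice "$msg"
    echo "$msg" | mailx -s "$msg" "$ADMINS"
}
```

A cron job could then run `probe mail.prefetch.net 444 || report_failure mail.prefetch.net 444` every few minutes. Any HTTP status, even a 400 for the missing Host header, counts as a response. The goal here is to catch hangs, not to validate content.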

Viewing t.v. program listings

This article was posted by Matty on 2005-11-05 09:59:00 -0400

While checking my daily news sources, I came across a review of several open source t.v. program listing applications. If freeguide adds a search feature in a future release, I will definitely check it out (that way I can easily locate Seinfeld reruns).

Adding mirrors to Veritas Volume Manager volumes

This article was posted by Matty on 2005-11-03 19:11:00 -0400

One of the cool features of Veritas Volume Manager (VxVM) is its ability to change the layout of a volume on the fly with vxassist(1m). This feature has helped me numerous times, especially when I needed to mirror volumes that weren’t mirrored. Given the following unmirrored striped volume:

$ vxprint -hft

Disk group: oradg

DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME NVOLUME KSTATE STATE
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO

dg oradg default default 10000 1127240283.19.winnie

dm c1t1d0 c1t1d0s2 auto 2048 35521408 -
dm c1t2d0 c1t2d0s2 auto 2048 35521408 -
dm c1t3d0 c1t3d0s2 auto 2048 35521408 -
dm c1t4d0 c1t4d0s2 auto 2048 35365968 -
dm c1t5d0 c1t5d0s2 auto 2048 35521408 -
dm c1t6d0 c1t6d0s2 auto 2048 35521408 -

v oravol01 - ENABLED ACTIVE 20971520 SELECT oravol01-01 fsgen
pl oravol01-01 oravol01 ENABLED ACTIVE 20971776 STRIPE 3/128 RW
sd c1t1d0-01 oravol01-01 c1t1d0 0 6990592 0/0 c1t1d0 ENA
sd c1t2d0-01 oravol01-01 c1t2d0 0 6990592 1/0 c1t2d0 ENA
sd c1t3d0-01 oravol01-01 c1t3d0 0 6990592 2/0 c1t3d0 ENA

We can easily add a mirror by invoking vxassist(1m) with the “mirror” option:

$ vxassist mirror oravol01 layout=stripe ncol=3 &

The mirror operation accepts a layout attribute and several other keywords (such as ncol) to control the layout of the new plex. In this example we requested a 3-column striped plex to match the layout of the existing plex, and the trailing “&” runs vxassist in the background while the new plex synchronizes. After the mirror operation completes, the volume will contain a second plex (the mirror) that matches the original:

$ vxprint -hft

Disk group: oradg

DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME NVOLUME KSTATE STATE
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO

dg oradg default default 10000 1127240283.19.winnie

dm c1t1d0 c1t1d0s2 auto 2048 35521408 -
dm c1t2d0 c1t2d0s2 auto 2048 35521408 -
dm c1t3d0 c1t3d0s2 auto 2048 35521408 -
dm c1t4d0 c1t4d0s2 auto 2048 35365968 -
dm c1t5d0 c1t5d0s2 auto 2048 35521408 -
dm c1t6d0 c1t6d0s2 auto 2048 35521408 -

v oravol01 - ENABLED ACTIVE 20971520 SELECT - fsgen
pl oravol01-01 oravol01 ENABLED ACTIVE 20971776 STRIPE 3/128 RW
sd c1t1d0-01 oravol01-01 c1t1d0 0 6990592 0/0 c1t1d0 ENA
sd c1t2d0-01 oravol01-01 c1t2d0 0 6990592 1/0 c1t2d0 ENA
sd c1t3d0-01 oravol01-01 c1t3d0 0 6990592 2/0 c1t3d0 ENA
pl oravol01-02 oravol01 ENABLED ACTIVE 20971776 STRIPE 3/128 RW
sd c1t4d0-01 oravol01-02 c1t4d0 0 6990592 0/0 c1t4d0 ENA
sd c1t5d0-01 oravol01-02 c1t5d0 0 6990592 1/0 c1t5d0 ENA
sd c1t6d0-01 oravol01-02 c1t6d0 0 6990592 2/0 c1t6d0 ENA
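In the example above, the layout attributes were picked by hand to match the existing plex. The two hypothetical helpers below sketch how the attributes could be derived from the vxprint output, and how the result could be checked afterwards. They rely only on the column order in the PL header (NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE) and assume a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris). Passing NCOL/WID's width as vxassist's stripeunit attribute is also an assumption: vxprint reports WID in sectors, so check vxassist(1m) on your release before relying on it.

```shell
# $1 = disk group, $2 = volume. Reads "vxprint -hft" output on stdin and
# prints (does not run) a vxassist mirror command matching the volume's
# first STRIPE plex.
mirror_cmd_from_plex() {
    awk -v dg="$1" -v vol="$2" '
        $1 == "pl" && $3 == vol && $7 == "STRIPE" {
            split($8, nw, "/")    # "3/128" -> nw[1] = columns, nw[2] = width
            printf "vxassist -g %s mirror %s layout=stripe ncol=%s stripeunit=%s\n",
                dg, vol, nw[1], nw[2]
            exit
        }'
}

# $1 = volume. Counts the volume's plexes that are ENABLED and ACTIVE in
# "vxprint -hft" output on stdin; a mirrored volume should report 2 or more.
active_plex_count() {
    awk -v vol="$1" '
        $1 == "pl" && $3 == vol && $4 == "ENABLED" && $5 == "ACTIVE" { n++ }
        END { print n + 0 }'
}
```

For example, `vxprint -g oradg -hft oravol01 | mirror_cmd_from_plex oradg oravol01` prints `vxassist -g oradg mirror oravol01 layout=stripe ncol=3 stripeunit=128` for review. Once the mirror is done, `vxprint -g oradg -hft oravol01 | active_plex_count oravol01` should print 2. While the new plex is still synchronizing, it may not yet be reported as ENABLED/ACTIVE.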

Veritas makes managing storage a snap!

Accessing metadevices in single user mode

This article was posted by Matty on 2005-11-03 19:04:00 -0400

While perusing Sunsolve today, I came across an infodoc that described how to access metadevices in single user mode. I have never needed to perform this operation, but I was happy to learn how for future reference! :)

Determining the reason behind rx_overflow values (part 1).

This article was posted by Matty on 2005-11-01 22:30:00 -0400

While performing some routine checks on one of the servers I support, I noticed numerous input errors on Gigabit Ethernet interface zero (ge0):

$ netstat -i

Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 959 0 959 0 0 0
ge0 1500 server1 server1 713548208 155599 680686711 0 0 0
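To put "numerous" in perspective, the error count can be compared with the packet count. For ge0, 155599 errors out of 713548208 input packets works out to roughly 0.022% of received packets. That rate is small, but it is not zero, and it is worth explaining. The hypothetical helper below performs the same division for every interface. It assumes the Solaris `netstat -i` column order shown above (Ipkts in field 5, Ierrs in field 6).

```shell
# Prints each interface's input errors as a percentage of input packets,
# reading "netstat -i" output on stdin (Ipkts = $5, Ierrs = $6).
# Header and blank lines are skipped because $5 is not a number there.
input_error_rate() {
    awk '$5 ~ /^[0-9]+$/ && $5 > 0 { printf "%s %.4f%%\n", $1, 100 * $6 / $5 }'
}
```

Running `netstat -i | input_error_rate` on the output above would print `lo0 0.0000%` and `ge0 0.0218%`.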

Since this was a Sun server running Solaris 9, I fired up the kstat(1m) utility to find the cause of these errors:

$ kstat -m ge -i 0

module: ge instance: 0
name: ge0 class: net

align_errors 0
allocbfail 0
brdcstrcv 72370571
brdcstxmt 2878
carrier_errors 0
collisions 0
crtime 8.264993338
defer_timer_exp 0
defer_xmts 0
drop 4239759
ex_collisions 0
excessive_coll 0
fcs_errors 0
first_coll 0
ge_csumerr 8
ge_queue_cnt 0
ge_queue_full_cnt 0
ierrors 155599
ifspeed 1000000000
inits 28
ipackets 713548130
ipackets64 713548130
jabber 0
late_coll 0
link_up 1
mac_mode 2
macrcv_errors 0
macxmt_errors 0
multircv 14521873
multixmt 0
no_free_rx_desc 0
no_tmds 0
nocanput 3625
nocarrier 1
norcvbuf 0
noxmtbuf 0
obytes 978892756
obytes64 615159216084
oerrors 0
opackets 680686711
opackets64 680686711
pause_off_cnt 0
pause_on_cnt 0
pause_rcv_cnt 0
pause_time_cnt 0
pci_badack 0
pci_bus_speed 33
pci_bus_width 0
pci_data_parity_err 0
pci_det_parity_err 0
pci_dtrto 0
pci_rcvd_master_abort 0
pci_rcvd_target_abort 0
pci_signal_system_err 0
pci_signal_target_abort 0
peak_attempt_cnt 0
rbytes 2933894033
rbytes64 535509838737
rcv_dma_mode 2
rx_align_err 0
rx_code_viol_err 0
rx_crc_err 0
rx_error_ack 0
rx_hang 0
rx_late_error 0
rx_length_err 0
rx_overflow 155578
rx_parity_error 0
rxinits 0
rxtag_error 0
slv_error_ack 0
slv_parity_error 0
snaptime 8369381.49486557
sqe_errors 0
toolong_errors 0
tx_error_ack 0
tx_late_error 0
tx_parity_error 0
txinits 0
txmac_maxpkt_err 0
txmac_urun 0
xmit_dma_mode 6
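Scanning 80-odd counters by eye is error-prone. kstat(1m) also has a parseable mode (`-p`) that prints one `module:instance:name:statistic` key and its value per line, separated by a tab. The hypothetical filter below uses that format to keep only the nonzero counters whose names look error-related. The name pattern is my own choice, not an official list.

```shell
# Prints the nonzero error-style counters from "kstat -p" output on stdin.
# Each input line is "module:instance:name:statistic<TAB>value".
# The name pattern below is a hand-picked assumption, not an official list.
nonzero_error_stats() {
    awk -F'\t' '{
        n = split($1, key, ":")
        if (key[n] ~ /err|overflow|drop|fail|nocanput|hang/ && $2 + 0 != 0)
            print key[n], $2
    }'
}
```

For the ge0 statistics above, `kstat -p -m ge -i 0 | nonzero_error_stats` would reduce the listing to a handful of lines (drop, ge_csumerr, ierrors, nocanput and rx_overflow).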

After reviewing the kstat(1m) output, I noticed that the rx_overflow value was well in excess of 150k. Since the word “overflow” is never a good sign, I started to research this issue by reading the manual page for gld(7D). This page contains descriptions of the generic LAN driver (gld) kstat values, but for some reason it didn’t include a description of rx_overflow (the name is self-evident, but I wanted a definite answer). After a quick Google search, I came across the following information in the Sun blueprint “Maximizing Performance of a Gigabit Ethernet NIC Interface”:

“When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If it is incrementing and no_free_rx_desc is not, this indicates that the PCI bus or SBus bus is presenting an issue to the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus bus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic. Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has similar bandwidth requirements to those of the ge card.”
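The blueprint's test is about counters that are *incrementing*. A single kstat snapshot only shows totals since boot, so two snapshots need to be compared. The following sketch encodes the quoted rule. The helper names and file paths are hypothetical, and the case where both counters rise is left undiagnosed because the quoted passage does not cover it. The pci_bus_speed value that the blueprint mentions is already visible in the kstat output above (33).

```shell
# Prints how much statistic $1 grew between two "kstat -p" snapshots
# saved in files $2 (older) and $3 (newer).
kstat_delta() {
    awk -F'\t' -v s="$1" -v older="$2" '
        { n = split($1, key, ":") }
        key[n] == s { if (FILENAME == older) old = $2; else new = $2 }
        END { print new - old }' "$2" "$3"
}

# Applies the blueprint rule: rx_overflow rising while no_free_rx_desc
# stays flat points at the PCI or SBus path rather than the driver.
diagnose_rx_overflow() {
    ovf=$(kstat_delta rx_overflow "$1" "$2")
    nodesc=$(kstat_delta no_free_rx_desc "$1" "$2")
    if [ "$ovf" -gt 0 ] && [ "$nodesc" -eq 0 ]; then
        echo "rx_overflow +$ovf, no_free_rx_desc flat: suspect the I/O bus"
    elif [ "$ovf" -gt 0 ]; then
        echo "rx_overflow +$ovf, no_free_rx_desc +$nodesc: not covered by this rule"
    else
        echo "rx_overflow not incrementing"
    fi
}
```

Usage might look like this: `kstat -p -m ge -i 0 > /var/tmp/ge0.1`, wait a minute or so under load, take a second snapshot into `/var/tmp/ge0.2`, then run `diagnose_rx_overflow /var/tmp/ge0.1 /var/tmp/ge0.2`.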

Following the blueprint’s advice, I checked the no_free_rx_desc value. It was zero, so I turned to the hardware configuration next. I first reviewed the prtdiag(1m) output to get the server identification string, and then looked the system up in the Sunsolve FE handbook. The handbook indicated that the PCI bus ran at a clock rate of 33 MHz, and the prtdiag(1m) output indicated that the disk controller and GE adaptor shared the PCI bus. To make sure disk I/O bandwidth wasn’t the problem, I fired up iostat(1m) and monitored the number of bytes written per second. There was little I/O traffic, so disk traffic congesting the bus didn’t seem to be the problem. Next I reviewed the recommended solutions in the blueprint:

  1. Use DMA infinite burst capability mode by setting ge_dmaburst_mode in /etc/system. Since the machine uses an UltraSPARC IIi CPU, and DMA infinite burst mode only applies to UltraSPARC III or better CPUs, this solution won’t help us. Bummer!
  2. Move the Gigabit Ethernet adaptor to a 66 MHz PCI slot. Since all of our slots are 33 MHz, this won’t help us either. Strike 2!
  3. Move the Gigabit Ethernet adaptor to its own PCI bus. Since the machine we are using has a single PCI bus, we couldn’t use this option either. Youch!
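The blueprint's warning about a 33 MHz bus can be checked with some rough arithmetic. Assume a 32-bit slot (the pci_bus_width kstat above reads 0, which doesn't settle the width). Such a bus peaks at about 33 MHz × 4 bytes ≈ 132 MB/s. Full-duplex gigabit Ethernet can demand roughly 2 × 125 MB/s = 250 MB/s, which exceeds that peak before the shared disk controller adds any traffic. To see how much of the bus the disks are using, the per-device iostat figures can be totalled. The hypothetical helper below sums kr/s and kw/s for each report in `iostat -xn` output, assuming the standard 11-column layout.

```shell
# Sums kr/s + kw/s over all devices for each report in "iostat -xn" output
# on stdin and prints the total in MB/s (KB/s divided by 1024).
iostat_total_mbps() {
    awk '
        $1 == "r/s" { if (seen) printf "%.1f\n", kb / 1024; kb = 0; seen = 1; next }
        NF == 11 && $1 ~ /^[0-9.]+$/ { kb += $3 + $4 }
        END { if (seen) printf "%.1f\n", kb / 1024 }'
}
```

For example, `iostat -xn 5 2 | iostat_total_mbps` prints two totals. The first report covers the time since boot, so the second is the one to compare against the ~132 MB/s estimate.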

Since none of the blueprint’s recommended solutions applied to our system, I started digging through our ORCA graphs to find the exact days and times when these errors occurred. After analyzing the graphs for all of about 60 seconds, I realized the errors were occurring at the same time each week (Monday afternoons). This was when our weekly backups had been configured to run, and the backups would definitely saturate the available bandwidth. Because the backups were running during a busy time of the day, I speculated that the CPU and PCI bus couldn’t keep up with the combined backup and production traffic. The system is not super critical, so I plan to fire up busstat(1m) next Monday to prove my theory. I also plan to do some reading to see why layer-2 flow control isn’t implemented, since that should theoretically be the “right fix” for this problem.
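The ORCA graphs did the time correlation here. On a system without ORCA, a rough substitute is to record rx_overflow once a minute with a timestamp, then list the minutes in which it rose. The sketch below assumes the kstat `module:instance:name:statistic` specifier shown earlier. The helper names and the log location are hypothetical.

```shell
# Logs one line per minute: weekday, time, and the rx_overflow counter.
# "kstat -p" prints "ge:0:ge0:rx_overflow<TAB>value"; cut keeps the value.
sample_rx_overflow() {
    while :; do
        echo "$(date '+%a %H:%M') $(kstat -p ge:0:ge0:rx_overflow | cut -f2)"
        sleep 60
    done
}

# Reads that log on stdin and prints each minute in which the counter rose,
# with the increase; drops (e.g. after a reboot) are ignored.
overflow_intervals() {
    awk 'NR > 1 && $3 > prev { printf "%s %s +%d\n", $1, $2, $3 - prev } { prev = $3 }'
}
```

Start it with something like `sample_rx_overflow >> /var/tmp/rx_overflow.log &`. After a week, run `overflow_intervals < /var/tmp/rx_overflow.log` to see whether the increases cluster on Monday afternoons.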

More to come …