VxFS clear blocks mount option

While reading through the VxFS administrator's guide last week, I came across a cool mount option that can be used to zero out file system blocks prior to use:

“In environments where performance is more important than absolute data integrity, the preceding situation is not of great concern. However, for environments where data integrity is critical, the VxFS file system provides a mount -o blkclear option that guarantees that uninitialized data does not appear in a file.”

This is pretty cool, and a useful feature for environments that care more about never exposing stale, uninitialized data than about raw performance.

Preallocating files sequentially on VxFS file systems

One cool feature that is built into VxFS is the ability to preallocate files sequentially on disk. This capability can benefit sequential workloads, and will typically result in higher throughput since disk seeks are minimized (logical block addressing, disk drive defect management and storage array abstractions can obscure the physical layout, so this may not always hold).

To use the VxFS preallocation features, a file first needs to be created:

$ dd if=/dev/zero of=oradata01.dbf count=2097152
2097152+0 records in
2097152+0 records out

In this example, I created a 1GB file named oradata01.dbf (dd defaults to 512-byte blocks, and 2097152 blocks * 512 bytes per block gives us 1GB), and double-checked that it was 1GB by running ls with the “-h” option:

$ ls -lh

total 3.1G
-rw-r--r--  1 root root 1.0G Aug 25 09:06 oradata01.dbf
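
The size arithmetic can also be checked in bytes rather than relying on the rounded “-h” output. The following is a minimal sketch of the same dd technique, scaled down to 1MB so it runs quickly; the temporary file and the count of 2048 are my own choices for illustration, not values from the example above:

```shell
# dd uses 512-byte blocks when bs= is not given,
# so the file size in bytes is count * 512.
count=2048
expected_bytes=$((count * 512))

# Write the file to a throwaway location (hypothetical path from mktemp).
tmpfile=$(mktemp)
dd if=/dev/zero of="$tmpfile" count="$count" 2>/dev/null

# wc -c reports the exact byte count; tr strips any padding some
# implementations add.
actual_bytes=$(wc -c < "$tmpfile" | tr -d ' ')
echo "expected=$expected_bytes actual=$actual_bytes"

rm -f "$tmpfile"
```

Plugging in the count used for oradata01.dbf, 2097152 * 512 works out to 1073741824 bytes, which is exactly 1GB.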

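After a file of the correct size has been created, the setext utility can be used to reserve blocks for that file (“-r”) and to set a fixed extent size (“-e”). Both values are given in file system blocks rather than the 512-byte blocks dd uses, so on a file system with a 1024-byte block size (the Bsize value in the getext output below), 2097152 blocks covers 2GB, and 1048576 blocks would match the 1GB file exactly: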

$ setext -r 2097152 -e 2097152 oradata01.dbf

To verify the settings that were assigned to the file, the getext utility can be used:

$ getext oradata01.dbf

oradata01.dbf:  Bsize  1024  Reserve 2097152  Extent Size 2097152
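
Since setext counts in file system blocks and dd counts in 512-byte blocks, it is easy to mix the two up. Here is a small sketch for converting a file size into the block count to hand to setext; bytes_to_fs_blocks is a hypothetical helper of my own, and the block size has to come from the Bsize field that getext reports for your file system:

```shell
# Convert a size in bytes to a count of file system blocks.
# Usage: bytes_to_fs_blocks <size_bytes> <fs_block_size_bytes>
bytes_to_fs_blocks() {
    size_bytes=$1
    bsize=$2
    # Round up so a partial trailing block is still counted.
    echo $(( (size_bytes + bsize - 1) / bsize ))
}

file_bytes=$((2097152 * 512))           # the 1GB file created with dd
bytes_to_fs_blocks "$file_bytes" 1024   # blocks needed at Bsize 1024
echo $(( 2097152 * 1024 ))              # bytes covered by a 2097152 block reservation
```

With a Bsize of 1024 the first result is 1048576 blocks, while a reservation of 2097152 blocks covers 2147483648 bytes (2GB).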

This is an awesome feature, and yet another reason why VxFS is one of the best file systems available today!

Backing up the Veritas Cluster Server configuration

Veritas Cluster Server stores custom agents and its configuration data as a series of files in the /etc, /etc/VRTSvcs/conf/config and /opt/VRTSvcs/bin/ directories. Since these files are the lifeblood of the cluster engine, it is important to back them up to ensure the cluster can be recovered should disaster hit. VCS comes with the hasnap utility to simplify cluster configuration backups. When it is run with the “-backup,” “-n,” “-f <name of file to store snapshot>,” and “-m <description of snapshot>” options, a point-in-time snapshot of the cluster configuration will be written to the file passed to the “-f” option:

$ hasnap -backup -f clusterbackup.zip -n -m "Backup from March 25th 2007"

Starting Configuration Backup for Cluster foo

Dumping the configuration...

Registering snapshot "foo-2006.08.25-1156511358610"

Contacting host lnode1...

Error connecting to the remote host "lnode1"

Starting backup of files on host lnode2
"/etc/VRTSvcs/conf/config/types.cf" ----> 1.0
"/etc/VRTSvcs/conf/config/main.cf" ----> 1.0
"/etc/VRTSvcs/conf/config/vcsApacheTypes.cf" ----> 1.0
"/etc/llthosts" ----> 1.0
"/etc/gabtab" ----> 1.0
"/etc/llttab" ----> 1.0
"/opt/VRTSvcs/bin/vcsenv" ----> 1.0
"/opt/VRTSvcs/bin/LVMVolumeGroup/monitor" ----> 1.0
"/opt/VRTSvcs/bin/LVMVolumeGroup/offline" ----> 1.0
"/opt/VRTSvcs/bin/LVMVolumeGroup/online" ----> 1.0
"/opt/VRTSvcs/bin/LVMVolumeGroup/clean" ----> 1.0
"/opt/VRTSvcs/bin/ScriptAgent" ----> 1.0
"/opt/VRTSvcs/bin/LVMVolumeGroup/LVMVolumeGroup.xml" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/fdsched" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/monitor" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/fdsetup.vxg" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/open" ----> 1.0
"/opt/VRTSvcs/bin/ScriptAgent" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/RVGSnapshotAgent.pm" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/RVGSnapshot.xml" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/offline" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/online" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/attr_changed" ----> 1.0
"/opt/VRTSvcs/bin/RVGSnapshot/clean" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/monitor" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/open" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/RVGPrimary.xml" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/offline" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/online" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/clean" ----> 1.0
"/opt/VRTSvcs/bin/ScriptAgent" ----> 1.0
"/opt/VRTSvcs/bin/RVGPrimary/actions/fbsync" ----> 1.0
"/opt/VRTSvcs/bin/triggers/violation" ----> 1.0
"/opt/VRTSvcs/bin/CampusCluster/monitor" ----> 1.0
"/opt/VRTSvcs/bin/CampusCluster/close" ----> 1.0
"/opt/VRTSvcs/bin/ScriptAgent" ----> 1.0
"/opt/VRTSvcs/bin/CampusCluster/open" ----> 1.0
"/opt/VRTSvcs/bin/CampusCluster/CampusCluster.xml" ----> 1.0
"/opt/VRTSvcs/bin/RVG/monitor" ----> 1.0
"/opt/VRTSvcs/bin/RVG/info" ----> 1.0
"/opt/VRTSvcs/bin/ScriptAgent" ----> 1.0
"/opt/VRTSvcs/bin/RVG/RVG.xml" ----> 1.0
"/opt/VRTSvcs/bin/RVG/offline" ----> 1.0
"/opt/VRTSvcs/bin/RVG/online" ----> 1.0
"/opt/VRTSvcs/bin/RVG/clean" ----> 1.0
"/opt/VRTSvcs/bin/internal_triggers/cpuusage" ----> 1.0
Backup of files on host lnode2 complete

Backup succeeded partially

To check the contents of the snapshot, the unzip utility can be run with the “-t” option:

$ unzip -t clusterbackup.zip |more

Archive:  clusterbackup.zip
    testing: /cat_vcs.zip             OK
    testing: /categorylist.xml.zip    OK
    testing: _repository__data/vcs/foo/lnode2/etc/VRTSvcs/conf/config/types.cf.zip   OK
    testing: _repository__data/vcs/foo/lnode2/etc/VRTSvcs/conf/config/main.cf.zip   OK
    testing: _repository__data/vcs/foo/lnode2/etc/VRTSvcs/conf/config/vcsApacheTypes.cf.zip   OK
    testing: _repository__data/vcs/foo/lnode2/etc/llthosts.zip   OK
    testing: _repository__data/vcs/foo/lnode2/etc/gabtab.zip   OK
    testing: _repository__data/vcs/foo/lnode2/etc/llttab.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/vcsenv.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/monitor.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/offline.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/online.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/clean.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/LVMVolumeGroupAgent.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/LVMVolumeGroup/LVMVolumeGroup.xml.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/RVGSnapshot/fdsched.zip   OK
    testing: _repository__data/vcs/foo/lnode2/opt/VRTSvcs/bin/RVGSnapshot/monitor.zip   OK
  ......

Since parts of the cluster configuration can reside in memory and not on disk, it is a good idea to run “haconf -dump -makero” prior to running hasnap. This will ensure that the current configuration is being backed up, and will allow hasnap “-restore” to restore the correct configuration if disaster hits.
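
The two steps can be wrapped in a small script so the dump is never forgotten. This is only a sketch: backup_cluster_config, run and the DRY_RUN switch are names I made up for illustration, the flags are the ones shown above, and I have left out error handling because I have not checked how haconf reports a configuration that is already read-only:

```shell
# Run a command, or just print it when DRY_RUN=1 (handy on a box without VCS).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Flush the in-memory configuration to disk, then snapshot it.
# Usage: backup_cluster_config <snapshot file> <description>
backup_cluster_config() {
    snapshot_file=$1
    description=$2
    run haconf -dump -makero
    run hasnap -backup -f "$snapshot_file" -n -m "$description"
}

# Dry run: prints the two commands in the order they would execute.
DRY_RUN=1
backup_cluster_config clusterbackup.zip "Nightly backup"
```

Setting DRY_RUN to 0 (or leaving it unset) on a cluster node runs the real commands.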

Watching slab usage with slabtop

The Linux kernel uses a slab-based allocator to allocate kernel memory. Inside each slab is a collection of objects that have been allocated by one or more kernel subsystems. To monitor slab utilization in real time, most modern Linux distributions ship with the slabtop utility. When slabtop is run without any arguments, it displays a nice slab usage summary, and provides details of how various slabs are being used:

$ slabtop

 Active / Total Objects (% used)    : 220410 / 234629 (93.9%)
 Active / Total Slabs (% used)      : 4728 / 4728 (100.0%)
 Active / Total Caches (% used)     : 91 / 139 (65.5%)
 Active / Total Size (% used)       : 16609.59K / 18017.14K (92.2%)
 Minimum / Average / Maximum Object : 0.01K / 0.08K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 80591  80494  99%    0.02K    397      203      1588K avtab_node
 53562  53562 100%    0.03K    474      113      1896K size-32
 35108  35099  99%    0.05K    524       67      2096K buffer_head
  7378   7377  99%    0.27K    527       14      2108K radix_tree_node
  7182   7182 100%    0.14K    266       27      1064K dentry_cache
  6499   6497  99%    0.05K     97       67       388K selinux_inode_security
  5460   5398  98%    0.04K     65       84       260K sysfs_dir_cache
  3953   3425  86%    0.06K     67       59       268K size-64
  3663   3644  99%    0.41K    407        9      1628K inode_cache
  3654    479  13%    0.02K     18      203        72K biovec-1   
  2904   1882  64%    0.09K     66       44       264K vm_area_struct
  2580   2442  94%    0.12K     86       30       344K size-128
  2080    924  44%    0.19K    104       20       416K filp
  2070    744  35%    0.12K     69       30       276K bio 
  1944    660  33%    0.05K     27       72       108K journal_head
  1896   1892  99%    0.59K    316        6      1264K ext3_inode_cache
  1595    525  32%    0.02K     11      145        44K anon_vma
  1288   1209  93%    0.04K     14       92        56K Acpi-Operand
   845    730  86%    0.02K      5      169        20K Acpi-Namespace
   648    511  78%    0.05K      9       72        36K avc_node
   540    471  87%    0.43K     60        9       240K proc_inode_cache

This is a sweet utility, and is another one of those tools that should be in every Linux administrator's tool belt.
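
The default display is sorted by object count, which is not always what I want when hunting for memory consumers. Here is a rough sketch that ranks caches by the CACHE SIZE column instead. top_caches is a hypothetical helper of my own, it assumes the eight-column layout shown above, and it is fed a few rows copied from that output so it runs anywhere:

```shell
# Print the N largest caches (by CACHE SIZE, in K) from slabtop-style
# output on stdin.
top_caches() {
    n=$1
    awk '
        seen_header && NF >= 8 {     # data rows follow the OBJS header
            size = $7
            sub(/K$/, "", size)      # strip the trailing K from e.g. 1588K
            print size, $8
        }
        $1 == "OBJS" { seen_header = 1 }
    ' | sort -rn | head -n "$n"
}

# Sample rows copied from the output above. On a live system the input
# could come from "slabtop -o" instead (usually needs root).
top_caches 3 <<'EOF'
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 80591  80494  99%    0.02K    397      203      1588K avtab_node
 53562  53562 100%    0.03K    474      113      1896K size-32
 35108  35099  99%    0.05K    524       67      2096K buffer_head
  7378   7377  99%    0.27K    527       14      2108K radix_tree_node
  7182   7182 100%    0.14K    266       27      1064K dentry_cache
EOF
```

For these five rows the largest caches are radix_tree_node, buffer_head and size-32, even though avtab_node holds by far the most objects.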

Using the Solaris coreadm utility to control core file generation

Solaris has shipped with the coreadm utility for quite some time, and this nifty little utility allows you to control every facet of core file generation. This includes the ability to control where core files are written, the name of core files, which portions of the process's address space will be written to the core file, and my favorite option, whether or not to generate a syslog entry indicating that a core file was generated.

To begin using coreadm, you will first need to run it with the “-g” option to specify where core files should be stored, and the pattern that should be used when naming the core file:

$ coreadm -g /var/core/core.%f.%p

Once a directory and file pattern are specified, you can optionally adjust which portions of the process's address space (e.g., text segment, heap, ISM, etc.) will be written to the core file. To ease debugging, I like to configure coreadm to dump everything with the “-G all” option:

$ coreadm -G all

Since core files are typically created at odd working hours, I also like to configure coreadm to log messages to syslog indicating that a core file was created. This can be done by using the coreadm “-e log” option:

$ coreadm -e log

After these settings are adjusted, the coreadm “-e global” option can be used to enable global core file generation, and the coreadm utility can be run without any arguments to view the settings (which are stored in /etc/coreadm.conf):

$ coreadm -e global

$ coreadm

     global core file pattern: /var/core/core.%f.%p
     global core file content: all
       init core file pattern: core
       init core file content: default
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: disabled
     global core dump logging: enabled

Once global core file support is enabled, each time a process receives a signal whose default action is to dump core (e.g., SIGSEGV or SIGBUS):

$ kill -SIGSEGV 4652

A core file will be written to /var/core:

$ ls -al /var/core/*4652

-rw-------   1 root     root     4163953 Mar  9 11:51 /var/core/core.inetd.4652

And a message similar to the following will appear in the system log:

Mar 9 11:51:48 fubar genunix: [ID 603404 kern.notice] NOTICE: core_log: inetd[4652] core dumped: /var/core/core.inetd.4652

This is an amazingly useful feature, and can greatly simplify tracking down the root cause of software problems.
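
The core file name above follows directly from the pattern passed to “-g”: %f expands to the name of the executable and %p to the process ID. The expansion can be mimicked in the shell as a quick sanity check of a pattern before setting it. expand_core_pattern is a hypothetical helper of my own, not part of coreadm, and it handles only these two tokens:

```shell
# Expand the %f (executable name) and %p (process ID) tokens in a
# coreadm-style pattern. Other tokens are left untouched.
# Usage: expand_core_pattern <pattern> <executable name> <pid>
# Note: the executable name must not contain "/" because it is
# substituted with sed.
expand_core_pattern() {
    pattern=$1
    exec_name=$2
    pid=$3
    printf '%s\n' "$pattern" | sed -e "s/%f/$exec_name/g" -e "s/%p/$pid/g"
}

expand_core_pattern /var/core/core.%f.%p inetd 4652
```

For inetd with process ID 4652 this prints /var/core/core.inetd.4652, matching the file that showed up in /var/core.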

Measuring system call time with procsystime

When debugging application performance problems related to high system time, I typically start my analysis by watching the system calls the application is issuing, and measuring how much time is spent in each system call. Gathering this information is simple with the DTrace syscall provider, and the DTraceToolkit comes with the procsystime script to allow admins to easily analyze system call behavior. To use procsystime to measure how much time the sshd processes are spending in each system call, we can run procsystime with the “-T” option to print a total of the time spent in all system calls, and the “-n” option followed by the process name to analyze (in the example below, using the string “sshd” will cause procsystime to analyze the system call behavior for all processes named sshd):

$ procsystime -Tn sshd

Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes sshd,

         SYSCALL          TIME (ns)
           umask               9804
         setpgrp              12111
             nfs              12194
        pathconf              12973
           chdir              13656
        setregid              20676
        setreuid              22364
      getdents64              27036
       getgroups              28605
        lwp_self              29808
      getsockopt              30959
          setgid              31365
           alarm              31507
            zone              31691
          setuid              33861
      setsockopt              39311
         setegid              39660
      setcontext              40061
         seteuid              40149
           lseek              41430
             dup              43978
         c2audit              44604
     getsockname              51978
         privsys              52974
         waitsys              56247
          getgid              57050
          accept              59959
            fsat              76925
       setgroups              79791
         tasksys              81609
      systeminfo              91019
       sysconfig             112749
        recvfrom             127964
          access             138435
            pipe             144371
     getpeername             157330
          fxstat             175541
        schedctl             182224
           vfork             217932
          putmsg             241482
         connect             301860
       sigaction             317411
        shutdown             328316
             brk             356044
       so_socket             409152
          getpid             484735
           fcntl             494322
           gtime             526637
          stat64             539840
          llseek             624945
     resolvepath             678157
          open64             715602
          getuid             950119
         fstat64             964132
           ioctl            1171727
           xstat            1278052
         memcntl            1394735
            send            1846685
           close            2325223
            open            2685141
            mmap            5289087
          munmap            5379678
     lwp_sigmask            6178493
           exece           11787526
          doorfs           28604988
           write           46083911
           fork1           57233817
            read           96877372
         pollsys        25533333727
          TOTAL:        25811904817

The output contains the name of the system call in the left-hand column, and the elapsed time spent in that system call, in nanoseconds, in the right-hand column. Since this is elapsed time, it includes time spent blocked, which is why pollsys dominates the list. There are additional options to display the number of calls to each system call (“-c”) or on-CPU time instead of elapsed time (“-o”), and you can filter by process ID with “-p” if you want to measure a specific process. If you are running Solaris 10 and haven’t downloaded the DTraceToolkit, I highly recommend doing so!
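
Raw nanosecond counts are hard to compare at a glance, so I sometimes turn them into percentages of the total. Below is a minimal sketch of that post-processing. syscall_share is a hypothetical helper of my own, it assumes the two-column layout and the TOTAL: line produced by “-T”, and it is fed the last few rows of the output above so it can be run anywhere:

```shell
# Read "procsystime -T" style output on stdin and print each system
# call's share of the total time, largest first.
syscall_share() {
    awk '
        $1 == "TOTAL:"               { total = $2; next }
        NF == 2 && $2 ~ /^[0-9]+$/   { ns[$1] = $2 }   # "name  nanoseconds" rows
        END {
            if (total == 0) exit 1                     # no TOTAL: line seen
            for (name in ns)
                printf "%6.2f%% %s\n", 100 * ns[name] / total, name
        }
    ' | sort -rn
}

# Rows copied from the end of the output above.
syscall_share <<'EOF'
           write           46083911
           fork1           57233817
            read           96877372
         pollsys        25533333727
          TOTAL:        25811904817
EOF
```

For the sshd sample, pollsys works out to roughly 98.9% of the elapsed time, which says the processes spent nearly the whole sampling window waiting for I/O rather than doing work.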