CFengine 3 Tutorial -- Part 1 -- System Architecture

This article was posted by Matty on 2010-07-02 10:54:00 -0400 -0400

I recently stood up a CFengine 3 configuration management infrastructure and took notes during the process to share with my team. This was my first attempt at using CFengine, so hopefully this multi-part overview will help others trying to bootstrap their environments as well. Many of these notes were taken from the CFengine 3 reference manual and tutorial found on the docs website here. There is some excellent documentation on the CFengine.org so if you have more questions about something specific, be sure to check out the reference manuals! Neil Watson has also compiled an excellent tutorial on his CFengine 3 setup. I organized some of the structure of my config files from his examples. There is also theCFengine help mailing list. You can browse thearchives through the web here. Some of the details in the following documentation (building software, SMF scripts) may be Solaris 10 specific as that was the platform I was working with.

High Level Architecture and Objectives What are some examples of what CFEngine can do?

Performing post-installation tasks such as configuring the network interface.
Editing system configuration files and other files.
Creating symbolic links.
Checking and correcting file permissions and ownership.
Deleting unwanted files.
Compressing selected files.
Distributing files within a network.
Automatically mounting NFS file systems.
Verifying the presence and integrity of important files and file systems.
Executing commands and scripts.
Applying security-related patches and similar system corrections.
Managing system server processes.
Makes sandwiches via sudo.

Fundamental concepts, rules, and terms CFEngine uses.

Host: Generally, a host is a single computer that runs an operating system like UNIX, Linux, or Windows. We will sometimes talk about machines too, and a host can also be a virtual machine supported by an environment such as VMware or Xen/Linux.
Policy: This is a specification of what we want a host to be like. Rather than be in any sort of computer program, a policy is essentially a piece of documentation that describes technical details and characteristics. Cfengine implements policies that are specified via directives.
Configuration: The configuration of a host is the actual state of its resources
Operation: A unit of change is called an operation. CFEngine deals with changes to a system, and operations are embedded into the basic sentences of a cfengine policy. They tell us how policy constrains a host — in other words, how we will prevent a host from running away.
Convergence: An operation is convergent if it always brings the configuration of a host closer to its ideal state and has no effect if the host is already in that state.
Classes: A class is a way of slicing up and mapping out the complex environment of one or more hosts in to regions that can then be referred to by a symbol or name. They describe scope: where something is to be constrained.
Autonomy: No cfengine component is capable of receiving information that it has not explicitly asked for itself.
Scalable distributed action: Each host is responsible for carrying out checks and maintenance on/for itself, based on its local copy of policy.
The fact that each cfengine agent keeps a local copy of policy (regardless of whether it was written locally or inherited from a central authority) means that cfengine will continue to function even if network communications are down.

Critical CFEngine Daemons and Commands

cf-agent: Interprets policy promises and implements them in a convergent manner. The agent fetches data from cf-servd running on the Master Policy Servers.
cf-execd: Executes cf-agent and logs its output (optionally sending a summary via email). It can be run in daemon (standalone) mode. We have configured Solaris’ SMF to keep cf-execd online, which drives cf-agent.
cf-serverd: Monitors the cfengine port: serves file data to cf-agent. Every bit of data that we transfer between cf-agent and cf-serverd is encrypted.
cf-monitord: Collects statistics about resource usage on each host for anomaly detection purposes. The information is made available to the agent in the form of cfengine classes so that the agent can check for and respond to anomalies dynamically.
cf-key: Generates public-private key pairs on a host. You normally run this program only once, as part of the cfengine software installation process.

On a client system, cf-agent will be executed automatically by the cf-execd daemon; the latter also handles logging during cf-agent runs. In addition, operations such as file copying between hosts are initiated by cf-agent on the local system, and they rely on the cf-serverd daemon on the Master Policy Server to obtain remote data.**

High Level Architecture of pushing configurations

SVN becomes the source of truth for CFEngine. The Architecture we are using will allow us to start with only one “Master Policy Server” or “Distribution Server” per site, but we can easily scale to multiple machines if wanted.

A cron entry on the Master Policy Server will check the SVN repository at svn:/// every minute. If a updated configuration is detected, it will download the client configurations into /var/cfengine/masterfiles on the Master Policy Server.
Depending upon the value configured for “splaytime” on the clients, they will check in randomly over a given period of, say, 10 minutes. The new policy file that was downloaded to /var/cfengine/masterfiles will be served by cf-serverd on the Master Policy Server and transferred (with encryption) to the client by the cf-agent command and pulled into /var/cfengine/inputs.
The client runs the cf-execd daemon through SMF. The cf-execd daemon peridoically wakes up to execute cf-agent which runs the policies in /var/cfengine/inputs. If a new policy was transferred to the client, cf-agent will execute it.

The data flow on performing a change is as follows: Pushing Configuration Changes**

I make a config change on my local machine and push to SVN. push —-> SVN
Updated configuration detected. Download changes via cron script into /var/cfengine/masterfiles on policy server <—- pull from SVN
Policy Server running cf-serverd now has updated configurations in /var/cfengine/masterfiles to push to clients <—— pull from SVN from cron script
Clients running cf-execd daemon execute cf-agent based upon schedule (by default every 5 minutes)
cf-agent looks at configured “splaytime” variable to figure out how long to wait before contacting cf-serverd. (compute hash and randomly check in over interval) This random “back off” time keeps the master policy server from being hammered all at once by thousands of clients. If we randomly check in over a 10 minute interval, then we have less bursts of network i/o, etc… **6. cf-agent contacts cf-serverd running on Master Policy Server(s) and pulls updated policies / configs / etc via encrypted link. This happens via execution of failsafe.cf and update.cf <—— pull from Master Policy Servers. Clients pull. Servers don’t “push”. Changes are done on the client opportunistically. If the network is down, nothing happens on the clients. The next time the client can contact the Master Policy Server, the change is executed.
cf-agent executes policies via promises.cf. Changes happen on the client here.**
cf-execd records details of the execution of promises.cf and records what happened into /var/cfengine/outputs.
cf-monitord records behavior of the machine and records details in /var/cfengine/reports
cf-execd kept running / monitored by Solaris SMF on client.
cf-monitord kept running / monitored by Solaris SMF on client.
cf-report ran manually through the CLI. cf-report analyzes data collected by cf-monitord in /var/cfengine/reports. Outputs to html / text / XML / etc…
Predefined schedule of XXX minutes passes again and cf-execd executes cf-agent again. Repeat from step 4.

Why does everything reside in /var/cfengine? How is CFengine resilient to failures?

Cfengine likes to keep itself as resilient as possible. Some environments have /usr/local NFS mounted, so /var/cfengine was chosen as it was pretty much guaranteed to be kept locally on disk.

Binaries that get executed reside in /var/cfengine/bin. Pristine copies of binaries reside in /var/cfengine/sbin. Every time cf-agent executes failsafe.cf (which calls update.cf), it verifies that the MD5 digest of the binaries in /var/cfengine/bin match /var/cfengine/sbin. If they don’t match, permissions have changed, ownership, etc. then they will automatically be copied from /var/cfengine/sbin to /var/cfengine/bin. This is a fail safe protection mechanism that will attempt to have CFEngine automatically recover itself from some sort of corruption.
If you look at the “Part 2 — How I compiled CFEngine” page, you’ll see that we manually changed some configurations in the Makefile. This was to ensure that libpcre, libgcc.so.1, and libcrypto.a were statically compiled into the CFEngine client binaries. We dont want to have CFEngine rely on software under /usr/sfw/lib or /usr/local/lib – its completely self contained in /var/cfengine (other than general system libraries.)
cf-agent actually gets executed twice on each run. The first run is to update all policy files via execution of failsafe.cf from the master policy server, but not to actually execute the policies. The second run executes promises.cf and really performs the changes. We modify promises.cf. We never modify failsafe.cf or update.cf once in production.
This allows us to have syntax errors in promises.cf, but allow the clients to recover themselves in an automated fashion. If promises.cf is corrupt, we can’t actually execute policies. But if failsafe.cf and update.cf are in a good state, the clients will continue to poll the master policy server for updated copies of files.
We can correct promises.cf from our syntax error — clients will pull the updated and corrected promises.cf, and the auto-recovery process of the configs is complete.
If you break failsafe.cf or update.cf on the clients, then the clients will have to be touched manually to recover. Don’t modify these configurations once in a production environment — or be extremely careful to test your changes if you absolutely must.

Getting an accurate view of process memory usage on Linux hosts

This article was posted by Matty on 2010-07-02 08:06:00 -0400 -0400

Having debugged a number of memory-related issues on Linux, one thing I’ve always wanted was a tool to display proportional memory usage. Specifically, I wanted to be able to see how much memory was unique to a process, and have an equal portion of shared memory (libraries, SMS, etc.) added to this value. My wish came true a while back when I discovered the smem utility. When run without any arguments, smem will give you the resident set size (RSS), the unique set size (USS) and the proportional set size (PSS) which is the unique set size plus a portion of the shared memory that is being used by this process. This results in output similar to the following:

$ smem -r

PID User Command Swap USS PSS RSS
3636 root /usr/lib/virtualbox/Virtual 0 1151596 1153670 1165568
3678 matty /usr/lib64/firefox-3.5.9/fi 0 189628 191483 203028
5779 root /usr/bin/python /usr/bin/sm 0 38584 39114 40368
1847 root /usr/bin/Xorg :0 -nr -verbo 0 34024 35874 92504
4103 matty pidgin 0 19364 21072 32412
3825 matty gnome-terminal 0 12388 13242 21992
3404 matty python /usr/share/system-co 0 11596 12622 19216
3710 matty gnome-screensaver 0 9872 10287 14640
3283 matty nautilus 0 7104 8373 18484
3263 matty gnome-panel 0 5828 6731 15780

To calculate the portion of shared memory that is being used by each process, you can add up the shared memory per process (you would probably index this by the type of shared resource), the number of processes using these pages, and then divide the two values to get a proportional value of shared memory per process. This is a very cool utility, and one that gets installed on all of my systems now!

Getting DNS ping (aka nsping) to compile on Linux hosts

This article was posted by Matty on 2010-07-01 11:07:00 -0400 -0400

While debugging a DNS issue this week, I wanted to run my trusty old friend nsping on my Linux desktop. I grabbed the source from the FreeBSD source site, checked to make sure the bits were legit, then proceeded to compile it:

$ make

cc -g -c -o nsping.o nsping.c
In file included from nsping.c:13:
nsping.h:45: error: conflicting types for ‘dprintf’
/usr/include/stdio.h:399: note: previous declaration of ‘dprintf’ was
here
nsping.c:613: error: conflicting types for ‘dprintf’
/usr/include/stdio.h:399: note: previous declaration of ‘dprintf’ was
here

Erf! The source archive I downloaded didn’t compile, and from the error message it appears the function definition for dprintf conflicts with a function definition in libc. Instead of mucking around with map files, I changed all occurrences of dprintf to ddprintf. When I ran make again I got a new error:

$ make

cc -g -c -o nsping.o nsping.c
cc -g -c -o dns-lib.o dns-lib.c
dns-lib.c: In function ‘dns_query’:
dns-lib.c:22: warning: incompatible implicit declaration of built-in
function ‘memset’
cc -g -c -o dns-rr.o dns-rr.c
dns-rr.c: In function ‘dns_rr_query’:
dns-rr.c:26: warning: return makes integer from pointer without a cast
cc -g -o nsping nsping.o dns-lib.o dns-rr.o
dns-rr.o: In function `dns_string':
/home/matty/Download/nsping-0.8/dns-rr.c:63: undefined reference to
`__dn_comp'
collect2: ld returned 1 exit status

This error message indicates that the dn_comp symbol couldn’t be resolved. This function typically resides in the resolver library, so working around this was as easy as assing “-lresolv” to the LIBS variable in the nsping Makefile:

LIBS = -lresolv

Once this change was made, everything compiled and ran flawlessly:

$ make

cc -g -o nsping nsping.o dns-lib.o dns-rr.o -lresolv

$ ./nsping -h

./nsping: option requires an argument -- 'h'
nsping [ -z <zone> | -h <hostname> ] -p <port> -t <timeout>
-a <local address> -P <local port>
-T <type> <-r | -R, recurse?>

Debugging this issue was a bunch of fun, and reading through the nsping source code was extremely educational. Not only did I learn more about how libresolv works, but I found out sys5 sucks:

#ifdef sys5
#warning "YOUR OPERATING SYSTEM SUCKS."

You gotta love developers who are straight and to the point! ;)

Ridding your Solaris host of zombie processes

This article was posted by Matty on 2010-06-30 17:28:00 -0400 -0400

We encountered a nasty bug in our backup software this week. When this bug is triggered, each job (one process is created per job) that completes will turn into a zombie. After a few days we will have hundreds or even thousands of zombie processes, which if left unchecked will eventually lead to the system-side process table filling up. Solaris comes with a nifty tool to help deal with zombies (no, they don’t ship you a shotgun with your media kit), and it comes by the name preap. To use preap, you can pass it the PID of the zombie process you want to reap:

$ ps -ef | grep defunct

root 646 426 0 - ? 0:00 <defunct>
root 1489 12335 0 09:32:54 pts/1 0:00 grep defunct

$ preap 646

646: exited with status 0

This will cause the process to exit, and the kernel can then free up the resources that were allocated by that process. On a related note, if you haven’t seen the movie zombieland you are missing out!!!! That movie is hilarious!

Sorting data by dates, numbers and much much more

This article was posted by Matty on 2010-06-24 08:56:00 -0400 -0400

Every year or two I try to re-read manual pages and documentation about my favorite UNIX tools (bash, awk, sed, grep, etc.). Each time I do this I pick up some cool new nugget of information, and refresh my mind on things that I may have forgot. While reading through an article on sort, I came across the following note about the sort “-k” (field to sort by) option:

“Further modifications of the sorting algorithm are possible with these options: -d (use only letters, digits, and blanks for sort keys), -f (turn off case recognition and treat lowercase and uppercase characters as identical), -i (ignores non-printing ASCII characters), -M (sorts lines using three-letter abbreviations of month names: JAN, FEB, MAR, …), -n (sorts lines using only digits, -, and commas, or other thousands separator). These options, as well as -b and -r, can be used as part of a key number, in which case they apply to that key only and not globally, like they do when they are used outside key definitions."*

This is crazy useful, and I didn’t realize sort could be used to sort by date. I put this to use today, when I had to sort a slew of data that looked similar to this:

Jun 10 05:17:47 some_data_string
May 20 05:17:48 some_data_string2
Jun 17 05:17:49 some_data_string0

I was able to first sort by the month, and then by the day of the month:

$ awk '{printf "%-3s %-2s %-8s %-50s ", 1, 2, 3, 4 }' data | sort -k1M -k2n

May 17 05:17:49 some_data_string0
Jun 01 05:17:47 some_data_string
Jun 20 05:17:48 some_data_string2

Awesome stuff, and I will definitely be using this again in the future!!!