I recently stood up a CFEngine 3 configuration management infrastructure and took notes during the process to share with my team. This was my first attempt at using CFEngine, so hopefully this multi-part overview will help others trying to bootstrap their environments as well. Many of these notes were taken from the CFEngine 3 reference manual and tutorial on the CFEngine docs website. There is some excellent documentation on CFEngine.org, so if you have more questions about something specific, be sure to check out the reference manuals!
Neil Watson has also compiled an excellent tutorial on his CFEngine 3 setup; I organized some of the structure of my config files from his examples. There is also the CFEngine help mailing list, whose archives you can browse on the web. Some of the details in the following documentation (building software, SMF scripts) may be Solaris 10 specific, as that was the platform I was working with.
High Level Architecture and Objectives
What are some examples of what CFEngine can do?
* Performing post-installation tasks such as configuring the network interface.
* Editing system configuration files and other files.
* Creating symbolic links.
* Checking and correcting file permissions and ownership.
* Deleting unwanted files.
* Compressing selected files.
* Distributing files within a network.
* Automatically mounting NFS file systems.
* Verifying the presence and integrity of important files and file systems.
* Executing commands and scripts.
* Applying security-related patches and similar system corrections.
* Managing system server processes.
* Making sandwiches via sudo.
Fundamental concepts, rules, and terms CFEngine uses.
1. Host: Generally, a host is a single computer that runs an operating system like UNIX, Linux, or Windows. We will sometimes talk about machines too, and a host can also be a virtual machine supported by an environment such as VMware or Xen/Linux.
2. Policy: This is a specification of what we want a host to be like. Rather than being any sort of computer program, a policy is essentially a piece of documentation that describes technical details and characteristics. CFEngine implements policies that are specified via directives.
3. Configuration: The configuration of a host is the actual state of its resources.
4. Operation: A unit of change is called an operation. CFEngine deals with changes to a system, and operations are embedded into the basic sentences of a cfengine policy. They tell us how policy constrains a host — in other words, how we will prevent a host from running away.
5. Convergence: An operation is convergent if it always brings the configuration of a host closer to its ideal state and has no effect if the host is already in that state.
6. Classes: A class is a way of slicing up and mapping out the complex environment of one or more hosts into regions that can then be referred to by a symbol or name. Classes describe scope: where something is to be constrained.
7. Autonomy: No cfengine component is capable of receiving information that it has not explicitly asked for itself.
8. Scalable distributed action: Each host is responsible for carrying out checks and maintenance on/for itself, based on its local copy of policy.
9. The fact that each cfengine agent keeps a local copy of policy (regardless of whether it was written locally or inherited from a central authority) means that cfengine will continue to function even if network communications are down.
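The convergence idea is easiest to see in a tiny promise. The sketch below is illustrative, not from this setup; it assumes the standard CFEngine 3 library (for `append_if_no_line`), and the motd text is made up. The first run appends the line; every later run verifies it is present and does nothing:

```
# Hypothetical convergent promise: ensure a marker line exists in /etc/motd.
# Re-running this never appends a second copy; the host simply stays compliant.
bundle agent motd
{
files:
    "/etc/motd"
      edit_line => append_if_no_line("This host is managed by CFEngine 3");
}
```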
Critical CFEngine Daemons and Commands
1. cf-agent: Interprets policy promises and implements them in a convergent manner. The agent fetches data from cf-serverd running on the Master Policy Servers.
2. cf-execd: Executes cf-agent and logs its output (optionally sending a summary via email). It can be run in daemon (standalone) mode. We have configured Solaris’ SMF to keep cf-execd online, which drives cf-agent.
3. cf-serverd: Listens on the CFEngine port and serves file data to cf-agent. Every bit of data transferred between cf-agent and cf-serverd is encrypted.
4. cf-monitord: Collects statistics about resource usage on each host for anomaly detection purposes. The information is made available to the agent in the form of cfengine classes so that the agent can check for and respond to anomalies dynamically.
5. cf-key: Generates public-private key pairs on a host. You normally run this program only once, as part of the cfengine software installation process.
On a client system, cf-agent will be executed automatically by the cf-execd daemon; the latter also handles logging during cf-agent runs. In addition, operations such as file copying between hosts are initiated by cf-agent on the local system, and they rely on the cf-serverd daemon on the Master Policy Server to obtain remote data.
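When debugging, it helps to bypass cf-execd and drive the agent by hand. The flags below are standard cf-agent options (-K ignores time-based locks, -I reports what changed); the paths assume the default /var/cfengine work directory used throughout this setup:

```shell
# One manual convergent run on a client, reporting what changed:
/var/cfengine/bin/cf-agent -K -I

# Force a policy update first by running the failsafe policy explicitly:
/var/cfengine/bin/cf-agent -f /var/cfengine/inputs/failsafe.cf
```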
High Level Architecture of pushing configurations
Image borrowed from the CFEngine tutorial.
* SVN becomes the source of truth for CFEngine. The architecture we are using allows us to start with only one “Master Policy Server” (or “Distribution Server”) per site, but we can easily scale to multiple machines if needed.
* A cron entry on the Master Policy Server will check the SVN repository at svn:/// every minute. If an updated configuration is detected, it will download the client configurations into /var/cfengine/masterfiles on the Master Policy Server.
* Depending upon the value configured for “splaytime” on the clients, they will check in randomly over a given period of, say, 10 minutes. The new policy files downloaded to /var/cfengine/masterfiles will be served by cf-serverd on the Master Policy Server and transferred (with encryption) by the cf-agent command on each client into /var/cfengine/inputs.
* The client runs the cf-execd daemon through SMF. The cf-execd daemon periodically wakes up to execute cf-agent, which runs the policies in /var/cfengine/inputs. If a new policy was transferred to the client, cf-agent will execute it.
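A minimal sketch of the cron-driven SVN sync described above. The repository URL, script path, and use of `svn checkout`/`svn update` are illustrative assumptions, not the exact script from this environment:

```shell
# Crontab entry on the Master Policy Server: poll SVN every minute.
# * * * * * /var/cfengine/bin/svn-sync.sh

#!/bin/sh
# svn-sync.sh: refresh the policy working copy that cf-serverd exports.
REPO="svn://svnhost/cfengine/trunk"     # hypothetical repository URL
DEST="/var/cfengine/masterfiles"

if [ -d "$DEST/.svn" ]; then
    svn update -q "$DEST"               # working copy exists: pull changes
else
    svn checkout -q "$REPO" "$DEST"     # first run: create the working copy
fi
```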
The data flow on performing a change is as follows:
Pushing Configuration Changes
1. I make a config change on my local machine and push to SVN. (push --> SVN)
2. Updated configuration detected. Download changes via cron script into /var/cfengine/masterfiles on the policy server. (<-- pull from SVN)
3. The Policy Server running cf-serverd now has updated configurations in /var/cfengine/masterfiles to serve to clients. (<-- pull from SVN via cron script)
4. Clients running cf-execd daemon execute cf-agent based upon schedule (by default every 5 minutes)
5. cf-agent looks at the configured "splaytime" variable to figure out how long to wait before contacting cf-serverd (it computes a hash and checks in at a stable, random-looking point within the interval). This "back off" time keeps the Master Policy Server from being hammered all at once by thousands of clients. If clients check in randomly over a 10-minute interval, we get fewer bursts of network I/O, etc.
6. cf-agent contacts cf-serverd running on the Master Policy Server(s) and pulls updated policies / configs / etc. over an encrypted link. This happens via execution of failsafe.cf and update.cf. (<-- pull from Master Policy Servers) **** Clients pull; servers don’t “push”. Changes are applied on the client opportunistically. If the network is down, nothing happens on the clients. The next time a client can contact the Master Policy Server, the change is executed. ****
7. cf-agent executes policies via promises.cf. Changes happen on the client here.
8. cf-execd records details of the execution of promises.cf and logs what happened into /var/cfengine/outputs.
9. cf-monitord records the behavior of the machine and stores details in /var/cfengine/reports.
10. cf-execd is kept running / monitored by Solaris SMF on the client.
11. cf-monitord is kept running / monitored by Solaris SMF on the client.
12. cf-report is run manually through the CLI. cf-report analyzes data collected by cf-monitord in /var/cfengine/reports and outputs to HTML / text / XML / etc.
13. The predefined schedule of XXX minutes passes again and cf-execd executes cf-agent again. Repeat from step 4.
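The splay calculation in step 5 can be illustrated with ordinary shell tools. CFEngine derives the delay from a hash of host identity so each client picks a stable, evenly distributed slot; the hostname and the `cksum`-based hash below are stand-ins for illustration only, not CFEngine's actual hash function:

```shell
# Map a hostname to a deterministic delay within a 10-minute splay window.
host="client01.example.com"     # hypothetical client name
window=600                      # splaytime of 10 minutes, in seconds

# cksum yields a deterministic CRC; taking it modulo the window spreads
# hosts evenly across the interval, and the same host always gets the
# same slot on every run.
hash=$(printf '%s' "$host" | cksum | cut -d' ' -f1)
delay=$(( hash % window ))
echo "host $host sleeps ${delay}s before contacting cf-serverd"
```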
Why does everything reside in /var/cfengine? How is CFengine resilient to failures?
* CFEngine likes to keep itself as resilient as possible. Some environments have /usr/local NFS mounted, so /var/cfengine was chosen because it is pretty much guaranteed to be kept locally on disk.
* Binaries that get executed reside in /var/cfengine/bin. Pristine copies of the binaries reside in /var/cfengine/sbin. Every time cf-agent executes failsafe.cf (which calls update.cf), it verifies that the MD5 digests of the binaries in /var/cfengine/bin match those in /var/cfengine/sbin. If the digests don’t match, or permissions or ownership have changed, the binaries will automatically be copied from /var/cfengine/sbin to /var/cfengine/bin. This is a failsafe protection mechanism that lets CFEngine automatically recover itself from some sort of corruption.
* If you look at the “Part 2 — How I compiled CFEngine” page, you’ll see that we manually changed some configurations in the Makefile. This was to ensure that libpcre, libgcc.so.1, and libcrypto.a were statically compiled into the CFEngine client binaries. We don’t want CFEngine to rely on software under /usr/sfw/lib or /usr/local/lib; it’s completely self-contained in /var/cfengine (other than general system libraries).
* cf-agent actually gets executed twice on each run. The first run is to update all policy files via execution of failsafe.cf from the master policy server, but not to actually execute the policies. The second run executes promises.cf and really performs the changes. We modify promises.cf. We never modify failsafe.cf or update.cf once in production.
* This allows us to survive syntax errors in promises.cf while letting the clients recover themselves in an automated fashion. If promises.cf is corrupt, we can’t actually execute policies. But as long as failsafe.cf and update.cf are in a good state, the clients will continue to poll the Master Policy Server for updated copies of files.
* Once we correct the syntax error in promises.cf, clients will pull the updated and corrected promises.cf, and the auto-recovery of the configs is complete.
* If you break failsafe.cf or update.cf on the clients, then the clients will have to be touched manually to recover. Don’t modify these configurations once in a production environment, or be extremely careful to test your changes if you absolutely must.
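The bin/sbin self-repair described above boils down to a digest comparison and a copy. CFEngine does this internally during the failsafe.cf run; the stand-alone mock below uses a temp directory, `md5sum`, and made-up file contents purely to illustrate the mechanism:

```shell
# Mock of the failsafe binary check: compare digests of the pristine copy
# (sbin) and the live copy (bin); restore the pristine copy on mismatch.
workdir=$(mktemp -d)
mkdir -p "$workdir/sbin" "$workdir/bin"
printf 'pristine\n' > "$workdir/sbin/cf-agent"   # known-good binary
printf 'corrupt\n'  > "$workdir/bin/cf-agent"    # simulated corruption

pristine=$(md5sum "$workdir/sbin/cf-agent" | cut -d' ' -f1)
live=$(md5sum "$workdir/bin/cf-agent" | cut -d' ' -f1)

if [ "$pristine" != "$live" ]; then
    # -p preserves mode and ownership, mirroring the permission checks
    cp -p "$workdir/sbin/cf-agent" "$workdir/bin/cf-agent"
fi

cmp -s "$workdir/sbin/cf-agent" "$workdir/bin/cf-agent" && echo "repaired"
```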