Berkeley Packet Filter support in OpenSolaris

While catching up with e-mail this morning, I noticed that the OpenSolaris community is planning to integrate the Berkeley Packet Filter into OpenSolaris:

“This case seeks to build on the Crossbow (PSARC/2006/357[7]) infrastructure
and provide a new (to OpenSolaris) mechanism for capturing packets: the
use of the Berkeley Packet Filter (BPF). The goal of this project is to
provide a method to capture packets that has higher performance than
what we have to offer today on Solaris (DLPI based schemes.) It also has
the added benefit of increasing our compatibility with other software
that has been built to use BPF.”

This is awesome news, and each month it seems there are fewer and fewer packages I have to bolt on to my OpenSolaris installations. Nice!

Renaming network interfaces on OpenSolaris hosts

While preparing my presentation for the Atlanta UNIX users group this past weekend, I accidentally created a VNIC with the wrong name. Instead of removing it and recreating it, I decided to test out the “rename-link” subcommand that was introduced as part of project Clearview. The rename feature worked swimmingly, and with a single dladm command I was able to rename the vnic6 interface to vnic5:

$ dladm create-vnic -l e1000g0 vnic6

$ dladm rename-link vnic6 vnic5

$ dladm show-vnic

LINK         OVER         SPEED  MACADDRESS           MACADDRTYPE         VID
vnic1        switch0      0      2:8:20:b:5f:1        random              0
vnic5        e1000g0      1000   2:8:20:56:3a:82      random              0
vnic3        switch0      0      2:8:20:23:6c:7e      random              0
vnic4        switch1      0      2:8:20:90:d9:26      random              0
vnic2        switch1      0      2:8:20:ea:7:18       random              0
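
For comparison, the delete-and-recreate approach I avoided would have looked something like this sketch (same dladm subcommands, just more typing):

$ dladm delete-vnic vnic6
$ dladm create-vnic -l e1000g0 vnic5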



I am super impressed with the work that the Sun networking team has done to simplify network interface management!

OpenSolaris network virtualization slides

I gave a talk tonight on network virtualization at the Atlanta UNIX users group. The talk focused on the OpenSolaris network virtualization project (project Crossbow), and described Crossbow and how to use it. I posted the slide deck I used on my website, so you can grab the presentation and the latest Nevada build and start creating virtual networks today!

Viewing and changing network device properties on Solaris hosts

Project Brussels in OpenSolaris revamped how link properties are managed, and the push to get rid of ndd and device-specific property interfaces is now well underway! Link properties are actually pretty cool, and they can be displayed with the dladm utility's “show-linkprop” subcommand:

$ dladm show-linkprop e1000g0

LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
e1000g0      speed           r-   0              0              -- 
e1000g0      autopush        --   --             --             -- 
e1000g0      zone            rw   --             --             -- 
e1000g0      duplex          r-   half           half           half,full 
e1000g0      state           r-   down           up             up,down 
e1000g0      adv_autoneg_cap rw   1              1              1,0 
e1000g0      mtu             rw   1500           1500           -- 
e1000g0      flowctrl        rw   bi             bi             no,tx,rx,bi 
e1000g0      adv_1000fdx_cap r-   1              1              1,0 
e1000g0      en_1000fdx_cap  rw   1              1              1,0 
e1000g0      adv_1000hdx_cap r-   0              1              1,0 
e1000g0      en_1000hdx_cap  r-   0              1              1,0 
e1000g0      adv_100fdx_cap  r-   1              1              1,0 
e1000g0      en_100fdx_cap   rw   1              1              1,0 
e1000g0      adv_100hdx_cap  r-   1              1              1,0 
e1000g0      en_100hdx_cap   rw   1              1              1,0 
e1000g0      adv_10fdx_cap   r-   1              1              1,0 
e1000g0      en_10fdx_cap    rw   1              1              1,0 
e1000g0      adv_10hdx_cap   r-   1              1              1,0 
e1000g0      en_10hdx_cap    rw   1              1              1,0 
e1000g0      maxbw           rw   --             --             -- 
e1000g0      cpus            rw   --             --             -- 
e1000g0      priority        rw   high           high           low,medium,high 



As you can see in the above output, the typical speed, duplex, mtu, and flowctrl properties are listed. In addition to those, the “maxbw” and “cpus” properties that were introduced with the recent Crossbow putback are visible. The “maxbw” property is especially useful, since it allows you to limit how much bandwidth is available to an interface. Here is an example that caps bandwidth for an interface at 2Mb/s:

$ dladm set-linkprop -p maxbw=2m e1000g0
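
Before testing the cap, you can verify that it took effect by querying the property directly; this is just the same show-linkprop subcommand from above, narrowed to a single property:

$ dladm show-linkprop -p maxbw e1000g0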

To see how this operates, you can use your favorite data transfer client:

$ scp techtalk1* 192.168.1.10:
Password:
techtalk1.mp3 5% 2128KB 147.0KB/s 04:08 ETA

The read/write link properties can be changed on the fly with dladm, so increasing the “maxbw” property will allow the interface to consume additional bandwidth:

$ dladm set-linkprop -p maxbw=10m e1000g0

Once the bandwidth is increased, you can immediately see this reflected in the data transfer progress:

techtalk1.mp3 45% 17MB 555.3KB/s 00:38 ETA
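
When you are done experimenting, the cap can be cleared by resetting the property back to its default (for “maxbw” the default is “--”, i.e. no limit). A quick sketch:

$ dladm reset-linkprop -p maxbw e1000g0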

Project Brussels rocks, and it’s awesome to see that link properties are going to be managed in a standard, uniform way going forward! Nice!

**** UPDATE ****

I incorrectly stated that the Clearview project was responsible for this awesome work, when in fact network interface property unification is part of the Brussels project. The original post was updated to reflect this.

IPMP rearchitecture bits now in Nevada build 107

The long-awaited IPMP rearchitecture bits just got integrated into OpenSolaris build 107, on top of the Crossbow work. A new command, ipmpstat, has been introduced.
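
The design describes several views that the new tool exposes. Here is a quick sketch of the invocations I plan to try once I am running build 107; the flags below reflect my reading of the design and may differ slightly in the final bits:

$ ipmpstat -g    # group view: the state of each IPMP group
$ ipmpstat -a    # address view: which data address is currently bound where
$ ipmpstat -i    # interface view: per-interface state within each group
$ ipmpstat -p    # probe view: in.mpathd probe traffic
$ ipmpstat -t    # target view: the probe targets in use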

If you use IPMP in production, take a look at the rearchitecture here. Peter’s documentation on the high-level design is quality stuff. The excerpt below was taken from page 3.

3 IPMP Rearchitecture: Basic Operation
3.1 IPMP Network Interface
As previously discussed, the lion’s share of the problems with IPMP stem from not treating each
IPMP group as its own IP interface. As an example, a typical two-interface IPMP group today
looks like:
ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 129.146.17.56 netmask ffffff00 broadcast 129.146.17.255
groupname a
ce0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.146.17.55 netmask ffffff00 broadcast 129.146.17.255
ce1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 129.146.17.58 netmask ffffff00 broadcast 129.146.17.255
groupname a
ce1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 129.146.17.57 netmask ffffff00 broadcast 129.146.17.255

The above output shows ce0 and ce1 with test addresses, and ce0:1 and ce1:1 with data addresses. If ce0 subsequently fails, the data address on ce0:1 will be migrated to ce1:2, but the
test address will remain on ce0 so that the interface can continue to be probed for repair. In the future, this will instead look like:
ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 129.146.17.56 netmask ffffff00 broadcast 129.146.17.255
groupname a
ce1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 129.146.17.58 netmask ffffff00 broadcast 129.146.17.255
groupname a
ipmp0: flags=801000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 4
inet 129.146.17.55 netmask ffffff00 broadcast 129.146.17.255
groupname a
ipmp0:1: flags=801000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 4
inet 129.146.17.57 netmask ffffff00 broadcast 129.146.17.255

That is, all of the IP data addresses associated with the IPMP group will instead be hosted on an IPMP IP interface, such as ipmp0. With this new model, data addresses will no longer be associated with any specific physical underlying interface, but instead will belong to the IPMP group as a whole. As will become clear, this addresses many outstanding problems and vastly simplifies the implementation.

There will be a one-to-one correspondence between IPMP groups and IPMP interfaces. That is, each IPMP group will have exactly one IPMP interface. By default, each IPMP interface will be named ipmpN, but administrators will be encouraged to specify a name of their choosing, as described in section 4.1.5. Since an IPMP interface’s name will not be fixed, the system will set a new IPMP flag on all IPMP interfaces to indicate that the interface has special properties and semantics, as detailed throughout this document.

A completely (local) diskless datacenter with iSCSI

Being able to boot a machine from SAN isn’t exactly a new concept.  Instead of having local hard drives in thousands of machines, each machine logs into the fabric and boots the OS from a LUN exported over Fibre Channel on the SAN.  This requires a little bit of configuration on the Fibre Channel HBA, but it has the advantage that you no longer have to deal with local disk failures.

iSCSI boot support was integrated into OpenSolaris Nevada build 104 for x86 platforms.

If you have a capable NIC, you can achieve the same “boot from SAN” results as Fibre Channel, but without the cost of an expensive Fibre Channel SAN.  Think of the possibilities here:

Implement a new “Amber Road” Sun Storage 7000 series NAS device like the 7410 exporting hundreds of iSCSI targets, one for each of your machines, back them with ZFS volumes, and leverage ZFS snapshots, clones, etc. with your iSCSI root file system volumes.  Even if your “client” machine mounts a UFS root filesystem over iSCSI, the backend would be a ZFS volume.
Want to provision 1000 machines in a day?  Build one box, ZFS snapshot/clone the volume, and create 1000 iSCSI targets (a rough sketch of this workflow appears below).  Now the only work comes in configuring the OpenSolaris iSNS server with initiator/target pairings.  Instant OS provisioning from a centrally managed location.

Implement two Sun Storage 7410s with clustering, and now you have an HA solution for all of the operating systems running in your datacenter.
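
To make the provisioning workflow above a bit more concrete, here is a rough sketch of what the target side could look like if you built it yourself on a stock OpenSolaris box (rather than a 7000-series appliance), using the legacy shareiscsi ZFS property. The pool name, volume names, and sizes are all made up for illustration:

# create a "golden" root volume and snapshot it once the image is populated
$ zfs create -V 20g tank/golden-root
$ zfs snapshot tank/golden-root@gold

# clone the golden image for a new client and export the clone over iSCSI
$ zfs clone tank/golden-root@gold tank/host001-root
$ zfs set shareiscsi=on tank/host001-root

# verify that a target was created for the clone
$ iscsitadm list target

Repeat the clone and share steps (scripted, of course) for each of the 1000 clients, and the remaining work is pointing the initiators at their targets.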

This is some pretty cool technology.  Now you have only one place to replace failed disks, instead of thousands, at a fraction of the cost it would take to implement this on a Fibre Channel fabric!  Once this technology works out the kinks and becomes stable, this could be the future of server provisioning and management.