Blog O' Matty


Disable Hardware on SPARC Platforms from the OBP

This article was posted by Matty on 2008-07-25 09:12:00 -0400

You can disable hardware directly from the OpenBoot PROM (OBP) with the ASR (Automatic System Recovery) commands. If a production-critical machine won’t boot because of a failed component, you can disable that hardware from the OBP and get the machine back up (although crippled) to minimize the impact of production downtime.

Rebooting with command: boot
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0  File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
ERROR: Last Trap: Corrected ECC Error

{3} ok

YIKES!@#$!  We have a memory failure.

The OBP keyword “sifting” searches all of the commands the OBP knows for a particular string. So, to search for all of the commands that contain asr:

{3} ok sifting asr
In vocabulary  srassembler
(f001d858) rdasr
(f001d550) wrasr
(f001d53c) rdasr
In vocabulary  forth
(f008ee08) asr-list-keys
(f008ed2c) asr-enable
(f008ebd8) asr-disable
(f008d22c) .asr
(f008cb50) asr-clear
(f0052240) asr-policies

So the main commands here are asr-list-keys (shows what we can disable), .asr (shows what we already have disabled), asr-enable, asr-disable, and asr-clear.

{3} ok asr-list-keys

key = net2&3                /pci@1f,700000/pci@0/pci@2/pci@0/@4
key = net0&1                /pci@1e,600000/pci@0/pci@1/pci@0/@4
key = ide                   /pci@1f,700000/pci@0/pci@1/pci@0/@1f
key = usb                   /pci@1f,700000/pci@0/pci@1/pci@0/@1c
key = pci7                  /pci@1f,700000/pci@0/@9
key = pci6                  /pci@1e,600000/pci@0/@9
key = pci5                  /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci4                  /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci3                  /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci2                  /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci1                  /pci@1f,700000/pci@0/@8
key = pci0                  /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
key = cpu2-bank2
key = cpu2-bank1
key = cpu2-bank0
key = cpu1-bank3
key = cpu1-bank2
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank3
key = cpu0-bank2
key = cpu0-bank1
key = cpu0-bank0

Since we have an ECC memory error, we know the fault is in one of the memory banks listed above. By disabling the memory banks on each CPU, one CPU at a time, we can find the failed memory by trial and error.
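The trial-and-error search is mechanical enough to plan ahead if you have the console output captured (for example, in a serial console log). Below is a minimal Python sketch, not anything the OBP itself provides: it parses `key = ...` lines like the asr-list-keys output above, groups the cpuN-bankM keys by CPU, and prints the asr-enable/asr-disable sequence for each boot attempt, one CPU at a time. The helper names are hypothetical. The final command defaults to boot, as in this post; some platforms may want a reset-all before ASR changes take effect, so check your platform documentation.

```python
import re

# A few lines captured from the asr-list-keys output above.
ASR_KEYS = """\
key = pci0                  /pci@1e,600000/pci@0/@8
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank1
key = cpu0-bank0
"""

# "key = <name>" optionally followed by a device path.
KEY_RE = re.compile(r"^key\s*=\s*(\S+)(?:\s+(\S+))?$")
BANK_RE = re.compile(r"^cpu(\d+)-bank(\d+)$")


def banks_by_cpu(text):
    """Map CPU number -> sorted list of its memory-bank ASR keys."""
    banks = {}
    for line in text.splitlines():
        m = KEY_RE.match(line.strip())
        if not m:
            continue
        bank = BANK_RE.match(m.group(1))
        if bank:
            banks.setdefault(int(bank.group(1)), []).append(m.group(1))
    return {cpu: sorted(keys) for cpu, keys in sorted(banks.items())}


def trial_plan(banks, final_command="boot"):
    """Yield (cpu, commands): one OBP command list per boot attempt.

    Each attempt re-enables the banks disabled by the previous attempt,
    disables every bank on the next CPU, then runs final_command.
    """
    previous = []
    for cpu, keys in banks.items():
        commands = [f"asr-enable {key}" for key in previous]
        commands += [f"asr-disable {key}" for key in keys]
        commands.append(final_command)
        yield cpu, commands
        previous = keys


for cpu, commands in trial_plan(banks_by_cpu(ASR_KEYS)):
    print(f"attempt for cpu{cpu}:", "; ".join(commands))
```

Stop at the first attempt that boots cleanly: the CPU whose banks are disabled at that point is the one holding the bad memory.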

{3} ok .asr
There are no devices disabled by ASR.

Disabling the banks on cpu0, cpu1, and cpu2 (one CPU at a time) still hit the ECC memory error, so let’s disable the banks on CPU 3.

{3} ok asr-disable cpu3-bank0
{3} ok asr-disable cpu3-bank1
{3} ok asr-disable cpu3-bank2
{3} ok asr-disable cpu3-bank3

{3} ok .asr
cpu3-bank3              Disabled by USER No reason given
cpu3-bank2              Disabled by USER No reason given
cpu3-bank1              Disabled by USER No reason given
cpu3-bank0              Disabled by USER No reason given

And let’s boot the machine:

Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx
Ethernet address 0:14:4f:xx:xx:xx, Host ID: xxxxxxx

NOTICE: CPU 3 has 8192/8192 MB of memory disabled

ERROR: The following devices are disabled: cpu3-bank3 cpu3-bank2 cpu3-bank1 cpu3-bank0

Thanks for telling me!

Rebooting with command: boot -rsv
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0  File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
module /platform/sun4u/kernel/sparcv9/unix: text at [0x1000000, 0x107a767] data at 0x1800000
module misc/sparcv9/krtld: text at [0x107a768, 0x10933af] data at 0x184c760
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10933b0, 0x11f0f17] data at 0x1852040
module /platform/SUNW,Sun-Fire-V445/kernel/misc/sparcv9/platmod: text at [0x11f0f18, 0x11f1817] data at 0x18a45e0
module /platform/sun4u/kernel/cpu/sparcv9/SUNW,UltraSPARC-IIIi: text at [0x11f1880, 0x120278f] data at 0x18a4e80
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Ethernet address = 0:14:4f:2b:ea:aa
mem = 25165824K (0x600000000)
avail mem = 25226371072
root nexus = Sun Fire V445

YAY!  Our gimpy machine is going back into production minus 8 GB of memory. There will be a performance impact from running with fewer system resources, but something is better than nothing, right?
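To put the lost memory in perspective, the OBP messages above are enough to quantify it. A quick Python sketch; the regular expressions assume the exact wording in this console log, and the helper names are hypothetical:

```python
import re

# Lines copied from the console output above.
BANNER = "OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx"
NOTICE = "NOTICE: CPU 3 has 8192/8192 MB of memory disabled"


def installed_mb(banner):
    """Installed memory (MB) from the OpenBoot banner line."""
    return int(re.search(r"(\d+) MB memory installed", banner).group(1))


def disabled_mb(notice):
    """Disabled memory (MB) from a 'CPU n has x/y MB of memory disabled' notice."""
    return int(re.search(r"has (\d+)/\d+ MB of memory disabled", notice).group(1))


total = installed_mb(BANNER)
lost = disabled_mb(NOTICE)
print(f"{total - lost} MB usable ({lost / total:.0%} of installed memory disabled)")
# prints "16384 MB usable (33% of installed memory disabled)"
```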

x86 / linux boot process

This article was posted by Matty on 2008-07-16 09:58:00 -0400

There is quite a bit of documentation on the Internet about the Linux boot process, but I think Gustavo Duarte did an excellent job describing it clearly and concisely. He links to the relevant Linux kernel source code and describes, step by step, what happens from the bootstrap phase all the way to the execution of /sbin/init.

His first entry lays the foundation by covering the Intel x86 chipset, the memory map, and the logical motherboard layout. This provides a basic understanding of traditional motherboard hardware implementations.

Next, he describes BIOS initialization and the loading of the MBR (Master Boot Record). This entry briefly touches on the boot loader, which starts the Linux bootstrap phase.
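The MBR is simply the first 512-byte sector of the disk: boot-loader code, then a four-entry partition table at offset 446, then the two-byte signature 0x55 0xAA at offset 510 that the BIOS looks for. A minimal Python sketch that validates that classic layout and decodes the 16-byte partition entries from raw bytes you supply; parse_mbr is a hypothetical helper, not something from Gustavo’s articles:

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE = 446   # offset of the four 16-byte partition entries
SIGNATURE = 510         # offset of the 0x55 0xAA boot signature

# status (1 byte), CHS start (3, skipped), type (1), CHS end (3, skipped),
# first LBA (4, little-endian), sector count (4, little-endian)
ENTRY = struct.Struct("<B3xB3xII")


def parse_mbr(sector):
    """Validate a 512-byte MBR and return its four partition entries."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly 512 bytes")
    if sector[SIGNATURE:SIGNATURE + 2] != b"\x55\xaa":
        raise ValueError("missing 0x55 0xAA boot signature")
    partitions = []
    for i in range(4):
        status, ptype, first_lba, sectors = ENTRY.unpack_from(
            sector, PARTITION_TABLE + i * ENTRY.size)
        partitions.append({
            "bootable": status == 0x80,  # 0x80 marks the active partition
            "type": ptype,
            "first_lba": first_lba,
            "sectors": sectors,
        })
    return partitions
```

Usage, e.g. on a disk image file: `parse_mbr(open("disk.img", "rb").read(512))`.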

Finally, he details the kernel boot process, with links to the C and assembly source code and a brief narrative of exactly what is happening.

This is an awesome description of the early startup and initialization phases of the hardware and the bootstrapping of the OS. Gustavo also provides a great description of the real-mode and protected-mode CPU states.
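In real mode, a physical address is formed from a 16-bit segment and a 16-bit offset as segment × 16 + offset, which is why an x86 PC starts out addressing only about 1 MB. A small Python sketch of that calculation (the function name is hypothetical; protected mode replaces this scheme with segment descriptors and optional paging, which this does not model):

```python
def real_mode_address(segment, offset, a20_enabled=True):
    """Physical address of a real-mode segment:offset pair.

    The CPU shifts the 16-bit segment left by 4 bits (multiplies by 16)
    and adds the 16-bit offset. The result can reach 0x10FFEF; with the
    A20 line disabled, bit 20 is dropped and the address wraps around
    within the first 1 MB.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset are 16-bit values")
    address = (segment << 4) + offset
    return address if a20_enabled else address & 0xFFFFF
```

For example, 0x07C0:0x0000 and 0x0000:0x7C00 both name 0x7C00, the address where the BIOS loads the boot sector.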

Thanks Gustavo!

Viewing the changes that have occurred to an RPM package

This article was posted by Matty on 2008-07-16 09:10:00 -0400

I recently encountered a bug in one of the Linux utilities I was using, and upgrading to the latest version of the utility appeared to fix the issue. Being the curious guy I am, I started poking around the web and various release notes to see when the issue was fixed. While digging through this information, I came across the SUPER handy yum changelog plugin. This nifty plugin will display the changes that have occurred to a package, along with the version those changes were incorporated into. To use the changelog plugin, you first need to install it:

$ yum install yum-changelog

After the plugin is installed, you can add the “--changelog” argument to the yum command line to view the changelog for a package:

$ yum update kernel --changelog

Loading "installonlyn" plugin
Loading "changelog" plugin
Setting up Update Process
Setting up repositories
other.xml.gz 100% |=========================| 1.1 MB 00:08
################################################## 361/361
other.xml.gz 100% |=========================| 5.3 MB 00:42
################################################## 462/462
other.xml.gz 100% |=========================| 7.1 MB 00:15
################################################## 2400/2400
other.xml.gz 100% |=========================| 145 B 00:00
Reading repository metadata in from local files
Resolving Dependencies
--> Populating transaction set with selected packages. Please wait.
---> Downloading header for kernel to pack into transaction set.
kernel-2.6.18-53.1.14.el5 100% |=========================| 258 kB 00:00
---> Package kernel.i686 0:2.6.18-53.1.14.el5 set to be installed
--> Running transaction check

Changes in packages about to be updated:

kernel - 2.6.18-53.1.14.el5.i686
Wed Mar 5 17:00:00 2008 Karanbir Singh <kbsingh@centos.org>
- Change gpg key to CentOS

Tue Feb 19 17:00:00 2008 Anton Arapov <aarapov@redhat.com>
[2.6.18-53.1.14.el5]
- merge from 2.6.18-53.1.13 to 2.6.18-53.1.12
- [nfs] potential file corruption issue when writing (Jeff Layton ) [432078]
- [ppc] chrp: fix possible strncmp NULL pointer usage (Vitaly Mayatskikh ) [396821]
- [isdn] i4l: fix memory overruns (Vitaly Mayatskikh ) [425171]
- [isdn] fix possible isdn_net buffer overflows (Aristeu Rozanski ) [392151] {CVE-2007-6063}
- [mm] hugepages: leak due to pagetable page sharing (Larry Woodman ) [431522]
- [net] NULL dereference in iwl driver (Vitaly Mayatskikh ) [401421] {CVE-2007-5938}
- [misc] Denial of service with wedged processes (Jerome Marchand ) [221403]
- [xen] ia64: hvm guest memory range checking (Jarod Wilson ) [408701]

....

This is an incredibly useful feature, especially if you are trying to track down when a specific bug was fixed by a given Linux distribution. Rock on!
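If you save that output to a file, a few lines of Python can answer “which entry fixed bug X?” for you. A minimal sketch, assuming entry headers look like the ones above (weekday, date, time, year, then the author) and version lines look like [2.6.18-53.1.14.el5]; the helper names are hypothetical and none of this is part of the plugin:

```python
import re

# A trimmed copy of the changelog text printed above.
CHANGELOG = """\
Wed Mar 5 17:00:00 2008 Karanbir Singh <kbsingh@centos.org>
- Change gpg key to CentOS

Tue Feb 19 17:00:00 2008 Anton Arapov <aarapov@redhat.com>
[2.6.18-53.1.14.el5]
- [isdn] fix possible isdn_net buffer overflows (Aristeu Rozanski )
[392151] {CVE-2007-6063}
- [net] NULL dereference in iwl driver (Vitaly Mayatskikh ) [401421]
{CVE-2007-5938}
"""

# Entry headers look like "Tue Feb 19 17:00:00 2008 Name <email>".
HEADER_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4} .*$",
    re.MULTILINE)
# Version lines look like "[2.6.18-53.1.14.el5]"; requiring a dot skips
# bare bug numbers such as "[392151]" that wrapped onto their own line.
VERSION_RE = re.compile(r"^\[(\d+\.\d+[\w.\-]*)\]$", re.MULTILINE)


def split_entries(text):
    """Return (header, body) tuples in the order they were printed."""
    headers = list(HEADER_RE.finditer(text))
    entries = []
    for i, m in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        entries.append((m.group(0), text[m.end():end].strip()))
    return entries


def entries_mentioning(text, needle):
    """(header, version or None) for each entry whose body contains needle."""
    hits = []
    for header, body in split_entries(text):
        if needle in body:
            version = VERSION_RE.search(body)
            hits.append((header, version.group(1) if version else None))
    return hits


print(entries_mentioning(CHANGELOG, "CVE-2007-5938"))
```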

SCSI Enclosure Services

This article was posted by Matty on 2008-07-15 11:18:00 -0400

Eric Schrock has done some really cool work integrating disk (SMART) and platform (IPMI) monitoring information into OpenSolaris. Just recently, he extended FMA (the Fault Management Architecture) with support for SES (SCSI Enclosure Services), which was integrated into build 93 of OpenSolaris.

This looks like some really cool stuff. The following example, taken directly from his blog, uses the fmtopo utility to map out an external storage array:

$ /usr/lib/fm/fmd/fmtopo

...

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1

...
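Each line above is an hc (hardware component) FMRI: colon-separated name=value fields such as product-id and chassis-id, followed by a /-separated path of name=instance pairs. A minimal Python sketch that splits one apart and picks out matching entries. Treating a pattern like the one passed to fmtopo -V below as a shell-style glob via fnmatch is an assumption about fmtopo’s matching rules, and the helper names are hypothetical:

```python
from fnmatch import fnmatchcase

DISK = ("hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012"
        ":server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X"
        ":revision=3.AZK/ses-enclosure=0/bay=0/disk=0")
BAY1 = ("hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012"
        ":server-id=/ses-enclosure=0/bay=1")


def parse_hc_fmri(fmri):
    """Split an hc:// FMRI into ({name: value}, [(component, instance), ...])."""
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc:// FMRI")
    fields, _, path = fmri[len("hc://"):].partition("/")
    attrs = {}
    for field in fields.split(":"):
        if field:  # the field list starts with ':', so the first piece is empty
            name, _, value = field.partition("=")
            attrs[name] = value
    components = []
    for part in path.split("/"):
        name, _, instance = part.partition("=")
        components.append((name, int(instance)))
    return attrs, components


def select(fmris, pattern):
    """Return the FMRIs matching a shell-style glob (assumed semantics)."""
    return [f for f in fmris if fnmatchcase(f, pattern)]


print(parse_hc_fmri(DISK)[1])
print(select([DISK, BAY1], "*/ses-enclosure=0/bay=0/disk=0"))
```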

$ fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'

TIME UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
ASRU fmri dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
label string SCSI Device 0
FRU fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0809QCK012
server-id string
group: io version: 1 stability: Private/Private
devfs-path string /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
devid string id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
phys-path string[] [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
group: storage version: 1 stability: Private/Private
logical-disk string c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
manufacturer string SEAGATE
model string ST37500NSSUN750G 0720A0PC3X
serial-number string 5QD0PC3X
firmware-revision string 3.AZK
capacity-in-bytes string 750156374016
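The -V output is a series of property groups, each followed by lines of name, type, and value. A minimal Python sketch that parses that layout as shown above (one property per line; parse_props is a hypothetical helper) and converts the capacity: 750156374016 bytes is the “750G” in the model string when counted in decimal gigabytes, but only about 698.6 GiB in binary units.

```python
# The storage group from the fmtopo -V output above.
PROPS = """\
group: storage version: 1 stability: Private/Private
logical-disk string c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
manufacturer string SEAGATE
model string ST37500NSSUN750G 0720A0PC3X
serial-number string 5QD0PC3X
firmware-revision string 3.AZK
capacity-in-bytes string 750156374016
"""


def parse_props(text):
    """Return {group: {property: value}} from 'group:' and 'name type value' lines."""
    groups = {}
    current = None
    for line in text.splitlines():
        fields = line.split(None, 2)
        if not fields:
            continue
        if fields[0] == "group:":
            current = groups.setdefault(fields[1], {})
        elif current is not None and len(fields) >= 2:
            # fields are name, type, value; the value may be empty (e.g. server-id)
            current[fields[0]] = fields[2] if len(fields) == 3 else ""
    return groups


storage = parse_props(PROPS)["storage"]
capacity = int(storage["capacity-in-bytes"])
print(f"{storage['model']}: {capacity / 10**9:.1f} GB, {capacity / 2**30:.1f} GiB")
# prints "ST37500NSSUN750G 0720A0PC3X: 750.2 GB, 698.6 GiB"
```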

dennis' experience with opensolaris 2008.05

This article was posted by Matty on 2008-07-11 14:59:00 -0400

Dennis Clarke blogged an introduction to OpenSolaris 2008.05 and IPS, and explained how using ZFS (and beadm) for your root file system provides advantages for system upgrades and for managing multiple root file systems. Take a look at his blog post if you haven’t yet seen IPS on OpenSolaris. A lot of people are really glad to see the Solaris package/patch system being revamped, as it’s needed some attention for quite a while.

Speaking of Dennis and OpenSolaris, if you haven’t ever performed a complete build, he has another post showing the entire OpenSolaris build process.

Thanks Dennis! Your excitement around OpenSolaris rocks.  And thanks for Blastwave.  =)