You can disable hardware directly from the OBP (OpenBoot PROM) with the “asr” commands. If a production-critical machine won’t boot because of a failed component, you can disable that component from the OBP and get the machine back up (although crippled) to minimize your production downtime.
Rebooting with command: boot
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0 File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
ERROR: Last Trap: Corrected ECC Error
{3} ok
YIKES!@#$! We have a memory failure.
The OBP keyword “sifting” will search through all of the commands the OBP knows for a particular string. So to search for all of the commands that contain asr:
{3} ok sifting asr
In vocabulary srassembler
(f001d858) rdasr (f001d550) wrasr (f001d53c) rdasr
In vocabulary forth
(f008ee08) asr-list-keys (f008ed2c) asr-enable (f008ebd8) asr-disable (f008d22c) .asr (f008cb50) asr-clear (f0052240) asr-policies
So the main commands here are asr-list-keys (show what we can disable), .asr (show what we already have disabled), asr-enable, asr-disable, and asr-clear.
{3} ok asr-list-keys
key = net2&3 /pci@1f,700000/pci@0/pci@2/pci@0/@4
key = net0&1 /pci@1e,600000/pci@0/pci@1/pci@0/@4
key = ide /pci@1f,700000/pci@0/pci@1/pci@0/@1f
key = usb /pci@1f,700000/pci@0/pci@1/pci@0/@1c
key = pci7 /pci@1f,700000/pci@0/@9
key = pci6 /pci@1e,600000/pci@0/@9
key = pci5 /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci4 /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci3 /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci2 /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci1 /pci@1f,700000/pci@0/@8
key = pci0 /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
key = cpu2-bank2
key = cpu2-bank1
key = cpu2-bank0
key = cpu1-bank3
key = cpu1-bank2
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank3
key = cpu0-bank2
key = cpu0-bank1
key = cpu0-bank0
Since we have an ECC memory error, we know the problem is in one of the memory banks listed above. By disabling the memory banks one CPU at a time, we can find the failed memory by trial and error.
{3} ok .asr
There are no devices disabled by ASR.
Disabling the banks on CPUs 0 through 2 kept hitting the ECC memory error. Let’s disable the banks on CPU 3.
{3} ok asr-disable cpu3-bank0
{3} ok asr-disable cpu3-bank1
{3} ok asr-disable cpu3-bank2
{3} ok asr-disable cpu3-bank3
{3} ok .asr
cpu3-bank3 Disabled by USER No reason given
cpu3-bank2 Disabled by USER No reason given
cpu3-bank1 Disabled by USER No reason given
cpu3-bank0 Disabled by USER No reason given
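The asr-disable commands above follow a simple pattern, so when you are working through the CPUs by trial and error it can help to generate them rather than type them. Here is a minimal sketch in Python, assuming four banks per CPU as shown in the asr-list-keys output. The helper name asr_commands is made up for illustration, and the script only prints text to paste at the ok prompt; it does not talk to the OBP.

```python
def asr_commands(cpu, banks=4, action="disable"):
    """Build the OBP commands that disable (or enable) every bank of one CPU.

    Assumption: keys are named cpu<N>-bank<M>, as in the asr-list-keys
    output for this V445. Other platforms may use different key names.
    """
    if action not in ("disable", "enable"):
        raise ValueError("action must be 'disable' or 'enable'")
    return ["asr-%s cpu%d-bank%d" % (action, cpu, bank) for bank in range(banks)]


# Print the commands for CPU 3, ready to paste at the ok prompt.
for command in asr_commands(3):
    print(command)
```

Calling asr_commands(3, action="enable") produces the matching asr-enable lines for putting the banks back once the bad memory has been replaced.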
And let’s boot the machine:
Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx
Ethernet address 0:14:4f:xx:xx:xx, Host ID: xxxxxxx
NOTICE: CPU 3 has 8192/8192 MB of memory disabled
ERROR: The following devices are disabled: cpu3-bank3 cpu3-bank2 cpu3-bank1 cpu3-bank0
Thanks for telling me!
Rebooting with command: boot -rsv
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0 File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
module /platform/sun4u/kernel/sparcv9/unix: text at [0x1000000, 0x107a767] data at 0x1800000
module misc/sparcv9/krtld: text at [0x107a768, 0x10933af] data at 0x184c760
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10933b0, 0x11f0f17] data at 0x1852040
module /platform/SUNW,Sun-Fire-V445/kernel/misc/sparcv9/platmod: text at [0x11f0f18, 0x11f1817] data at 0x18a45e0
module /platform/sun4u/kernel/cpu/sparcv9/SUNW,UltraSPARC-IIIi: text at [0x11f1880, 0x120278f] data at 0x18a4e80
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Ethernet address = 0:14:4f:2b:ea:aa
mem = 25165824K (0x600000000)
avail mem = 25226371072
root nexus = Sun Fire V445
YAY! Our gimpy machine is going back into production minus 8 GB of memory. There will be a performance impact from running on fewer system resources, but something is better than nothing.
There is quite a bit of documentation around the internet on the Linux boot process, but I think Gustavo Duarte did an excellent job of describing it clearly and concisely. He also links to the Linux kernel source code and describes step by step what happens from the bootstrap phase all the way to the execution of /sbin/init.
His first entry lays the foundation by covering the x86 Intel chipset, memory map, and logical motherboard layout. This provides a basic understanding of traditional motherboard implementations.
Next, he describes BIOS initialization and the loading of the MBR. This briefly touches on the boot loader, which starts the Linux bootstrap phase.
Finally, the kernel boot process is detailed with links to C and Assembly source code, with a brief narrative of exactly what is happening.
This was an awesome description of the early startup and initialization phases of the hardware and the bootstrapping of the OS. Gustavo also provides a great description of the real-mode and protected-mode CPU states.
Thanks Gustavo!
I recently encountered a bug in one of the Linux utilities I was using, and upgrading to the latest version of the utility appeared to fix the issue. Being the curious guy I am, I started poking around the web and various release notes to see when the issue was fixed. While digging through this information, I came across the SUPER handy yum changelog plugin. This nifty plugin will display the changes that have occurred to a package, along with the version those changes were incorporated into. To use the changelog plugin, you first need to install it:
$ yum install yum-changelog
After the plugin is installed, you can add the “--changelog” argument to the yum command line to view the changelog for that package:
$ yum update kernel --changelog
Loading "installonlyn" plugin
Loading "changelog" plugin
Setting up Update Process
Setting up repositories
other.xml.gz 100% |=========================| 1.1 MB 00:08
##################################################
361/361
other.xml.gz 100% |=========================| 5.3 MB 00:42
##################################################
462/462
other.xml.gz 100% |=========================| 7.1 MB 00:15
##################################################
2400/2400
other.xml.gz 100% |=========================| 145 B 00:00
Reading repository metadata in from local files
Resolving Dependencies
--> Populating transaction set with selected packages. Please wait.
---> Downloading header for kernel to pack into transaction set.
kernel-2.6.18-53.1.14.el5 100% |=========================| 258 kB 00:00
---> Package kernel.i686 0:2.6.18-53.1.14.el5 set to be installed
--> Running transaction check
Changes in packages about to be updated:
kernel - 2.6.18-53.1.14.el5.i686
Wed Mar 5 17:00:00 2008 Karanbir Singh <kbsingh@centos.org>
- Change gpg key to CentOS
Tue Feb 19 17:00:00 2008 Anton Arapov <aarapov@redhat.com>
[2.6.18-53.1.14.el5]
- merge from 2.6.18-53.1.13 to 2.6.18-53.1.12
- [nfs] potential file corruption issue when writing (Jeff Layton )
[432078]
- [ppc] chrp: fix possible strncmp NULL pointer usage (Vitaly
Mayatskikh ) [396821]
- [isdn] i4l: fix memory overruns (Vitaly Mayatskikh ) [425171]
- [isdn] fix possible isdn_net buffer overflows (Aristeu Rozanski )
[392151] {CVE-2007-6063}
- [mm] hugepages: leak due to pagetable page sharing (Larry Woodman )
[431522]
- [net] NULL dereference in iwl driver (Vitaly Mayatskikh ) [401421]
{CVE-2007-5938}
- [misc] Denial of service with wedged processes (Jerome Marchand )
[221403]
- [xen] ia64: hvm guest memory range checking (Jarod Wilson ) [408701]
....
This is an incredibly useful feature, especially if you are trying to track down when a specific bug was fixed by a given Linux distribution. Rock on!
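If you are hunting for security fixes in particular, the changelog text is easy to scan with a short script. Here is a minimal sketch in Python that works on text you have already captured, for example by redirecting the yum output to a file. The function names are made up for illustration, and the parsing assumes the layout seen in the excerpt above: each item starts with "- " and may wrap onto following lines.

```python
import re

# Excerpt copied from the yum --changelog output shown above.
CHANGELOG = """\
- [nfs] potential file corruption issue when writing (Jeff Layton )
[432078]
- [isdn] fix possible isdn_net buffer overflows (Aristeu Rozanski )
[392151] {CVE-2007-6063}
- [net] NULL dereference in iwl driver (Vitaly Mayatskikh ) [401421]
{CVE-2007-5938}
"""


def changelog_entries(text):
    """Join wrapped lines so that each '- ' item becomes one string."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            entries.append(line[2:])
        elif line and entries:
            # Continuation of the previous item.
            entries[-1] += " " + line
    return entries


def cve_entries(text):
    """Return (cve_id, entry) pairs for every item that mentions a CVE."""
    pairs = []
    for entry in changelog_entries(text):
        for cve in re.findall(r"CVE-\d{4}-\d+", entry):
            pairs.append((cve, entry))
    return pairs


for cve, entry in cve_entries(CHANGELOG):
    print(cve, "->", entry)
```

Joining the wrapped lines first matters because the CVE tag often lands on the continuation line, away from the description it belongs to.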
Eric Schrock has done some really cool work integrating disk (SMART) and platform monitoring (IPMI) information into OpenSolaris. Just recently, he extended FMA with support for SES (SCSI Enclosure Services) in build 93 of OpenSolaris.
This looks like some really cool stuff. The following examples, taken directly from his blog, use the new fmtopo utility to map out an external storage array.
$ /usr/lib/fm/fmd/fmtopo
...
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0
hc://:product-id=SUN-Storage-J4400:chassis-id=22029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1
...
$ fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
ASRU fmri dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
label string SCSI Device 0
FRU fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0809QCK012
server-id string
group: io version: 1 stability: Private/Private
devfs-path string /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
devid string id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
phys-path string[] [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
group: storage version: 1 stability: Private/Private
logical-disk string c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
manufacturer string SEAGATE
model string ST37500NSSUN750G 0720A0PC3X
serial-number string 5QD0PC3X
firmware-revision string 3.AZK
capacity-in-bytes string 750156374016
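The hc:// strings in the fmtopo output pack a lot into one line, so it can help to pull them apart when scripting against them. Here is a minimal sketch in Python that splits one into its authority fields and its path. The layout is inferred only from the output above, not from the FMRI specification, so treat the function (parse_hc_fmri is a made-up name) as an illustration that ignores escaping and other edge cases.

```python
def parse_hc_fmri(fmri):
    """Split an hc:// FMRI into (authority dict, list of (name, instance)).

    Assumption: authority fields are ':'-separated key=value pairs and the
    path is a '/'-separated list of name=instance pairs, as in the fmtopo
    output above.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc:// FMRI: %r" % fmri)
    # Everything up to the first "/" is the authority; the rest is the path.
    authority_part, _, path_part = fmri[len(prefix):].partition("/")

    authority = {}
    for field in authority_part.split(":"):
        if field:  # skip the empty string before the leading ":"
            key, _, value = field.partition("=")
            authority[key] = value

    path = []
    for segment in path_part.split("/"):
        if segment:
            name, _, instance = segment.partition("=")
            path.append((name, int(instance)))
    return authority, path


FMRI = ("hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012"
        ":server-id=:serial=5QD0PC3X"
        ":part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK"
        "/ses-enclosure=0/bay=0/disk=0")

authority, path = parse_hc_fmri(FMRI)
print(authority["serial"], path)
```

With the pieces separated, it is straightforward to group disks by enclosure and bay, or to match a serial number against what the storage group reports.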
Dennis Clarke blogged an introduction to OpenSolaris 2008.05 and IPS, and explained how using ZFS (and beadm) for your root file system provides advantages with system upgrades and multiple root file systems. Take a look at his blog post here if you haven’t yet seen IPS on OpenSolaris. A lot of people are really glad to see the Solaris package / patch system being revamped, as it has needed some attention for some time.
Speaking of Dennis and OpenSolaris, if you haven’t ever performed a complete build, he has another post here showing the entire build process of OpenSolaris.
Thanks Dennis! Your excitement around OpenSolaris rocks. And thanks for Blastwave. =)