You can disable hardware directly from the OBP with the “asr” commands. If a production-critical machine won’t boot because of a failed component, you can disable that component from the OBP and bring the machine back up (although crippled) to minimize production downtime.
Rebooting with command: boot
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0 File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
ERROR: Last Trap: Corrected ECC Error
{3} ok
YIKES!@#$! We have a memory failure.
The OBP keyword “sifting” will search through all of the commands the OBP knows for a particular string. So to search for all of the commands that contain asr:
{3} ok sifting asr
In vocabulary srassembler
(f001d858) rdasr (f001d550) wrasr (f001d53c) rdasr
In vocabulary forth
(f008ee08) asr-list-keys (f008ed2c) asr-enable (f008ebd8) asr-disable (f008d22c) .asr (f008cb50) asr-clear (f0052240) asr-policies
So the main commands here are asr-list-keys (show what we can disable), .asr (show what is already disabled), asr-enable, asr-disable, and asr-clear.
{3} ok asr-list-keys
key = net2&3 /pci@1f,700000/pci@0/pci@2/pci@0/@4
key = net0&1 /pci@1e,600000/pci@0/pci@1/pci@0/@4
key = ide /pci@1f,700000/pci@0/pci@1/pci@0/@1f
key = usb /pci@1f,700000/pci@0/pci@1/pci@0/@1c
key = pci7 /pci@1f,700000/pci@0/@9
key = pci6 /pci@1e,600000/pci@0/@9
key = pci5 /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci4 /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci3 /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci2 /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci1 /pci@1f,700000/pci@0/@8
key = pci0 /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
key = cpu2-bank2
key = cpu2-bank1
key = cpu2-bank0
key = cpu1-bank3
key = cpu1-bank2
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank3
key = cpu0-bank2
key = cpu0-bank1
key = cpu0-bank0
Since we have an ECC memory error, we know the fault is in one of the memory banks listed above. By disabling the memory banks one CPU at a time, we can find the failed memory by trial and error.
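The trial-and-error pass is easier to drive if the asr-disable lines for each CPU are generated rather than typed by hand. Here is a minimal sketch in Python, meant to run on an admin workstation and not in the OBP itself. The function names banks_by_cpu and disable_commands are hypothetical helpers, and the sample text is an excerpt pasted from the asr-list-keys output above.

```python
import re
from collections import defaultdict

# Excerpt of the asr-list-keys output above: two PCI keys plus every bank key.
ASR_KEYS = """
key = pci1 /pci@1f,700000/pci@0/@8
key = pci0 /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
key = cpu2-bank2
key = cpu2-bank1
key = cpu2-bank0
key = cpu1-bank3
key = cpu1-bank2
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank3
key = cpu0-bank2
key = cpu0-bank1
key = cpu0-bank0
"""


def banks_by_cpu(listing):
    """Group cpuN-bankM keys by CPU number; non-memory keys are ignored."""
    groups = defaultdict(list)
    for key, cpu, bank in re.findall(r"key = (cpu(\d+)-bank(\d+))\b", listing):
        groups[int(cpu)].append((int(bank), key))
    # Sort banks numerically within each CPU, and CPUs numerically overall.
    return {cpu: [key for _, key in sorted(banks)]
            for cpu, banks in sorted(groups.items())}


def disable_commands(listing, cpu):
    """Return the asr-disable lines to type at the ok prompt for one CPU."""
    return [f"asr-disable {key}" for key in banks_by_cpu(listing)[cpu]]


if __name__ == "__main__":
    for line in disable_commands(ASR_KEYS, 3):
        print(line)
```

The script only prints text to type at the ok prompt. It does not talk to the console, so each disable-and-boot attempt is still done by hand.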
{3} ok .asr
There are no devices disabled by ASR.
Disabling the banks on CPU0 through CPU2 still left us hitting the ECC memory error. Let's disable the banks on CPU3.
{3} ok asr-disable cpu3-bank0
{3} ok asr-disable cpu3-bank1
{3} ok asr-disable cpu3-bank2
{3} ok asr-disable cpu3-bank3
{3} ok .asr
cpu3-bank3 Disabled by USER No reason given
cpu3-bank2 Disabled by USER No reason given
cpu3-bank1 Disabled by USER No reason given
cpu3-bank0 Disabled by USER No reason given
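If you capture the console session, a quick check can confirm that every bank you meant to disable really shows up in the .asr listing before you boot. This is a rough sketch in Python, assuming .asr prints one entry per line in the form key, "Disabled by", who, reason. The helper names disabled_devices and missing are hypothetical.

```python
import re

# .asr output from the session above, one entry per line (layout assumed).
ASR_STATUS = """
cpu3-bank3 Disabled by USER No reason given
cpu3-bank2 Disabled by USER No reason given
cpu3-bank1 Disabled by USER No reason given
cpu3-bank0 Disabled by USER No reason given
"""

ENTRY = re.compile(r"\s*(\S+)\s+Disabled by\s+(\S+)\s+(.*\S)\s*$")


def disabled_devices(status):
    """Map each disabled key to a (who, reason) tuple."""
    entries = {}
    for line in status.splitlines():
        match = ENTRY.match(line)
        if match:
            entries[match.group(1)] = (match.group(2), match.group(3))
    return entries


def missing(status, expected):
    """Return the expected keys that the .asr listing does not show."""
    return sorted(set(expected) - set(disabled_devices(status)))


if __name__ == "__main__":
    wanted = [f"cpu3-bank{n}" for n in range(4)]
    print("not yet disabled:", missing(ASR_STATUS, wanted))
```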
And let's boot the machine:
Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx
Ethernet address 0:14:4f:xx:xx:xx, Host ID: xxxxxxx
NOTICE: CPU 3 has 8192/8192 MB of memory disabled
ERROR: The following devices are disabled: cpu3-bank3 cpu3-bank2 cpu3-bank1 cpu3-bank0
Thanks for telling me!
Rebooting with command: boot -rsv
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0 File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
module /platform/sun4u/kernel/sparcv9/unix: text at [0x1000000, 0x107a767] data at 0x1800000
module misc/sparcv9/krtld: text at [0x107a768, 0x10933af] data at 0x184c760
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10933b0, 0x11f0f17] data at 0x1852040
module /platform/SUNW,Sun-Fire-V445/kernel/misc/sparcv9/platmod: text at [0x11f0f18, 0x11f1817] data at 0x18a45e0
module /platform/sun4u/kernel/cpu/sparcv9/SUNW,UltraSPARC-IIIi: text at [0x11f1880, 0x120278f] data at 0x18a4e80
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Ethernet address = 0:14:4f:2b:ea:aa
mem = 25165824K (0x600000000)
avail mem = 25226371072
root nexus = Sun Fire V445
YAY! Our gimpy machine is going back into production minus 8 GB of memory. There will be a performance impact from running on fewer system resources, but something is better than nothing.
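As a sanity check, the memory figures in the console output above can be cross-checked with a few lines of Python. This sketch assumes the OpenBoot banner's "memory installed" figure counts only the banks still enabled, which is what its match with the Solaris "mem =" line suggests. The function name memory_summary is hypothetical.

```python
# Figures copied from the console output above.
OBP_INSTALLED_MB = 24576        # "24576 MB memory installed"
CPU3_DISABLED_MB = 8192         # "CPU 3 has 8192/8192 MB of memory disabled"
SOLARIS_MEM_KB = 25165824       # "mem = 25165824K"
SOLARIS_MEM_BYTES = 0x600000000 # the hex value printed next to it


def memory_summary(installed_mb, disabled_mb):
    """Summarize usable vs. total memory, assuming installed_mb excludes disabled banks."""
    total_mb = installed_mb + disabled_mb
    return {
        "total_gb": total_mb / 1024,
        "usable_gb": installed_mb / 1024,
        "lost_fraction": disabled_mb / total_mb,
    }


# The OBP banner and both Solaris figures describe the same amount of memory.
assert SOLARIS_MEM_KB // 1024 == OBP_INSTALLED_MB
assert SOLARIS_MEM_KB * 1024 == SOLARIS_MEM_BYTES

if __name__ == "__main__":
    print(memory_summary(OBP_INSTALLED_MB, CPU3_DISABLED_MB))
```

Under that assumption the box normally has 32 GB, is now running on 24 GB, and has given up a quarter of its memory to get back into production.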