Why the ext3 fsck’s after X days or Y mounts?

Reading through my RSS feeds, I came across a blog post describing a Linux administrator using tune2fs to disable the “please run fsck on this file system after X days or Y mounts” check.

I’ve got to admit, this is kind of annoying. I’ve taken production-critical Linux boxes down for maintenance, only to have the downtime extended by 15-30 minutes because the file system was configured to run a fsck. Googling this topic even turns up other administrators resorting to all sorts of stupid tactics to avoid the fsck on reboot.

Is there really any value in having fsck run after some period of time? On Unix-based systems (and even on Windows), fsck (or chkdsk) normally only runs when the kernel notices that a file system is in some sort of inconsistent state. So why did the Linux community decide to run fsck on file systems that are in a consistent state? ZFS has a “scrub” operation that can be run against a dataset, but that works by comparing block-level checksums. Ext2/3, ReiserFS, and XFS don’t maintain block-level checksums (btrfs does), so why the need to run fsck after some period of time? Does running fsck give folks the warm n’ fuzzies that their data is clean, or is there some deeper technical reason why this is scheduled? If you have any answers / historical data, please feel free to share. =)
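For reference, here’s how to view and disable these checks with tune2fs (the device name below is just an example; adjust for your system, and note you’ll need root privileges):

$ tune2fs -l /dev/sda1 | egrep -i "mount count|check"
$ tune2fs -c 0 -i 0 /dev/sda1

The first command prints the current maximum mount count and check interval, and the second disables both (“-c 0” turns off the mount-count-based check, and “-i 0” turns off the time-based check).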

Anatomy of Linux Kernel Shared Memory

Slashdot covered a post from IBM developerWorks that I thought was worth passing along here. The Linux 2.6.32 kernel performs memory de-duplication, a.k.a. KSM (kernel shared memory). It’s a great read if you’re interested in the advancements in Linux virtualization.

http://www.ibm.com/developerworks/linux/library/l-kernel-shared-memory/index.html


Memory de-duplication in the Linux kernel
M. Tim Jones, Independent author
Summary: Linux® as a hypervisor includes a number of innovations, and one of the more interesting changes in the 2.6.32 kernel is Kernel Shared Memory (KSM). KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages. Explore the ideas behind KSM (such as storage de-duplication), its implementation, and how you manage it.
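To get a feel for the management side the article describes, here’s a minimal sketch of the sysfs interface KSM exposes (assuming a 2.6.32+ kernel built with CONFIG_KSM=y; run as root):

# echo 1 > /sys/kernel/mm/ksm/run                   (start the KSM scanner thread)
# echo 100 > /sys/kernel/mm/ksm/pages_to_scan       (pages scanned per wake-up)
# echo 20 > /sys/kernel/mm/ksm/sleep_millisecs      (delay between scan passes)
# grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing

Keep in mind that only memory regions an application registers with madvise(MADV_MERGEABLE) are candidates for merging, which is what qemu-kvm does for guest memory.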

Slides from my CentOS cluster server presentation

The fine folks at ALE invited me back this month to give a presentation on the CentOS cluster server. I had a blast presenting, and want to thank everyone for coming out. The slide deck I used is now up on my website, and later this week I will post links to the cluster configuration file I discussed.

Converting time since the epoch to a human readable string

I was parsing some NetBackup logs today, and needed a way to convert the time since the epoch into a human-readable string. A while back I read about the various forms of input that can be passed to GNU date’s “-d” option, one of these being the time since the epoch (prefixed with “@”):

$ date -d @1263408025
Wed Jan 13 13:40:25 EST 2010

This solved my issue, though unfortunately it’s not super portable. Food for thought.
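If you need something more portable, here are a couple of alternatives I’ve used (a minimal sketch, reusing the same epoch value from above). BSD-derived systems (including Mac OS X) accept the seconds directly via the “-r” option, and Perl is installed almost everywhere:

$ date -r 1263408025
Wed Jan 13 13:40:25 EST 2010

$ perl -e 'print scalar localtime(1263408025), "\n"'
Wed Jan 13 13:40:25 2010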

Solaris reporting multiple devices sharing IRQ assignments

One of my co-workers this week was fighting a disk failure on a Solaris 10 x86 host. While checking /var/adm/messages, I came across something interesting.

Apr 11 03:29:21 sinatra.fatkitty.com nge: [ID 801725 kern.info] NOTICE: nge1: Using FIXED interrupt type
Apr 11 03:29:21 sinatra.fatkitty.com unix: [ID 954099 kern.info] NOTICE: IRQ20 is being shared by drivers with different interrupt levels.
Apr 11 03:29:21 sinatra.fatkitty.com This may result in reduced system performance.
Apr 11 03:29:21 sinatra.fatkitty.com mac: [ID 469746 kern.info] NOTICE: nge1 registered

Weird. On x86 hardware, IRQs are assigned to devices, and this usually happens at the BIOS level. Anyways, I was kind of curious as to which devices were sharing IRQ20. We can invoke the kernel debugger and run a dcmd to find this out.

root@sinatra:/var/adm# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba fcp fctl lofs zfs random nfs md cpc fcip crypto logindmux ptm ]
> ::interrupts -d
IRQ  Vector IPL Bus   Type  CPU Share APIC/INT# Driver Name(s)
3    0xb0   12  ISA   Fixed 3   1     0x0/0x3   asy#1
4    0xb1   12  ISA   Fixed 3   1     0x0/0x4   asy#0
9    0x81   9   PCI   Fixed 1   1     0x0/0x9   acpi_wrapper_isr
16   0x60   6   PCI   Fixed 7   1     0x0/0x10  bge#0
17   0x62   6   PCI   Fixed 7   1     0x0/0x11  bge#1
20   0x63   6   PCI   Fixed 2   2     0x0/0x14  nge#1, nv_sata#0
21   0x20   1   PCI   Fixed 4   1     0x0/0x15  ehci#0
22   0x21   1   PCI   Fixed 5   1     0x0/0x16  ohci#0
23   0x61   6   PCI   Fixed 0   1     0x0/0x17  nge#0
24   0x82   7         MSI   6   1     -         pcie_pci#3
25   0x83   7         MSI   6   1     -         pcie_pci#3
160  0xa0   0         IPI   ALL 0     -         poke_cpu
192  0xc0   13        IPI   ALL 1     -         xc_serv
208  0xd0   14        IPI   ALL 1     -         kcpc_hw_overflow_intr
209  0xd1   14        IPI   ALL 1     -         cbe_fire
210  0xd3   14        IPI   ALL 1     -         cbe_fire
240  0xe0   15        IPI   ALL 1     -         xc_serv
241  0xe1   15        IPI   ALL 1     -         apic_error_intr

Bleh. So, the SATA controller and the nVidia gigabit NIC in port 1 are both sharing IRQ20. We’re using nge1 on this host for IPMP. Usually we just stick with the Broadcom NICs, but this is one of those one-off cases: we wanted to run IPMP across the different chipsets to maximize redundancy on the host.

Anywho, the output above shows IRQ20 bound to CPU2, so we can poke at it with intrstat to see the interrupt activity happening on that processor. Here, I invoke intrstat to output just CPU2 at 10-second intervals.

root@sinatra:/var/adm# intrstat -c 2 10

      device |      cpu2 %tim
-------------+---------------
       asy#1 |         0  0.0
       bge#0 |         0  0.0
      ehci#0 |         0  0.0
       nge#0 |         0  0.0
       nge#1 |       101  1.0
   nv_sata#0 |       101  0.0

      device |      cpu2 %tim
-------------+---------------
       asy#1 |         0  0.0
       bge#0 |         0  0.0
      ehci#0 |         0  0.0
       nge#0 |         0  0.0
       nge#1 |       158  1.5
   nv_sata#0 |       158  0.0
      ohci#0 |         0  0.0

      device |      cpu2 %tim
-------------+---------------
       asy#1 |         0  0.0
       bge#0 |         0  0.0
      ehci#0 |         0  0.0
       nge#0 |         0  0.0
       nge#1 |        99  1.0
   nv_sata#0 |        99  0.0
      ohci#0 |         0  0.0

      device |      cpu2 %tim
-------------+---------------
       asy#1 |         0  0.0
       bge#0 |         0  0.0
      ehci#0 |         0  0.0
       nge#0 |         0  0.0
       nge#1 |       108  1.0
   nv_sata#0 |       108  0.0
      ohci#0 |         0  0.0

      device |      cpu2 %tim
-------------+---------------
       asy#1 |         0  0.0
       bge#0 |         0  0.0
      ehci#0 |         0  0.0
       nge#0 |         0  0.0
       nge#1 |       160  1.5
   nv_sata#0 |       160  0.0
      ohci#0 |         0  0.0
^C

So, there’s really minimal interrupt activity going on here. The application on this host runs entirely out of a ramdisk, and the primary IPMP NIC is bge0, so the two devices really aren’t competing for resources. But if nge1 were super busy and the disk was too, this shared IRQ could become a bottleneck. Anyways, I thought it was interesting. I learned a new mdb dcmd and had one of those rare use cases for poking at intrstat.