CPU Idling Driver for Solaris 2.5.1 (i386)

Continuing my messing around with Solaris 2.5.1, I started digging into why the VM constantly burns host CPU time. The reason, as you may already suspect, is that when Solaris idles, it never actually halts the CPU, so the host never gets it back. I've written a driver to fix this, which can be found on my GitHub. I've also made a binary package.

It's intriguing that if you have only one processor, Solaris's idle thread by default simply calls an empty subroutine to idle the CPU. Perhaps platform drivers were expected to handle this in a platform-specific way. Who knows? Anyhow, here's the pertinent part of the idle thread:

; idle_cpu variable exported by the kernel.
idle_cpu: dd generic_idle_cpu
 
; Empty idle routine
generic_idle_cpu:
    push ebp
    mov ebp, esp
    leave
    ret
 
; Kernel idle thread routine
idle:
    push ebp
    mov ebp, esp
...
    cmp dword [ncpus], 1
    jne .idle_multiple_cpus

    ; Do we have a runnable thread (i.e., work to do)?
    mov eax, [fs:0]
    cmp dword [eax+cpu.cpu_disp.n_runnable], 0
    jz .call_idle_cpu
 
    ; Switch to the runnable thread.
    call swtch
    jmp .loop_if_1_cpu
 
.call_idle_cpu:
    call [idle_cpu]
 
.loop_if_1_cpu:
    cmp dword [ncpus], 1
    je idle_loop
 
.idle_multiple_cpus:
    ...
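
Reading the uniprocessor branch as C makes the problem easier to see. The sketch below is a user-space model, not Solaris source: ncpus, n_runnable, swtch, idle_cpu and generic_idle_cpu mirror the listing, while the counters and the idle_pass() helper are hypothetical instrumentation added so the behavior can be observed.

```c
#include <assert.h>

/* User-space model of the uniprocessor branch of the idle thread.
 * Names mirror the listing; the counters are hypothetical
 * instrumentation, not Solaris kernel state. */
static int ncpus = 1;
static int n_runnable;      /* models cpu_disp.n_runnable */
static int idle_calls;      /* times idle_cpu() was called */
static int switches;        /* times swtch() was called */

/* Empty in the kernel: it returns at once, so the loop just spins. */
static void generic_idle_cpu(void)
{
    idle_calls++;
}

/* The exported hook: a plain function pointer a driver can overwrite. */
static void (*idle_cpu)(void) = generic_idle_cpu;

/* Stand-in for swtch(): pretend the runnable thread ran and finished. */
static void swtch(void)
{
    assert(n_runnable > 0);
    n_runnable--;
    switches++;
}

/* One trip around the loop while ncpus == 1. */
static void idle_pass(void)
{
    if (ncpus != 1)
        return;             /* multiprocessor path not modeled */
    if (n_runnable != 0)
        swtch();
    else
        idle_cpu();         /* call [idle_cpu] */
}
```

With n_runnable set to 2, five calls to idle_pass() switch twice and then call idle_cpu() three times. Because generic_idle_cpu() returns immediately in the real kernel, each of those idle passes is just another spin around the loop.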

If the system contains multiple processors, the idle thread instead calls the set_idlecpu and unset_idlecpu callbacks provided by the PSM (Platform Specific Module) driver. These do nothing either, because the pcplusmp module doesn't provide any implementation for them.

This means that even when the kernel intends the CPU to be "idle", it's effectively spinning on the dispatcher queue count (the number of runnable threads in the queue). The end result is that Solaris hogs a great deal of CPU time on the host processor. This degrades performance, especially on systems with Hyper-Threading, as does the way Solaris implements locking, but that's a topic for another post. :P

Solving this problem for a single processor is simple. Fortunately, Solaris exports the idle_cpu symbol, so a kernel driver can point it at a routine of its own, and that's precisely what I did. The only difference from generic_idle_cpu is the pushf ; sti ; hlt ; popf sequence in my routine, which translates to:

Save EFLAGS.
Ensure interrupts are enabled.
Halt until the next interrupt.
Restore EFLAGS.
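
As a rough model of what the replacement does, here is the pointer swap and the pushf ; sti ; hlt ; popf sequence rendered in C. EFLAGS and hlt are simulated (sim_eflags, halts), and halt_idle_cpu, idle_install and idle_remove are hypothetical names. In a real module the swap would presumably happen in the standard _init/_fini entry points, and the routine itself would be the four instructions in assembly.

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_IF 0x200u    /* interrupt-enable flag, bit 9 of EFLAGS */

/* Simulated CPU state; hypothetical stand-ins for the real registers. */
static uint32_t sim_eflags;
static int halts;

static void generic_idle_cpu(void) { }              /* the kernel default */
static void (*idle_cpu)(void) = generic_idle_cpu;   /* exported hook */
static void (*saved_idle_cpu)(void);

/* Models pushf ; sti ; hlt ; popf. */
static void halt_idle_cpu(void)
{
    uint32_t saved = sim_eflags;    /* pushf: remember the caller's flags */
    sim_eflags |= EFLAGS_IF;        /* sti: an interrupt must be able to wake us */
    halts++;                        /* hlt: sleep until the next interrupt */
    sim_eflags = saved;             /* popf: restore the caller's IF state */
}

/* Hypothetical install/remove helpers for module load and unload. */
static void idle_install(void)
{
    assert(idle_cpu != halt_idle_cpu);  /* don't install twice */
    saved_idle_cpu = idle_cpu;
    idle_cpu = halt_idle_cpu;
}

static void idle_remove(void)
{
    idle_cpu = saved_idle_cpu;
}
```

The popf at the end is what makes the routine safe to drop in: if the idle thread called it with interrupts disabled, it gets them back disabled, exactly as with the empty default.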

This way, when the CPU gets an interrupt, it services it and then returns to the idle thread, which halts again if there's still nothing to do. Conveniently, Solaris uses the processor's local APIC (or the PIT, if the system lacks a local APIC) to drive the system clock, so you know beforehand that the CPU will wake up at some point. Otherwise, the kernel would just appear to hang the first time it idled the CPU.

Solving this for multiple processors is a bit trickier, but we'll assume that if you have multiple processors, each has its own local APIC. For one, we can't just use poke_cpu() to wake an idle processor, because it makes the APIC raise the same interrupt as the system clock. Since the clock ISR calls into the dispatcher, doing so creates a race between a processor holding the dispatcher lock and another processor trying to switch to a runnable thread (which also needs the dispatcher lock).

Besides, as far as I can tell, the dispatcher never calls poke_cpu(), and the unset_idlecpu() callback appears to be called only from the idle thread. If that's true, then either unset_idlecpu() is useless in the first place, or somebody forgot to call it from the dispatcher code. :P

The one saving grace of this whole thing is that pcplusmp only uses the APIC timer on the boot processor to feed the clock ISR. The APIC timers on all the other CPUs are ripe for the picking: we can use them to make sure those CPUs wake up every now and again.
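
For reference, the local APIC's LVT timer register sits at offset 0x320 and, per Intel's Software Developer's Manual, holds the interrupt vector in bits 0-7, the mask bit in bit 16 and the timer mode in bits 17-18 (00 = one-shot, 01 = periodic). The helpers below are a hypothetical sketch of the decoding a driver needs to tell the boot processor's clock timer apart from an unused one; the names are illustrative, not pcplusmp's.

```c
#include <assert.h>
#include <stdint.h>

/* LVT timer register fields (local APIC offset 0x320). */
#define LVT_VECTOR_MASK   0xFFu          /* bits 0-7: interrupt vector */
#define LVT_MASKED        (1u << 16)     /* bit 16: 1 = masked */
#define LVT_MODE_MASK     (3u << 17)     /* bits 17-18: timer mode */
#define LVT_MODE_ONESHOT  (0u << 17)
#define LVT_MODE_PERIODIC (1u << 17)

/* True if the timer runs in periodic mode; in the scheme described
 * above, that marks the boot processor's clock timer. */
static int lvt_is_periodic(uint32_t lvt)
{
    return (lvt & LVT_MODE_MASK) == LVT_MODE_PERIODIC;
}

static uint8_t lvt_vector(uint32_t lvt)
{
    return (uint8_t)(lvt & LVT_VECTOR_MASK);
}

/* Build an unmasked one-shot entry for the given vector. */
static uint32_t lvt_oneshot(uint8_t vector)
{
    assert(vector >= 16);   /* vectors 0-15 are illegal for the local APIC */
    return LVT_MODE_ONESHOT | vector;
}

/* Set the mask bit so the timer can no longer interrupt. */
static uint32_t lvt_mask(uint32_t lvt)
{
    return lvt | LVT_MASKED;
}
```

For example, a value of 0x000200D0 decodes as periodic on vector 0xD0 (an arbitrary example vector), while the reset value of an LVT entry, 0x00010000 (masked, one-shot), fails the periodic test.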

So what the driver does in this case is:

Save EFLAGS and disable interrupts.
If the local APIC timer is not in periodic mode, we know that this CPU is not the boot processor, so we set up a one-shot timer.
Re-enable interrupts and halt, just as we would on a single-CPU system.
Mask the timer we just set up, so it doesn't fire again.
Restore EFLAGS.
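
Putting those steps together, here is a sketch of the multiprocessor idle routine run against a mocked local APIC. The register offsets (0x320 for the LVT timer, 0x380 for the initial count) and bit positions come from Intel's documentation; the mock APIC page, the simulated EFLAGS, and the WAKE_VECTOR and WAKE_COUNT values are assumptions for illustration, not what the driver actually uses. A real implementation would presumably also program the divide configuration register and use a vector that has a handler installed.

```c
#include <assert.h>
#include <stdint.h>

/* Local APIC timer registers (byte offsets into the APIC page). */
#define APIC_LVT_TIMER   0x320u
#define APIC_TIMER_INIT  0x380u
#define LVT_MASKED       (1u << 16)
#define LVT_PERIODIC     (1u << 17)  /* timer mode bits 17-18 = 01 */
#define EFLAGS_IF        0x200u

/* Hypothetical wakeup settings, not taken from the original driver. */
#define WAKE_VECTOR      0xF0u
#define WAKE_COUNT       100000u

/* Mock APIC page and CPU flags; a driver would use the mapped
 * local APIC and the real EFLAGS instead. */
static uint32_t apic[0x400 / 4];
static uint32_t sim_eflags;
static int halts;

static uint32_t apic_read(uint32_t off)
{
    assert(off < sizeof apic);
    return apic[off / 4];
}

static void apic_write(uint32_t off, uint32_t v)
{
    assert(off < sizeof apic);
    apic[off / 4] = v;
}

static void mp_idle_cpu(void)
{
    uint32_t saved = sim_eflags;                 /* pushf */
    sim_eflags &= ~EFLAGS_IF;                    /* cli */

    /* A non-periodic timer means this is not the boot CPU: arm a one-shot. */
    int armed = 0;
    if (!(apic_read(APIC_LVT_TIMER) & LVT_PERIODIC)) {
        apic_write(APIC_LVT_TIMER, WAKE_VECTOR); /* one-shot, unmasked */
        apic_write(APIC_TIMER_INIT, WAKE_COUNT); /* starts the countdown */
        armed = 1;
    }

    sim_eflags |= EFLAGS_IF;                     /* sti */
    halts++;                                     /* hlt */

    /* Mask the one-shot we armed so it can't fire again later. */
    if (armed)
        apic_write(APIC_LVT_TIMER, apic_read(APIC_LVT_TIMER) | LVT_MASKED);

    sim_eflags = saved;                          /* popf */
}
```

On an application processor the routine arms and later masks a one-shot timer; on the boot processor, whose timer is periodic, it leaves the clock alone and simply halts until the next tick.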

Pretty simple, I'd say, and it works nicely with my VM to boot. Speaking of booting, my guest VM now boots in a third of the time with this driver installed, which further illustrates the performance penalty on processors with Hyper-Threading, like my Core i7.

Tim Hentenaar's Blog