I recently tracked down the cause of a rather indirect bug, PR pkg/51623:
when running qemu-x86_64 with -smp 4, the additional CPUs don't start.

It's not especially exciting, but I still thought I'd write up the
things I did on the way to finding the cause.


The bug
-------

QEMU is a great emulator, but attempting to run NetBSD in it with -smp 4
would fail. It would boot normally until:
	cpu1: failed to start
	cpu2: failed to start
	cpu3: failed to start

Booting without -smp, or with a single emulated processor, works fine.
Since NetBSD boots on an identical real machine, the cause is likely
within QEMU, but we'd still like it to work. Where do we start?


Setting up
----------

First of all, we'd like to set up an image to be used by QEMU. I
fetched a NetBSD/amd64 ISO image, and ran:
	dd if=/dev/zero of=nbsd.img # until it got large enough (4GB for me)
	qemu -cdrom NetBSD-7.99.73-amd64.iso -hda nbsd.img

And did an install within QEMU.

That got me a working image I could boot with the simple command of
	qemu -hda nbsd.img

Also needed are the NetBSD sources and a build environment. To fetch
the sources and build the toolchain:
	cvs -danoncvs@anoncvs.NetBSD.org:/cvsroot co src
	cd src
	./build.sh -U -u -j10 -m amd64 -O ~/obj tools

At which point you can start building kernels, like the kernel in
sys/arch/amd64/conf/GENERIC, with the following command:
	./build.sh -U -u -j10 -m amd64 -O ~/obj kernel=GENERIC

The most important flag to know here is -u (build incrementally). You
may want to drop it, for example if you edit the kernel config or
change header files, or if you just feel paranoid about your changes
taking effect.


Quickly testing changes
-----------------------

Emulators are awesome testing platforms. You can test your changes very
quickly. To test out a new kernel, we'll first configure a vnd (a disk
device backed by the image file, which can then be mounted):
	# vndconfig vnd0 nbsd.img

Then we can mount it to change files:
	# mount /dev/vnd0a /mnt
	# mv /home/fly/obj/sys/arch/amd64/compile/GENERIC/netbsd /mnt
	# umount /mnt

I chose 'mv' rather than 'cp' because, when testing repeatedly, it
fails if a new kernel hasn't been built yet.

You don't want to manipulate the image while QEMU is using it.
Otherwise you'll experience some weird non-existent files and possible
host crashes.

You'll end up with a dirty filesystem image anyway, from repeatedly
killing QEMU and not running fsck, so that may come back to bite you.

So to test a change I rebuild the kernel, mount the image, mv the kernel
into the image, umount the image, and run qemu.

Time to test a change is a few seconds, perfect for "I am trying to
guess my way in the dark through a problem I don't fully understand".


Actually the bug
----------------

Since QEMU is popular, a number of people mentioned managing to boot
with -smp under Linux KVM with various options, so 'other CPUs in QEMU'
was a good starting point.

Having heard reports that it works on some CPUs, I initially suspected that
the crazy NetBSD/x86 kernel feature of patching the kernel at boot based on
CPU features was behind the problem. Indeed, attempting to emulate an older
CPU (-cpu phenom, but I tried a bunch) did work.

A good code pointer is the string "failed to start". It appears in
sys/arch/x86/x86/cpu.c:774. We know with confidence that whatever went
wrong happened before that point.

The code leading up to it sets CPUF_GO in ci->ci_flags, and then waits,
checking whether CPUF_RUNNING gets set in ci->ci_flags. In our case,
it doesn't. What should set it?

I went looking for the string 'CPUF_RUNNING'. Where it's set isn't obvious:
	atomic_or_32(&ci->ci_flags, CPUF_RUNNING);

within cpu_init, which can only be called from cpu_hatch. But in case that
fails, we can also look for 'CPUF_GO', which is more obvious, within cpu_hatch:

	/*
	 * Wait to be brought online.  Use 'monitor/mwait' if available,
	 * in order to make the TSC drift as much as possible. so that
	 * we can detect it later.  If not available, try 'pause'.
	 * We'd like to use 'hlt', but we have interrupts off.
	 */
	while ((ci->ci_flags & CPUF_GO) == 0) {
		if ((cpu_feature[1] & CPUID2_MONITOR) != 0) {
			x86_monitor(&ci->ci_flags, 0, 0);
			if ((ci->ci_flags & CPUF_GO) != 0) {
				continue;
			}
			x86_mwait(0, 0);
		} else {
			for (i = 10000; i != 0; i--) {
				x86_pause();
			}
		}
	}

Secondary CPUs loop waiting until CPUF_GO is set, and they can continue,
at which point they set CPUF_RUNNING.
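
To make the handshake concrete, here's a toy model of it. This is a sketch
and not kernel code: FAKE_GO, FAKE_RUNNING, model_cpu_start and the notion
of abstract "ticks" are all made up for illustration, and both CPUs are
stepped from a single loop instead of really running in parallel.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_GO      0x1u	/* stand-in for CPUF_GO */
#define FAKE_RUNNING 0x2u	/* stand-in for CPUF_RUNNING */

/*
 * Hypothetical model of the start-up handshake.
 *
 * notice_tick: first tick at which the secondary CPU gets around to
 *              re-checking the flags word.
 * budget:      number of ticks the boot CPU is willing to wait.
 *
 * Returns true if the boot CPU sees FAKE_RUNNING within its budget.
 */
static bool
model_cpu_start(unsigned notice_tick, unsigned budget)
{
	uint32_t flags = 0;

	flags |= FAKE_GO;		/* boot CPU: "you may go" */

	for (unsigned tick = 0; tick < budget; tick++) {
		/* secondary CPU: leaves its wait loop once it sees GO */
		if (tick >= notice_tick && (flags & FAKE_GO) != 0)
			flags |= FAKE_RUNNING;

		/* boot CPU: has the secondary announced itself yet? */
		if ((flags & FAKE_RUNNING) != 0)
			return true;
	}
	return false;			/* "cpuN: failed to start" */
}
```

In this model, model_cpu_start(5, 100) succeeds, while
model_cpu_start(100, 100) fails: the secondary would have set the flag,
just not before the boot CPU gave up on it.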

This code also tests for a CPU feature. Perfect candidate for our bug,
which is CPU-feature-in-emulator-dependent.
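
If you want to see which branch a given (emulated) CPU would take, you can
ask CPUID directly from userland. A small sketch, assuming GCC or Clang on
x86 for the <cpuid.h> helper; ecx_has_monitor and cpu_advertises_monitor
are names I made up. The MONITOR/MWAIT feature is bit 3 of ECX in CPUID
leaf 1, which is the bit the kernel's CPUID2_MONITOR test looks at.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>		/* __get_cpuid(), GCC and Clang only */
#define HAVE_CPUID_H 1
#endif

#define ECX_MONITOR 0x00000008u	/* CPUID leaf 1, ECX bit 3 */

/* Pure helper: does this leaf-1 ECX value advertise MONITOR/MWAIT? */
static bool
ecx_has_monitor(uint32_t ecx)
{
	return (ecx & ECX_MONITOR) != 0;
}

/*
 * Returns 1 if the CPU we run on advertises MONITOR/MWAIT, 0 if it
 * does not, and -1 if we cannot ask (not x86, or no <cpuid.h>).
 */
static int
cpu_advertises_monitor(void)
{
#ifdef HAVE_CPUID_H
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
		return -1;
	return ecx_has_monitor(ecx) ? 1 : 0;
#else
	return -1;
#endif
}
```

Run inside the guest under different -cpu settings, this would show
which emulated models claim the feature.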

Since testing a change is still only a few seconds, I tried to quickly
try both branches by disabling one or the other with #if 0 / #endif. It
didn't immediately work, but I still thought it was a promising candidate,
so I looked at the actual function x86_pause. It's in an assembly file,
sys/arch/amd64/amd64/cpufunc.S:443:
	NENTRY(x86_pause)
		pause
		ret

What is the x86 instruction 'pause' supposed to do? Well, searching
suggested it's equivalent to 'nop', so I figured I'd try a bunch of nops
in place of it, and sure enough, it boots.

So instead of briefly pausing and re-checking the value, the secondary CPU
would get stuck here, not exiting the loop and setting CPUF_RUNNING in
time, leading to the "failed to start" messages we saw.


Not shown in this article
-------------------------

- Having spent some time thinking it's an Intel vs. AMD thing, as 'phenom'
  worked, but no Intel model I tried did. I should've tested more CPUs.
  I attempted to disable various Intel discovery bits and put prints
  along the way, to no avail.
- Looking at the CPU definition in QEMU with no gained benefit.
- Trying to disable parts and all of x86_patch, thinking it is the 'cx8'
  feature, and thinking it doesn't work because parts of the function are
  needed.
- Printing strings in random places, hoping to catch the exact code
  reached just before the hang.
- Sometimes changes would fail to boot or build.
  We default to -Werror, so to avoid errors from things like unused
  variables or functions, I'd add CFLAGS=-Wno-error=unused-function or
  some other such flag to the environment.


Summary
-------

Having previously swapped arguments for -cpu phenom a couple of times
and thinking I got it right, I made sure to triple check things. A clean
kernel build, and clean sources with just that change, and checking I'm
testing the same command as in the bug.

After 45 kernels and several hours, I finally have a kernel booting.

What's next? It was very slow at boot, so it might be critical code.
Monitoring a value sounds a lot better than repeatedly pausing and checking
it, so it's probably what we want to do. Is CPUID2_MONITOR set incorrectly
on NetBSD or QEMU?

Maybe the whole code is broken and happened to not be an issue because
machines with multiple CPUs also had CPUID2_MONITOR.


Update
------

We're checking CPUF_GO every 10,000 pauses. Printing stuff shows it
does run the loop several times, and lowering the number of pause calls
between checks to 10 also made it boot successfully.
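
A back-of-the-envelope way to see why the batch size matters: in the worst
case, CPUF_GO gets set just after a check, so the secondary sits through a
whole batch of pauses before it looks again. The sketch below only does
that multiplication. The function names are mine, and the per-pause cost
and the boot CPU's waiting budget in the note after it are invented
numbers, not measurements.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Worst-case delay before the flag is re-checked: one full batch of
 * pauses, each costing ns_per_pause nanoseconds.
 */
static uint64_t
worst_case_notice_ns(uint64_t pauses_per_check, uint64_t ns_per_pause)
{
	return pauses_per_check * ns_per_pause;
}

/* Does the worst case still fit in the boot CPU's waiting budget? */
static bool
notices_in_time(uint64_t pauses_per_check, uint64_t ns_per_pause,
    uint64_t budget_ns)
{
	return worst_case_notice_ns(pauses_per_check, ns_per_pause)
	    <= budget_ns;
}
```

With a made-up cost of 100,000 ns per emulated pause and a made-up budget
of 100 ms, a batch of 10,000 pauses takes a full second and misses the
budget, while a batch of 10 takes 1 ms and fits easily. On real hardware,
where a pause costs far less, even the large batch would be over quickly.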

What appears to be the QEMU implementation of pause, in
target/i386/misc_helper.c, is:

	static void do_pause(X86CPU *cpu)
	{
	    CPUState *cs = CPU(cpu);

	    /* Just let another CPU run.  */
	    cs->exception_index = EXCP_INTERRUPT;
	    cpu_loop_exit(cs);
	}

Certainly sounds more expensive than nops.

Most emulated CPUs don't claim support for MONITOR, but Phenom does.


Maybe permalink: http://coypu.sdf.org/20170525-qemu-smp