QEMU Hardware Interrupt Simulation

Describes how a QEMU VM simulates hardware interrupts to a Linux guest OS.

Abstract

QEMU is an open source software program providing a machine emulator with two basic modes: QEMU Virtual Machine (also called Full-system emulation) and QEMU User Emulation.

The QEMU host is a physical computer. The QEMU executable runs on the host, providing a simulation environment within which guest software can run. The guest software can be almost any OS or user program on a variety of heterogeneous hardware architectures. This is accomplished in QEMU by translating the guest software instructions into host instructions using the TCG.

Many hosted/cloud systems are based on QEMU/KVM VMs providing guest performance almost equivalent to running the software natively on the host. In fact this paper is being distributed, and was written, on (in?) an Openhosting QEMU/KVM Ubuntu guest. I will not discuss KVM further other than to note that it is a great, possibly the best, option for homogeneous host and guest hardware architectures.

This paper describes how QEMU translates and delivers host events (e.g. ethernet packet, keyboard/mouse input) as hardware interrupts into a VM guest running a Linux kernel.

QEMU Virtual Machine

What does a Virtual Machine do? Essentially it is a software simulation within which target software can run with the expectation that the simulator will react in a manner consistent with physical hardware. This is easier than it sounds but took a great amount of wisdom to develop.

Refer to the VM Configuration below for our simulated hardware: an Intel CPU based on the Bearlake architecture with 4GB RAM, using an ICH9 southbridge chip with a PCI-E bridge, and Intel e1000e ethernet card. The QXL-VGA simulated video has been commented out because it is unnecessary and raised many interrupts - the topic of this paper. The ethernet card could also be commented out but I use SSH a good deal for testing.

With a little more work, I could have run an ARM VM instead; that will be a future paper.

QEMU User Emulation

This is a very useful mode for developing cross-platform application software or building a non-native rootfs.

Start with Debian QEMU User Emulation and Debian CrossDebootstrap for more information.

Glossary and Definitions

  • ACPI: Advanced Configuration and Power Interface
  • D-I: Debian Installer
  • EDK2: EFI Development Kit version 2
  • IRQ: Interrupt Request, a hardware device is requesting service by the CPU, or in an SMP system one of the CPUs.
  • $K: the Linux kernel top of source tree
  • OVMF: Open Virtual Machine Firmware; a very nice port of the EDK to QEMU.
  • $Q_P: the QEMU executable, in my case: /usr/local/bin/qemu-system-x86_64
  • TCG: QEMU Tiny Code Generator
  • TB: TCG Translation Block, a memory area where the TCG writes the guest->host code translations, generally each TB has a maximum of 512 guest instructions.
  • UEFI: Unified Extensible Firmware Interface

Physical Hardware versus QEMU VM

This section discusses the characteristics of using a QEMU VM for developing/analyzing/deploying software.

First the disadvantages,

1. The BIG problem with running a significant payload over heterogeneous architectures is that it is SLOW. Many years ago, I tried to use a QEMU VM (host:X86_64,guest:ARMv7) to run Android apps and gave up. The kernel was relatively responsive but the Java layers were painfully slow. I was much happier developing natively on a Nexus tablet through ADB.

2. A secondary problem with a QEMU VM is that the hardware simulation is only as good as the coder's understanding of the hardware. This can cause problems when the guest kernel driver does not perform as anticipated: is it the driver or the QEMU simulation?

However, QEMU offers some dramatic advantages for development and analysis.

1. The QEMU Monitor can be used to inspect and modify the VM. There is a rich set of info commands to inspect devices, memory, and run state.

2. Runtime logging and trace events are available to easily track the VM runtime execution. See the -d help, -d trace:help and -D logfile commandline arguments for more information.

3. It is very easy to start QEMU under GDB to breakpoint in the VM, especially for device handling only available on physical hardware using an ICE or JTAG probe. Furthermore a custom built kernel with debug symbols can be deeply analyzed using the QEMU gdbstub feature (after adding nokaslr to the GRUB command line.)

4. The QEMU VM is software that can be modified for debugging. One specific example: on a project a long time ago I used QEMU to reverse engineer a touchpad. I modified the QEMU VM to dump the Microsoft driver initialization sequence for the touchpad by using the technique described in Forshee QEMU touchpad. This worked, giving me the blueprint to add the initialization sequence to the Linux touchpad driver.

These advantages make QEMU a great environment for Linux kernel and application analysis and development.

Software Configuration

Host HW/OS

The host is a Dell XPS 15 9570 (12 core i7, 32GB RAM, 1TB SSD) running Ubuntu 18.04.5 (5.4.0-64-generic)

QEMU Emulator

I downloaded and built QEMU Version 5.1.0 in the source tree build area using the following configuration:

cd bin/debug/native
../../../configure --prefix=/opt/qemu \
             --target-list=x86_64-softmmu \
             --enable-debug \
             --enable-spice \
             --enable-virtfs \
             --enable-curses \
             --enable-libusb

VM HW/OS

The VM is a QEMU q35 machine (Intel bearlake architecture) using the QEMU OVMF bootloader with the following -readconfig configuration file.

VM Configuration

This is the QEMU configuration file describing the simulated hardware of the VM.

# qemu config file
# copied from prod_r.ovmf.cfg and modified for kernel gdb
# use the QEMU OVMF code and vars file

[drive "uefi-binary"]
  if = "pflash"
  format = "raw"
  file = "/opt/distros/qemu-5.1.0/pc-bios/edk2-x86_64-code.fd"
  readonly = "on"

[drive "uefi-varstore"]
  if = "pflash"
  format = "raw"
  file = "ovmfvars_cfg.fd"

# built from custom D-I ISO
[drive "disk"]
  format = "raw"
  file = "prod_r.raw"

# Enable QXL video, use $Q_P "-display none" to prevent window display
# 210103: comment out video to prevent qxl interrupts (qxl_update_irq)
#[device "video"]
#   driver = "qxl-vga"
#   bus = "pcie.0"
#   addr = "01.0"

# Create the n1 netdevice and map SSH to host 10022 port
[netdev "n1"]
   type = "user"
   hostfwd = "tcp::10022-:22"

# Use the default e1000e driver for n1 net
[device "net"]
   driver = "e1000e"
   netdev = "n1"
   bus = "pcie.0"
   addr = "02.0"

# accel = "kvm" or "tcg"
[machine]
  type = "q35"
  accel = "tcg"

[memory]
  size = "4096"

# set number of cores,
# 1 is best for GDB debug for a single CPU thread
# >1 is better for SMP work
# the qemu executable has several other threads
[smp-opts]
  cpus = "1"

Notice the following:

1. OVMF is split into a read-only code image and a read-write variable store image. During boot, the OVMF setup can be entered from a console via GRUB and the ROM config modified.

2. The video driver is commented out to reduce interrupts.

3. accel=tcg is used to get a better approximation of interrupt handling.

4. The ethernet card is enabled (which causes many interrupts) for SSH access to the guest.

5. The number of cores is 1 to simplify logic tracing, including reducing the number of pthreads for QEMU GDB.

QEMU Command Line

With the above VM configuration, the QEMU command line looks something like:

$Q_P -nodefaults -display none \
     -readconfig $Q_GIT/qemu_gdb.ovmf.cfg \
     -D /tmp/run1.log \
     -d int \
     -trace events=/tmp/I1/pic.events \
     -serial pty -monitor pty

  • $Q_P: /usr/local/bin/qemu-system-x86_64
  • $Q_GIT: local clone of my QEMU development git repo for the VM Configuration file. BTW, all my git repos are behind an apache server on my Openhosting server.
  • See the QEMU and Kernel Interrupt Diagnostics section for logging and tracing.
  • -serial pty -monitor pty: starts ttyS0 as a pseudo tty and the monitor as another.

On boot up, QEMU reports:

char device redirected to /dev/pts/11 (label compat_monitor0)
char device redirected to /dev/pts/12 (label serial0)

I then connect in a host xterm using screen /dev/pts/11 for the QEMU monitor, and screen /dev/pts/12 for the console. From there I can use screen(1) commands (e.g. C-a H to log the session). Having said that, I prefer SSH to the guest for most activity using the netdev hostfwd mapping from the command line:

host> ssh dave@localhost -p 10022
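
The pts numbers are assigned by the host and generally differ from run to run, so it can help to pull them out of the QEMU startup messages rather than reading them by eye. The following is a minimal sketch and not part of my original setup: pty_for_label is a hypothetical helper name, and it assumes the startup messages have exactly the form shown above.

```shell
# Print the /dev/pts path that QEMU reported for a given chardev label.
# Reads QEMU's startup messages on stdin; $1 is the label (e.g. serial0).
pty_for_label() {
    sed -n "s|^char device redirected to \(/dev/pts/[0-9]*\) (label $1)\$|\1|p"
}

# Sample input: the two lines QEMU reported above.
qemu_banner='char device redirected to /dev/pts/11 (label compat_monitor0)
char device redirected to /dev/pts/12 (label serial0)'

printf '%s\n' "$qemu_banner" | pty_for_label serial0
printf '%s\n' "$qemu_banner" | pty_for_label compat_monitor0
```

With the startup messages captured in a file (qemu.out is an assumed name), screen "$(pty_for_label serial0 < qemu.out)" would open the guest console without looking up the number by hand.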

Guest OS

The guest OS is a mashup of:

  • debian-10.5.0-amd64-netinst.iso network D-I using a custom preseed.cfg. The preseed.cfg file automates the entire installation process, including admin user, package install, timezone, and specialty provisioning commands. The result is a quick and consistent debian root filesystem.
  • GRUB 2.02 image custom-built with EFI support.
  • Linux vanilla kernel 5.0.21 using a custom .config and make -jX LOCALVERSION=dstNN deb-pkg rule. This makes four debian packages:
 11140468 Dec 15 12:13 linux-headers-5.0.21-dst5_5.0.21-dst5-1_amd64.deb
730884844 Dec 15 12:18 linux-image-5.0.21-dst5-dbg_5.0.21-dst5-1_amd64.deb
 50072004 Dec 15 12:13 linux-image-5.0.21-dst5_5.0.21-dst5-1_amd64.deb
  1030904 Dec 15 12:13 linux-libc-dev_5.0.21-dst5-1_amd64.deb

I used xorriso to repack the D-I ISO with the GRUB custom image and preseed.cfg files.

Next step is running qemu-img to create a blank 4GB prod_r.raw disk file.

Then I ran QEMU using the D-I ISO to create the ESP, rootfs and swap partitions in prod_r.raw (see QEMU config above for [drive "disk"].)

After D-I completed successfully and the VM rebooted, I logged in to the guest. I ran the following script to enable ttyS0 and add it as a console in /etc/default/grub. This way I can see all the boot messages on ttyS0.

sudo systemctl start serial-getty@ttyS0.service
sudo systemctl enable serial-getty@ttyS0.service

F=/etc/default/grub
# print boot messages
sudo sed -i -E 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT=""/' $F
# print kernel boot messages on VGA console and serial tty
sudo sed -i -E \
  's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="nokaslr console=tty1 console=ttyS0,9600n8 "/' $F
# display the grub boot menu on both VGA console and ttyS0
sudo sed -i -E 's/#GRUB_TERMINAL=console/GRUB_TERMINAL="console serial"/' $F
sudo update-grub
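
Because sed -i rewrites /etc/default/grub in place, the three substitutions can first be tried on sample text. This is a sketch only: apply_grub_edits is a hypothetical helper, and the three input lines are assumed to match the stock file, as the patterns in the script above imply.

```shell
# Apply the same three substitutions as the script above, but as a filter
# (stdin to stdout) so nothing on disk is modified.
apply_grub_edits() {
    sed -E \
        -e 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT=""/' \
        -e 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="nokaslr console=tty1 console=ttyS0,9600n8 "/' \
        -e 's/#GRUB_TERMINAL=console/GRUB_TERMINAL="console serial"/'
}

# Assumed stock lines, one per substitution.
printf '%s\n' \
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet"' \
    'GRUB_CMDLINE_LINUX=""' \
    '#GRUB_TERMINAL=console' | apply_grub_edits
```

Running apply_grub_edits < /etc/default/grub | diff /etc/default/grub - in the guest shows what the real script would change before anything is written.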

Next I built the kernel packages incrementing LOCALVERSION each time: dst5 in the packages shows this was the fifth kernel build I did (playing around with the kernel config.)

And finally I installed the custom kernel in the guest:

  • scp three of the kernel deb packages to the guest (skipping the large -dbg package).
  • install each using dpkg -i
  • update GRUB_DEFAULT in /etc/default/grub to run the new kernel.

Why is the -dbg package created but not installed? The vmlinux image from the debug build can be used over the QEMU gdbstub connection with the running kernel to match symbols to kernel addresses so one can break/step through kernel code.

The final config looks like this from a guest SSH session:

q35ek:615> sudo fdisk -l
Disk /dev/sda: 4 GiB, 4294967296 bytes, 8388608 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 77A188ED-4F51-4CB7-BD32-A4A26D296F1B

Device       Start     End Sectors  Size Type
/dev/sda1     2048 1050623 1048576  512M EFI System
/dev/sda2  1050624 6299647 5249024  2.5G Linux filesystem
/dev/sda3  6299648 8386559 2086912 1019M Linux swap

q35ek:616> lsb_release -a
No LSB modules are available.
Distributor ID:       Debian
Description:  Debian GNU/Linux 10 (buster)
Release:      10
Codename:     buster

q35ek:617> uname -a
Linux q35ek 5.0.21-dst5 #1 SMP Mon Dec 14 15:52:46 EST 2020 x86_64 GNU/Linux

q35ek:618> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : QEMU Virtual CPU version 2.5+
stepping        : 3
microcode       : 0x1000065
cpu MHz         : 2207.964
cache size      : 512 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
                  pat pse36 clflush mmx fxsr sse sse2 syscall nx lm nopl
                  cpuid pni cx16 hypervisor lahf_lm svm 3dnowprefetch vmmcall
bugs            : fxsave_leak sysret_ss_attrs spectre_v1 spectre_v2
                  spec_store_bypass
bogomips        : 4415.92
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Caveat

Many of the host and guest software steps are not necessary for the objective of this paper. One can create a Debian VM under QEMU using a standard installation ISO and then configure it manually.

However, I already was using the custom preseed Debian Installer, EDK2/OVMF, custom GRUB and custom kernel for other projects, primarily rapid VM setup and driver debugging, so it was just a matter of putting the pieces together. I will say that the custom QEMU build, D-I ISO with custom preseed and the custom kernel build greatly helped to facilitate this paper.

H/W Interrupt Handling in Linux

There is a great deal of outstanding literature on Linux interrupt handling. There is also a great deal of marginal literature on the same topic, which can confuse the researcher. Here is some literature I find accurate:

I will not repeat the information in these detailed documents but, for the purpose of this paper, will give a brief context of how the kernel provisions and then handles a hardware interrupt.

Provisioning an IRQ

An IRQ can be considered a hardware line between a device and a pin on an interrupt controller. Pin 4 on the interrupt controller equates to IRQ 4 (which happens to be a UART). When the IRQ line is asserted, the interrupt controller picks a CPU to service the interrupt and asserts the CPU INTR pin. This is somewhat simplistic for PCI devices, which have another layer, and for newer cards, which have multiple interrupt lines to span multiple CPUs.

  • During kernel boot, the basic IRQ handling framework is set up in init_IRQ, including setting up the Interrupt vector table.
  • Later in the boot process, each device will be probed/initialized and then call request_threaded_irq or a wrapper (e.g. request_irq) to request an IRQ mapping for the device. These functions are usually identified by the __init macro before the function name and sometimes by a device_initcall macro.
  • The request creates a struct irqaction instance, populates its handler field with the function to call when the IRQ is asserted, and attaches it to the struct irq_desc for that IRQ. This handler is also known as the Interrupt Service Routine (ISR) for the device. We will use the name ISR in the next section to refer to the handler function.

Note: it is easy to recognize an interrupt handler/ISR function because it returns irqreturn_t.

There are an increasing number of mechanisms to assign IRQ(s) to a device. The basic objective of each is to match the physical IRQ line into the interrupt controller to a logical IRQ number. These mechanisms will not be discussed in this paper but examples include:

Handling an IRQ

  • The device hardware/firmware completes an operation and asserts its interrupt pin.
  • This goes through one or more interrupt controllers which determine when, and which, CPU will be notified. When ready, the controller will assert the INTR pin to a CPU.
  • The CPU reads the interrupt controller to see which IRQ line is asserted, looks in a vector table for the irq index and calls the corresponding function for the irq index.
  • For most h/w irqs the kernel has a single function common_interrupt that prepares the CPU registers and then calls do_IRQ. do_IRQ eventually calls the desired device ISR function.
  • The ISR determines how to react to the interrupt. This typically entails reading the device status register for interrupt flags and calling the necessary sub-function to operate on the device.
  • Typically the ISR will retire/clear each interrupt during the interrupt handling. The device may also clear its interrupt status itself when it decides the kernel has handled it.
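
The split between provisioning (done once) and handling (done on every interrupt) can be illustrated with a toy model. This is only a sketch in shell, not kernel code: request_irq and do_IRQ are reused as names for orientation, serial_isr is a hypothetical handler, and the lookup table is just a set of shell variables.

```shell
# Provisioning: remember which handler function serves which IRQ number.
request_irq() {            # usage: request_irq <irq> <handler-function>
    eval "irq_handler_$1=\$2"
}

# Handling: look up the handler registered for the IRQ and call it.
do_IRQ() {                 # usage: do_IRQ <irq>
    eval "handler=\${irq_handler_$1:-}"
    if [ -n "$handler" ]; then
        "$handler" "$1"
    else
        echo "irq $1: no handler registered"
    fi
}

# Hypothetical ISR standing in for a real device handler.
serial_isr() { echo "irq $1: serial ISR reads the UART status register"; }

request_irq 4 serial_isr   # once, at device init
do_IRQ 4                   # on every interrupt
do_IRQ 7                   # an IRQ nobody asked for
```

The point of the model is only the indirection: the common entry path knows nothing about the device, and the device-specific behaviour lives entirely in the registered function.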

QEMU VM H/W Interrupt Handling

Now that we have a basic idea of how the kernel handles an interrupt, we turn to the objective of this paper: how QEMU simulates a h/w interrupt to the Linux kernel running in the guest.

There are a minimum of three QEMU pthreads involved to simulate an interrupt:

  • The main thread is event driven, primarily handling I/O events.
  • The vCPU signal thread (running qemu_dummy_cpu_thread_fn) is a small POSIX signal handler supporting the primary vCPU thread.
  • The vCPU thread (running qemu_tcg_cpu_thread_fn) is the primary vCPU thread executing the guest code generated in the TCG translation block.

Some notes:

Each configured vCPU has its own instance of the second and third threads above. If accel = "kvm" then the vCPU thread is qemu_kvm_cpu_thread_fn. Increase the number of cpus for more parallel guest code execution. A more complex QEMU host environment can spin off multiple iothreads.

In QEMU, there are a small number of interrupt types (see cpu-all.h) organized as a bitmask, one of which is CPU_INTERRUPT_HARD (0x0002) indicating the interrupt is generated by a simulated device in the QEMU VM.
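
Keeping the interrupt types in a bitmask means several can be pending in one word and each can be raised or retired without disturbing the others. A minimal sketch in shell arithmetic: only the CPU_INTERRUPT_HARD value comes from the text above, and CPU_INTERRUPT_OTHER is a hypothetical second type added purely for illustration.

```shell
CPU_INTERRUPT_HARD=$((0x0002))    # value quoted above
CPU_INTERRUPT_OTHER=$((0x0004))   # hypothetical second type, illustration only

interrupt_request=0

# Raise both types: each one sets its own bit.
interrupt_request=$(( interrupt_request | CPU_INTERRUPT_HARD ))
interrupt_request=$(( interrupt_request | CPU_INTERRUPT_OTHER ))
printf 'after raising both:   0x%04x\n' "$interrupt_request"

# Retire only the hardware interrupt: the other bit stays pending.
interrupt_request=$(( interrupt_request & ~CPU_INTERRUPT_HARD ))
printf 'after servicing HARD: 0x%04x\n' "$interrupt_request"
```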

QEMU Main Thread - CPU_INTERRUPT_HARD

All function references below are for QEMU source code.

  • QEMU VM os_host_main_loop_wait blocks on glib_pollfds_poll
  • glib_pollfds_poll waits on the global gpollfds (gpollfds->len, glib_n_poll_fds) or a timeout
  • device-specific initialization adds callbacks to the poll list
  • gsi_handler calls qemu_set_irq for both s->i8259_irq[n] => pic_set_irq and s->ioapic_irq[n] => ioapic_set_irq
  • ioapic_set_irq calls ioapic_service
  • through the memory interface this calls apic_mem_write(opaque=??, addr=4100, val=37, size=4)
  • this calls apic_send_msi, apic_deliver_irq, apic_bus_deliver, apic_set_irq, apic_update_irq
  • apic_update_irq calls cpu_interrupt(cpu, CPU_INTERRUPT_HARD)
  • this uses cpu_interrupt_handler which is configured to tcg_handle_interrupt for the TCG
  • tcg_handle_interrupt sets cpu->interrupt_request |= CPU_INTERRUPT_HARD and calls qemu_cpu_kick
  • qemu_cpu_kick calls pthread_kill to send a SIG_IPI (SIGUSR1) to the vCPU, which simulates the interrupt controller asserting the INTR pin

The interrupt type set in the cpu->interrupt_request field will be used by the vCPU thread. It will pull the IRQ from the APIC data structure as set in apic_update_irq.

QEMU vCPU Signal Thread

All function references below are for QEMU source code.

  • when SIG_IPI is received, the sigwait loop exits and the thread calls qemu_wait_io_event(cpu)
  • qemu_wait_io_event has qemu_cond_wait that falls through to qemu_wait_io_event_common(cpu). Notice that cpu->halt_cond is a pthread condition variable.

QEMU vCPU Thread - CPU_INTERRUPT_HARD

All function references below are for QEMU source code

  • Finish running the current TCG Translation Block code and drop out of cpu_exec.
  • Loop around and re-enter cpu_exec, enter cpu_handle_interrupt and check cpu->interrupt_request for a pending interrupt.
  • This eventually calls cc->cpu_exec_interrupt which maps to x86_cpu_exec_interrupt.
  • x86_cpu_exec_interrupt switches on cpu->interrupt_request for case CPU_INTERRUPT_HARD.
  • The CPU_INTERRUPT_HARD fragment calls cpu_get_pic_interrupt(env) to get the IRQ num and then do_interrupt_x86_hardirq(env, intno, is_hw=1)
  • This calls do_interrupt_all(env_archcpu(env), intno, 0, 0, 0, is_hw) which calls do_interrupt64.
  • do_interrupt64 sets up the guest x86 registers for the ISR and calls cpu_x86_load_seg_cache.
  • cpu_exec then drops down to tb_find and cpu_loop_exec_tb to run the ISR.
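
The ordering in the list above (finish the current TB, check the request word, fetch the vector, run the ISR as the next TB) can be sketched as a toy model. This is an illustration under assumptions, not QEMU code: apic_pending_vector and raise_hard_irq are hypothetical stand-ins for the APIC state and the main-thread path, and cpu_handle_interrupt is reduced to a single check.

```shell
CPU_INTERRUPT_HARD=$((0x0002))   # value quoted earlier in the text
interrupt_request=0
apic_pending_vector=0            # hypothetical stand-in for the APIC state

# Main-thread side: latch the vector, then flag the vCPU.
raise_hard_irq() {
    apic_pending_vector=$1
    interrupt_request=$(( interrupt_request | CPU_INTERRUPT_HARD ))
}

# vCPU side: runs between translation blocks, never in the middle of one.
cpu_handle_interrupt() {
    if [ $(( interrupt_request & CPU_INTERRUPT_HARD )) -ne 0 ]; then
        # Clear the request bit before delivering, so it is serviced once.
        interrupt_request=$(( interrupt_request & ~CPU_INTERRUPT_HARD ))
        echo "deliver vector $apic_pending_vector, next TB is the guest ISR"
    else
        echo "no interrupt pending, next TB continues guest code"
    fi
}

cpu_handle_interrupt      # nothing raised yet
raise_hard_irq 37         # the ttyS0 vector used in the example below
cpu_handle_interrupt      # delivers vector 37
cpu_handle_interrupt      # request bit was cleared, nothing to do
```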

Example: QEMU ttyS0 Interrupt Path

This section attempts to give a concrete, simple example of how QEMU simulates a h/w device: a console UART manifested to the guest kernel as a serial driver using the ttyS0 device node. This example will illustrate reading a single character from ttyS0.

As seen above there is a fairly clean and linear path from the main thread to the vCPU thread for a h/w interrupt. Similar to the Linux kernel device framework, there is a small number of common frameworks with device-specific function callbacks.

QEMU Main Thread - serial device

QEMU simulates an 8250/16550 UART in serial-isa.c and serial.c.

These are the actions for the simulated tty h/w using a host pty(7).

  • pty_chr_state creates a new channel in the poll list and adds pty_chr_read_poll, pty_chr_read for glib_pollfds_poll
  • pty_chr_read calls qemu_chr_be_can_write to read a char from the linux pty
  • serial_be_realize sets the handler to serial_receive1
  • serial_receive1 uses recv_fifo_put(s, buf[i]) to add chars to the recv_fifo, starts the fifo_timeout_timer and finally calls serial_update_irq to set the serial irq=4.
  • The IRQ is raised when fifo_timeout_timer expires (fifo_timeout_int). This is consistent behavior with the real hardware which collects bytes in a FIFO for a small time to reduce interrupts.
  • The guest ISR will read each char from the QEMU FIFO until empty (guest reads so QEMU calls serial_ioport_read).
  • Finally QEMU (simulating the device) will clear the data ready interrupt flag telling the guest to exit its ISR.

QEMU vCPU Thread - serial device

  • The interrupt vector is 37, which is the Interrupt vector table mapping for IRQ=4 for the ttyS0 isa-serial device.
  • So for a tty interrupt QEMU calls do_interrupt_x86_hardirq(env, intno=37, is_hw=1), which eventually becomes do_interrupt64(env, intno=37,...) to trigger the serial ISR in the guest, or more accurately to get a new TB containing the ISR to execute.

Guest kernel driver

This is the serial ISR running in the guest, or, more consistently with the vCPU description, the translated code in the new TB!

  • serial8250_handle_irq is provisioned on IRQ=4 during boot (Provisioning an IRQ)
  • when IRQ=4 is received by do_IRQ it calls the handler/ISR
  • The ISR reads the serial UART status register to determine the interrupt then calls serial8250_rx_chars to read the FIFO until the interrupt flag is cleared.

QEMU and Kernel Interrupt Diagnostics

From the Physical Hardware versus QEMU VM chapter, here are some of the QEMU and Linux diagnostic tools I used for this paper.

QEMU VM

QEMU Monitor

Here are some useful monitor commands

  • info irq: interrupt count by IRQ (4 is ttyS0)
  • info pic: The IRQ pin to Interrupt vector table mapping (IRQ 4 maps to vector 37)
  • info qtree: Show VM device tree, for example the isa-serial device:

        dev: isa-serial, id ""
          index = 0 (0x0)
          iobase = 1016 (0x3f8)
          irq = 4 (0x4)
          chardev = "serial0"
          wakeup = 0 (0x0)
          isa irq 4

  • info chardev: VM character device mapping to host device

QEMU Log and Trace

Logging can give a high-level view of a running VM but I find GDB more powerful (and flexible!) for deep inspection.

The command line argument -d int gives a full cpu_dump_state on each interrupt. Vector 0xec is for the ICH9 APIC_LVT_TIMER, the origin of the guest kernel timekeeping tick:

85547: v=ec e=0000 i=0 cpl=0 IP=0010:ffffffff81a2cc92 pc=ffffffff81a2cc92
SP=0018:ffffffff82603e08 env->regs[R_EAX]=ffffffff81a2c870
RAX=ffffffff81a2c870 RBX=0000000000000000 RCX=0000000000000001 RDX=ffff88817ba23f00
RSI=0000000000000083 RDI=0000000000000000 RBP=ffffffff82603e08 RSP=ffffffff82603e08
R8 =ffff88817ba1cf80 R9 =0000000000022b00 R10=ffffffff82603e10 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=000000007dcd7a93
RIP=ffffffff81a2cc92 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 0000000000000000 00000000 00000000
GS =0000 ffff88817ba00000 00000000 00000000
LDT=0000 0000000000000000 00000000 00008200 DPL=0 LDT
TR =0040 fffffe0000003000 0000206f 00008900 DPL=0 TSS64-avl
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=000055adc3aa2360 CR3=0000000175f5a000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000044 CCD=0000000000000000 CCO=EFLAGS
EFER=0000000000000d01

QEMU VM trace calls can be enabled using -trace events=<filename>. The full list of trace events can be seen with $Q_P -trace help. A good example list for serial interrupts is:

ioapic_set_irq
apic_deliver_irq
apic_local_deliver
serial_ioport_read
serial_ioport_write

When I type a character in the host ttyS0 screen I see the following in the -D logfile showing the interaction between the IO Advanced Programmable Interrupt Controller (in the simulated ICH9 h/w) and the serial UART (also in the ICH9):

8113@1613152133.088571:ioapic_set_irq vector: 4 level: 0
8113@1613152133.093048:ioapic_set_irq vector: 4 level: 1
8113@1613152133.093113:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.093463:serial_ioport_read read addr 0x02 val 0xcc
8118@1613152133.093506:serial_ioport_read read addr 0x05 val 0x61
8118@1613152133.093535:ioapic_set_irq vector: 4 level: 0
8118@1613152133.093564:serial_ioport_read read addr 0x00 val 0x61
8118@1613152133.093607:serial_ioport_read read addr 0x05 val 0x60
8118@1613152133.093792:serial_ioport_read read addr 0x06 val 0xb0
8118@1613152133.093833:serial_ioport_read read addr 0x02 val 0xc1
8118@1613152133.094239:serial_ioport_write write addr 0x01 val 0x07
8118@1613152133.094263:ioapic_set_irq vector: 4 level: 1
8118@1613152133.094278:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.094355:ioapic_set_irq vector: 4 level: 0
8118@1613152133.094369:serial_ioport_read read addr 0x02 val 0xc2
8118@1613152133.094383:serial_ioport_read read addr 0x05 val 0x60
8118@1613152133.094396:serial_ioport_read read addr 0x06 val 0xb0
8118@1613152133.094410:serial_ioport_write write addr 0x01 val 0x05
8118@1613152133.094421:ioapic_set_irq vector: 4 level: 0
8118@1613152133.094441:serial_ioport_read read addr 0x02 val 0xc1
8113@1613152133.094536:apic_local_deliver vector 0 delivery mode 0
8118@1613152133.096998:serial_ioport_write write addr 0x01 val 0x07
8118@1613152133.097004:ioapic_set_irq vector: 4 level: 1
8118@1613152133.097007:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.097050:ioapic_set_irq vector: 4 level: 0
8118@1613152133.097053:serial_ioport_read read addr 0x02 val 0xc2
8118@1613152133.097056:serial_ioport_read read addr 0x05 val 0x60
8118@1613152133.097059:serial_ioport_read read addr 0x06 val 0xb0
8118@1613152133.097062:serial_ioport_write write addr 0x00 val 0x61
8118@1613152133.097065:ioapic_set_irq vector: 4 level: 0
8118@1613152133.097067:ioapic_set_irq vector: 4 level: 1
8118@1613152133.097069:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.097094:serial_ioport_write write addr 0x01 val 0x05
8118@1613152133.097097:ioapic_set_irq vector: 4 level: 0
8118@1613152133.097102:serial_ioport_read read addr 0x02 val 0xc1
8118@1613152133.097147:serial_ioport_read read addr 0x02 val 0xc1
8118@1613152133.097174:serial_ioport_write write addr 0x01 val 0x07
8118@1613152133.097177:ioapic_set_irq vector: 4 level: 1
8118@1613152133.097198:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.097223:ioapic_set_irq vector: 4 level: 0
8118@1613152133.097226:serial_ioport_read read addr 0x02 val 0xc2
8118@1613152133.097228:serial_ioport_read read addr 0x05 val 0x60
8118@1613152133.097230:serial_ioport_read read addr 0x06 val 0xb0
8118@1613152133.097252:serial_ioport_write write addr 0x01 val 0x05
8118@1613152133.097254:ioapic_set_irq vector: 4 level: 0
8118@1613152133.097257:serial_ioport_read read addr 0x02 val 0xc1
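
A log like this gets long quickly, so a per-event count is a useful first look. The following is a sketch, not a QEMU tool: summarize_trace and sample_trace are hypothetical names, and it assumes that the number before the @ identifies the emitting thread (the log above does not say so explicitly, but exactly two values appear). The sample input is the first few lines of the log above.

```shell
# Count trace events per (id before '@', event name).
# Reads a -D logfile on stdin or from the files given as arguments.
summarize_trace() {
    awk '
        /^[0-9]+@[0-9.]+:/ {
            # $1 looks like 8118@1613152133.093463:serial_ioport_read
            split($1, head, "[@:]")      # head[1]=id, head[2]=time, head[3]=event
            count[head[1] " " head[3]]++
        }
        END { for (k in count) print count[k], k }
    ' "$@" | sort -k2,2 -k3,3
}

sample_trace() {
    cat <<'EOF'
8113@1613152133.088571:ioapic_set_irq vector: 4 level: 0
8113@1613152133.093048:ioapic_set_irq vector: 4 level: 1
8113@1613152133.093113:apic_deliver_irq dest 1 dest_mode 1 delivery_mode 0
  vector 37 trigger_mode 0
8118@1613152133.093463:serial_ioport_read read addr 0x02 val 0xcc
8118@1613152133.093506:serial_ioport_read read addr 0x05 val 0x61
8118@1613152133.093535:ioapic_set_irq vector: 4 level: 0
8118@1613152133.093564:serial_ioport_read read addr 0x00 val 0x61
EOF
}

sample_trace | summarize_trace
```

Each output line is a count, the id and the event name; continuation lines such as the wrapped vector 37 line are skipped because they do not start with an id. On the real file, summarize_trace /tmp/run1.log does the same for the whole run.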

QEMU GDB

The runtime logs are great but the most useful tool for QEMU VM analysis is GDB (maybe I mentioned that already). In order to save time, and maximize its power, I create a gdb command script to start and analyze the VM. Here is a (very small) fragment of the script showing the $Q_P command line arguments and then a set of conditional breakpoints for the ttyS0 interrupt work.

By default gdb will stop all threads when a breakpoint is hit. I disable this so that only the thread hitting the breakpoint will stop. This reduces timekeeping source switching and CPU lockup kernel oops on the ttyS0 console, which really floods the serial interrupt logic. After hitting a breakpoint, the desired thread may need to be selected using info thr; thr 3 in this case.

# log all gdb output
set logging file /tmp/I1/qemu_gdb.log
set logging on

set pagination off

# stop only on current thread
set non-stop on

# set $Q_P arguments
set args -nodefaults \
-readconfig $Q_GIT/qemu_gdb.ovmf.cfg \
-display none \
-D /tmp/I1/qemu.log \
-trace events=/tmp/I1/pic.events \
-serial pty -monitor pty

# serial handling
# break in main thread reading the pty simulation of ttyS0
# b serial_receive1
# break in vCPU thread guest command to read the uart
# (serial_ioport_read has no intno argument, so no condition here)
# b serial_ioport_read
# break in vCPU on sending hardirq to guest
# b do_interrupt_all if (intno == 37)

# may need to switch to bp thread to show stack
# i thr
# thr 3
bt

The serial_ioport_read backtrace is instructive:

#0  0x0000555555abe66d in serial_ioport_read (opaque=0x5555571f79a0, addr=5,
  size=1) at /opt/distros/qemu-5.1.0/hw/char/serial.c:485
#1  0x0000555555989b6a in memory_region_read_accessor (mr=0x5555571f7b00,
  addr=5, value=0x7fffe933d1c0, size=1, shift=0, mask=255, attrs=...) at
  /opt/distros/qemu-5.1.0/softmmu/memory.c:434
#2  0x000055555598a02f in access_with_adjusted_size (addr=5,
  value=0x7fffe933d1c0, size=1, access_size_min=1, access_size_max=1,
  access_fn=0x555555989b2c <memory_region_read_accessor>, mr=0x5555571f7b00,
  attrs=...) at /opt/distros/qemu-5.1.0/softmmu/memory.c:544
#3  0x000055555598cb75 in memory_region_dispatch_read1 (mr=0x5555571f7b00,
  addr=5, pval=0x7fffe933d1c0, size=1, attrs=...) at
  /opt/distros/qemu-5.1.0/softmmu/memory.c:1385
#4  0x000055555598cc4a in memory_region_dispatch_read (mr=0x5555571f7b00,
  addr=5, pval=0x7fffe933d1c0, op=MO_8, attrs=...) at
  /opt/distros/qemu-5.1.0/softmmu/memory.c:1413
#5  0x000055555583cf76 in address_space_ldub (as=0x555556825420
  <address_space_io>, addr=1021, attrs=..., result=0x0) at
  /opt/distros/qemu-5.1.0/memory_ldst.inc.c:175
#6  0x0000555555a01c31 in helper_inb (env=0x555556b31560, port=1021) at
  /opt/distros/qemu-5.1.0/target/i386/misc_helper.c:44
#7  0x00007fffa5d303b8 in code_gen_buffer ()
#8  0x00005555558bdf2e in cpu_tb_exec (cpu=0x555556b28d00, itb=0x7fffa5e66980
  <code_gen_buffer+31877459>) at
  /opt/distros/qemu-5.1.0/accel/tcg/cpu-exec.c:172
#9  0x00005555558bee86 in cpu_loop_exec_tb (cpu=0x555556b28d00,
  tb=0x7fffa5e66980 <code_gen_buffer+31877459>, last_tb=0x7fffe933d7b8,
  tb_exit=0x7fffe933d7b0) at /opt/distros/qemu-5.1.0/accel/tcg/cpu-exec.c:636
#10 0x00005555558bf1aa in cpu_exec (cpu=0x555556b28d00) at
  /opt/distros/qemu-5.1.0/accel/tcg/cpu-exec.c:749
#11 0x00005555559846eb in tcg_cpu_exec (cpu=0x555556b28d00) at
  /opt/distros/qemu-5.1.0/softmmu/cpus.c:1356
#12 0x0000555555984f41 in qemu_tcg_cpu_thread_fn (arg=0x555556b28d00) at
  /opt/distros/qemu-5.1.0/softmmu/cpus.c:1664
#13 0x0000555555e5ec8d in qemu_thread_start (args=0x555556b59370) at
  /opt/distros/qemu-5.1.0/util/qemu-thread-posix.c:521
#14 0x00007ffff3c406db in start_thread (arg=0x7fffe933e700) at
  pthread_create.c:463
#15 0x00007ffff396971f in clone () at
  ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

One can see the TCG running and the VM reading the serial device. Notice the helper_inb call. This is a TCG wrapper around the X86 inb instruction. A developer can write helper_ wrappers around specific guest instructions to control, monitor, even modify the particular guest instruction.

Now if that isn't COOL I don't know what is.

Guest Kernel

Linux kernel inspection is useful to match up the QEMU VM interrupts.

  • cat /proc/interrupts: great information about all interrupt counts, bus interface, type, device for each CPU.
  • dmesg to inspect ACPI interrupt settings
  • lspci -b -vvv | grep "Interrupt: pin": see what IRQ pins are used on the PCI bus

One can even match the ACPI definitions from dmesg with the OVMF ACPI definition file (./roms/edk2/OvmfPkg/AcpiTables/Dsdt.asl.)
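
To compare the guest's view with the QEMU monitor, it helps to pull a single IRQ's count out of /proc/interrupts. A minimal sketch: irq_total and sample_interrupts are hypothetical names, and the sample numbers are made up for illustration (they are not from my guest).

```shell
# Sum the per-CPU counts for one IRQ from /proc/interrupts-formatted text
# on stdin and print "<total> <last column>" (the last column is the device).
irq_total() {              # usage: irq_total <irq> < /proc/interrupts
    awk -v irq="$1" '
        NR == 1 { ncpu = NF; next }          # header row: one column per CPU
        $1 == (irq ":") {
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            print total, $NF
        }
    '
}

# Made-up sample in the single-CPU layout of this guest.
sample_interrupts() {
    cat <<'EOF'
           CPU0
  0:         10   IO-APIC   2-edge      timer
  4:        155   IO-APIC   4-edge      ttyS0
EOF
}

sample_interrupts | irq_total 4
```

In the guest, running irq_total 4 < /proc/interrupts before and after typing on ttyS0 gives a count that can be set against info irq in the monitor.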

Summary

I have given a fairly narrow explanation of how a QEMU VM generates interrupts to a guest OS. This paper can only give a cursory view into the power of QEMU. It is such an expansive topic, only made more so by the different target hardware architectures (x86, ARM, aarch64, SPARC - VERY underrated, PPC - no program counter!, s390), different device types (disk, network, GPIO, SPI, I2C; while not endless, the list is certainly not trivial) and the huge amount of technological progress it has absorbed. The TCG is an amazing engine (and so malleable with the helper_ wrappers).

The QEMU developers produced, and continue to enhance, a wonderful sandbox to play in. The decision many chip manufacturers made to add VM capabilities to their chipsets only highlights how significant this tech is!

Finally, I will add a plug for my next paper: the QEMU gdbstub interface is the easiest way to step through a fully functional guest kernel. The ability to instrument the QEMU VM source code to provide even more insight is "icing on the cake."
