2018-09-03

As already know, we can access devices by PIO and MMIO to drive devices to work. For PIO, we can set the VMCS to intercept the specified port. But how MMIO emulation is implemented? This blog will discuss this.

For a summary, the following shows the process of MMIO implementation:

  1. QEMU declares a memory region(but not allocate ram or commit it to kvm)
  2. Guest first access the MMIO address, cause a EPT violation VM-exit
  3. KVM construct the EPT page table and marks the page table entry with special mark(110b)
  4. Later the guest access these MMIO, it will be processed by EPT misconfig VM-exit handler

QEMU part

QEMU uses function ‘memory_region_init_io’ to declare a MMIO region. Here we can see the ‘mr->ram’ is false so no really memory is allocated.

void memory_region_init_io(MemoryRegion *mr,
                           Object *owner,
                           const MemoryRegionOps *ops,
                           void *opaque,
                           const char *name,
                           uint64_t size)
{
    memory_region_init(mr, owner, name, size);
    mr->ops = ops ? ops : &unassigned_mem_ops;
    mr->opaque = opaque;
    mr->terminates = true;
}

void memory_region_init_ram(MemoryRegion *mr,
                            Object *owner,
                            const char *name,
                            uint64_t size,
                            Error **errp)
{
    memory_region_init(mr, owner, name, size);
    mr->ram = true;
    mr->terminates = true;
    mr->destructor = memory_region_destructor_ram;
    mr->ram_block = qemu_ram_alloc(size, mr, errp);
    mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
}

When we commit this region to kvm it calls ‘kvm_region_add’ and ‘kvm_set_phys_mem’ will be called. If this is not a ram, it will just return and no memslot created/updated in kvm.

static void kvm_set_phys_mem(KVMMemoryListener *kml,
                             MemoryRegionSection *section, bool add)
{
    KVMState *s = kvm_state;
    KVMSlot *mem, old;
    int err;
    MemoryRegion *mr = section->mr;
    bool writeable = !mr->readonly && !mr->rom_device;


    if (!memory_region_is_ram(mr)) {
        if (writeable || !kvm_readonly_mem_allowed) {
            return;
        } else if (!mr->romd_mode) {
            /* If the memory device is not in romd_mode, then we actually want
             * to remove the kvm memory slot so all accesses will trap. */
            add = false;
        }
    }
}

KVM part

In ‘vmx_init’, when ept enabled, it calls ‘ept_set_mmio_spte_mask’.

static void ept_set_mmio_spte_mask(void)
{
	/*
	* EPT Misconfigurations can be generated if the value of bits 2:0
	* of an EPT paging-structure entry is 110b (write/execute).
	* Also, magic bits (0x3ull << 62) is set to quickly identify mmio
	* spte.
	*/
	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
{
	shadow_mmio_mask = mmio_mask;
}

Here set ‘shadow_mmio_mask’.

We the guest access the MMIO address, the VM will exit caused by ept violation and ‘tdp_page_fault’ will be called. ‘__direct_map’ will be called to construct the EPT page table.

After the long call-chain, the final function ‘mark_mmio_spte’ will be called to set the spte with ‘shadow_mmio_mask’ which as we already know is set when the vmx initialization.

__direct_map
    -->mmu_set_spte
        -->set_spte
            -->set_mmio_spte
                -->mark_mmio_spte

The condition to call ‘mark_mmio_spte’ is ‘is_noslot_pfn’.

static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
			  pfn_t pfn, unsigned access)
{
	if (unlikely(is_noslot_pfn(pfn))) {
		mark_mmio_spte(kvm, sptep, gfn, access);
		return true;
	}

	return false;
}

static inline bool is_noslot_pfn(pfn_t pfn)
{
	return pfn == KVM_PFN_NOSLOT;
}

As we know the QEMU doesn’t commit the MMIO memory region, so pfn is ‘KVM_PFN_NOSLOT’ and then mark the spte with ‘shadow_mmio_mask’.

When the guest later access this MMIO page, as it’s ept page table entry is 110b, this will cause the VM exit by EPT misconfig, any how can a page be write/execute but no read permission. In the handler ‘handle_ept_misconfig’ it first process the MMIO case this will dispatch to the QEMU part.

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
	u64 sptes[4];
	int nr_sptes, i, ret;
	gpa_t gpa;

	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);

	ret = handle_mmio_page_fault_common(vcpu, gpa, true);
	if (likely(ret == RET_MMIO_PF_EMULATE))
		return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
					      EMULATE_DONE;

	if (unlikely(ret == RET_MMIO_PF_INVALID))
		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);

	if (unlikely(ret == RET_MMIO_PF_RETRY))
		return 1;

	/* It is the real ept misconfig */
	printk(KERN_ERR "EPT: Misconfiguration.\n");
	printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);

	nr_sptes = kvm_mmu_get_spte_hierarchy(vcpu, gpa, sptes);

	for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
		ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);

	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
	vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;

	return 0;
}


blog comments powered by Disqus