diff --git a/articles/20221011-riscv-kvm-mem-virt-impl.md b/articles/20221011-riscv-kvm-mem-virt-impl.md
new file mode 100644
index 0000000000000000000000000000000000000000..13c34e5fc7312c48eb056c1d3405053b48878420
--- /dev/null
+++ b/articles/20221011-riscv-kvm-mem-virt-impl.md
@@ -0,0 +1,936 @@
+> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [spaces images urls]
+> Author: XiakaiPan <13212017962@163.com>
+> Date: 2022/10/11
+> Revisor: walimis, Falcon
+> Project: [RISC-V Linux Kernel Analysis](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [RISC-V Virtualization Technology Survey and Analysis](https://gitee.com/tinylab/riscv-linux/issues/I5E4VB)
+> Sponsor: PLCT Lab, ISCAS
+
+# Implementation of RISC-V Memory Virtualization in KVM and kvmtool
+
+## Preface
+
+In the RISC-V privileged ISA, the H extension specifies the CSRs, instructions, and mechanisms required for virtualization support. The [previous article][3] analyzed the memory-virtualization mechanisms at the ISA level; building on that, this article discusses how KVM implements them.
+
+## Software versions
+
+| Software     | Commit ID or version                     | Link                               |
+|--------------|------------------------------------------|------------------------------------|
+| Linux Kernel | 5.19-rc5                                 | https://www.kernel.org/            |
+| kvmtool      | 6a1f699108e5c2a280d7cd1f1ae4816b8250a29f | https://github.com/kvmtool/kvmtool |
+
+## G-Stage address translation in KVM
+
+KVM exposes the device ioctl interface `kvm_dev_ioctl`, and external VM managers such as kvmtool create virtual machines through it. KVM itself implements RISC-V virtualization according to the RISC-V privileged ISA. RISC-V splits the translation of a guest virtual address into a host physical address into two stages, VS-Stage and G-Stage. VS-Stage works like ordinary address translation on a machine with M/S/U modes. G-Stage, by contrast, has to account for the hypervisor allocating address space to multiple VMs, so it needs additional mechanisms to manage that allocation; this is the part that needs special attention in a virtualization implementation. Compared with other architectures, KVM's RISC-V support therefore consists of a VM creation and management mechanism that follows the RISC-V specification.
+
+This section analyzes the KVM code related to G-Stage address translation. Creating a KVM VM goes through `kvm_create_vm` in `virt/kvm/kvm_main.c`, which calls `kvm_arch_init_vm` in `arch/riscv/kvm/vm.c` for architecture-specific initialization: allocating address space for the VM, initializing the VMID (Virtual Machine IDentifier), and initializing the guest timer. The following subsections go through memory allocation, CSR updates, page-fault handling, and the HFENCE instructions in turn.
+
+![Gstage Address Translation](images/riscv-kvm/vm-impl/gstage-at.png)
+
+### Address definitions
+
+In KVM's RISC-V virtualization code, GVA and GPA denote virtual and physical addresses inside the VM, HVA and HPA denote virtual and physical addresses on the host, and GFN and HFN denote the frame numbers of the guest and host physical pages that contain an address:
+
+```cpp
+/*
+ * Address types:
+ *
+ *  gva - guest virtual address
+ *  gpa - guest physical address
+ *  gfn - guest frame number
+ *  hva - host virtual address
+ *  hpa - host physical address
+ *  hfn - host frame number
+ */
+
+typedef unsigned long  gva_t;
+typedef u64            gpa_t;
+typedef u64            gfn_t;
+
+#define GPA_INVALID	(~(gpa_t)0)
+
+typedef unsigned long  hva_t;
+typedef u64            hpa_t;
+typedef u64            hfn_t;
+```
+
+### Allocating memory for a VM
+
+`kvm_arch_init_vm` in `arch/riscv/kvm/vm.c` calls `kvm_riscv_gstage_alloc_pgd(struct kvm *kvm)` in `arch/riscv/kvm/mmu.c` to allocate memory for the VM. Concretely, it allocates page-managed memory from the hypervisor and hands the KVM VM a page directory (PGD). This happens as part of the hypervisor's memory management, i.e. G-Stage address translation. The implementation is:
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 712
+
+int kvm_riscv_gstage_alloc_pgd(struct kvm *kvm)
+{
+	struct page *pgd_page;
+	// Bail out if a page directory has already been allocated for this VM
+	if (kvm->arch.pgd != NULL) {
+		kvm_err("kvm_arch already initialized?\n");
+		return -EINVAL;
+	}
+	// Allocate zeroed pages for the G-stage page directory
+	pgd_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+				get_order(gstage_pgd_size));
+	if (!pgd_page)
+		return -ENOMEM;
+	kvm->arch.pgd = page_to_virt(pgd_page);
+	kvm->arch.pgd_phys = page_to_phys(pgd_page);
+
+	return 0;
+}
+```
+
+The macros behind the `alloc_pages` arguments are defined as follows:
+
+```cpp
+// include/linux/gfp_types.h: line 333
+#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+// include/linux/gfp_types.h: line 249
+#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+
+// arch/riscv/include/asm/csr.h: line 139
+#define HGATP_PAGE_SHIFT	12
+// arch/riscv/kvm/mmu.c
+#define gstage_pgd_xbits	2
+#define gstage_pgd_size	(1UL << (HGATP_PAGE_SHIFT + gstage_pgd_xbits))
+```
+
+`alloc_pages` is defined as follows:
+
+```cpp
+// include/linux/gfp.h: line 275
+static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+{
+	return alloc_pages_node(numa_node_id(), gfp_mask, order);
+}
+
+// include/linux/gfp.h: line 260
+/*
+ * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
+ * prefer the current CPU's closest node. Otherwise node must be valid and
+ * online.
+ */
+static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
+						unsigned int order)
+{
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
+	return __alloc_pages_node(nid, gfp_mask, order);
+}
+
+// include/linux/gfp.h: line 237
+/*
+ * Allocate pages, preferring the node given as nid. The node must be valid and
+ * online. For more general interface, see alloc_pages_node().
+ */
+static inline struct page *
+__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+{
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+
+	return __alloc_pages(gfp_mask, order, nid, NULL);
+}
+```
+
+The function that finally performs the allocation is `__alloc_pages`, one of the core functions of the Linux buddy allocator; its definition is shown further below. The analysis of this function and the reading of the Linux buddy system here follow [this article][007].
+
+Linux memory management has three levels, from coarse to fine: node, zone, and page. A page is the smallest unit of paging and of low-level memory allocation, 4 KiB in size; the number of a physical page is its pfn (page frame number).
+
+```mermaid
+flowchart
+
+subgraph node0
+
+subgraph zone0
+
+subgraph page00
+end
+page01
+page0N[...]
+end
+
+subgraph zone1
+subgraph page10
+end
+page11
+page1N[...]
+end
+
+zoneN[...]
+
+end
+```
+
+([Download the PNG generated by Mermaid][008])
+
+The buddy system is the algorithm that allocates physical memory. Its basic management unit is the zone and its smallest allocation granularity is the page. The buddy system does not manage page frames individually, however; it manages blocks made up of multiple page frames (pageblocks). A block of order n contains $2^n$ page frames, with n ranging from 0 to 10. All buddy-system allocation interfaces eventually call `__alloc_pages` to perform the allocation.
+
+```cpp
+// mm/page_alloc.c: line 5513
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+							nodemask_t *nodemask)
+{
+	struct page *page;
+
+	/* Preparation code, omitted here */
+	// ...
+
+	/* First allocation attempt: try the freelist */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+	if (likely(page))
+		goto out;
+
+	// ...
+
+	/* If the freelist attempt fails, allocate after memory reclaim, i.e. the slow path */
+	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
+
+out:
+
+	// ...
+
+	return page;
+}
+```
+
+### Freeing VM memory
+
+As with allocation, KVM frees VM memory through the buddy system's API: `kvm_riscv_gstage_free_pgd()` in `arch/riscv/kvm/mmu.c` unmaps the whole guest physical range and then calls `free_pages()` in `mm/page_alloc.c` to release the VM's page directory.
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 731
+void kvm_riscv_gstage_free_pgd(struct kvm *kvm)
+{
+	void *pgd = NULL;
+
+	spin_lock(&kvm->mmu_lock);
+	if (kvm->arch.pgd) {
+		gstage_unmap_range(kvm, 0UL, gstage_gpa_size, false);
+		pgd = READ_ONCE(kvm->arch.pgd);
+		kvm->arch.pgd = NULL;
+		kvm->arch.pgd_phys = 0;
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (pgd)
+		free_pages((unsigned long)pgd, get_order(gstage_pgd_size));
+}
+```
+
+The buddy-system free path looks like this:
+
+```cpp
+// mm/page_alloc.c: 5641
+void free_pages(unsigned long addr, unsigned int order)
+{
+	if (addr != 0) {
+		VM_BUG_ON(!virt_addr_valid((void *)addr));
+		__free_pages(virt_to_page((void *)addr), order);
+	}
+}
+
+// mm/page_alloc.c: 5631
+void __free_pages(struct page *page, unsigned int order)
+{
+	if (put_page_testzero(page))
+		free_the_page(page, order);
+	else if (!PageHead(page))
+		while (order-- > 0)
+			free_the_page(page + (1 << order), order);
+}
+
+// mm/page_alloc.c: 764
+static inline void free_the_page(struct page *page, unsigned int order)
+{
+	if (pcp_allowed_order(order))		/* Via pcp? */
+		free_unref_page(page, order);
+	else
+		__free_pages_ok(page, order, FPI_NONE);
+}
+```
+
+### G-Stage Page Fault
+
+The G-Stage page-fault handler is defined in `arch/riscv/kvm/vcpu_exit.c`, as shown below. The MMIO handlers it calls, `kvm_riscv_vcpu_mmio_load` and `kvm_riscv_vcpu_mmio_store`, are not discussed here; the analysis below focuses on how `kvm_riscv_gstage_map` establishes the G-Stage mapping.
+
+```cpp
+// arch/riscv/kvm/vcpu_exit.c: line 12
+static int gstage_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
+				struct kvm_cpu_trap *trap)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long hva, fault_addr;
+	bool writable;
+	gfn_t gfn;
+	int ret;
+
+	// Recover the faulting guest physical address (gpa) from the trap information
+	fault_addr = (trap->htval << 2) | (trap->stval & 0x3);
+
+	// Convert it to a guest frame number (gfn), then to an hva (host virtual address) that the hypervisor can work with
+	gfn = fault_addr >> PAGE_SHIFT;
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	// Return the hva of a @gfn and the R/W attribute if possible
+	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+
+	// No usable hva, or a store to a read-only slot: treat the access as MMIO and dispatch on scause (load/store)
+	if (kvm_is_error_hva(hva) ||
+		(trap->scause == EXC_STORE_GUEST_PAGE_FAULT && !writable)) {
+		switch (trap->scause) {
+		case EXC_LOAD_GUEST_PAGE_FAULT:
+			return kvm_riscv_vcpu_mmio_load(vcpu, run,
+							fault_addr,
+							trap->htinst);
+		case EXC_STORE_GUEST_PAGE_FAULT:
+			return kvm_riscv_vcpu_mmio_store(vcpu, run,
+							fault_addr,
+							trap->htinst);
+		default:
+			return -EOPNOTSUPP;
+		};
+	}
+
+	// Establish the G-Stage mapping
+	ret = kvm_riscv_gstage_map(vcpu, memslot, fault_addr, hva,
+		(trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ?
+		true : false);
+	if (ret < 0)
+		return ret;
+
+	return 1;
+}
+```
+
+The `trap` argument carries the details of this page fault; its definition is shown below. `sepc`, `scause`, and `stval` reuse the S-mode CSRs of the non-virtualized case, while `htval` and `htinst` are CSRs added by the H extension to support G-Stage.
+
+```cpp
+// arch/riscv/include/asm/kvm_host.h
+struct kvm_cpu_trap {
+	unsigned long sepc;
+	unsigned long scause;
+	unsigned long stval;
+	unsigned long htval;
+	unsigned long htinst;
+};
+```
+
+`kvm_riscv_gstage_map` establishes the mapping in three parts.
+
+The first part is the code between `mmap_read_lock(current->mm);` and `mmap_read_unlock(current->mm);`. It looks up the vma that contains hva (a vma, Virtual Memory Area, describes any part of a virtual address space that has its own page-fault handling rules; see the definition at line 403 of `include/linux/mm_types.h`), derives the mapping page size from it, and fixes the final value of gfn.
+
+The second part calls `gfn_to_pfn_prot` to resolve gfn to hfn.
+
+The third part is the code between `spin_lock(&kvm->mmu_lock);` and `spin_unlock(&kvm->mmu_lock);`, which updates the G-Stage page table. If `mmu_invalidate_retry` reports that an MMU invalidation raced with this fault, the code jumps to `out_unlock` without installing a mapping, then marks hfn accessed and releases it. Otherwise it calls `gstage_map_page` to map gpa to hpa (`hfn << PAGE_SHIFT` in the code), read-only or writable depending on whether the page frame is writable.
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 617
+int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
+			struct kvm_memory_slot *memslot,
+			gpa_t gpa, unsigned long hva, bool is_write)
+{
+	int ret;
+	kvm_pfn_t hfn;
+	bool writable;
+	short vma_pageshift;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	struct vm_area_struct *vma;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_mmu_memory_cache *pcache = &vcpu->arch.mmu_page_cache;
+	bool logging = (memslot->dirty_bitmap &&
+			!(memslot->flags & KVM_MEM_READONLY)) ?
+			true : false;
+	unsigned long vma_pagesize, mmu_seq;
+
+	mmap_read_lock(current->mm);
+
+	vma = find_vma_intersection(current->mm, hva, hva + 1);
+	if (unlikely(!vma)) {
+		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+		mmap_read_unlock(current->mm);
+		return -EFAULT;
+	}
+
+	if (is_vm_hugetlb_page(vma))
+		vma_pageshift = huge_page_shift(hstate_vma(vma));
+	else
+		vma_pageshift = PAGE_SHIFT;
+	vma_pagesize = 1ULL << vma_pageshift;
+	if (logging || (vma->vm_flags & VM_PFNMAP))
+		vma_pagesize = PAGE_SIZE;
+
+	if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
+		gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
+
+	mmap_read_unlock(current->mm);
+
+	if (vma_pagesize != PGDIR_SIZE &&
+		vma_pagesize != PMD_SIZE &&
+		vma_pagesize != PAGE_SIZE) {
+		kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
+		return -EFAULT;
+	}
+
+	/* We need minimum second+third level pages */
+	ret = kvm_mmu_topup_memory_cache(pcache, gstage_pgd_levels);
+	if (ret) {
+		kvm_err("Failed to topup G-stage cache\n");
+		return ret;
+	}
+
+	mmu_seq = kvm->mmu_invalidate_seq;
+
+	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writable);
+	if (hfn == KVM_PFN_ERR_HWPOISON) {
+		send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
+				vma_pageshift, current);
+		return 0;
+	}
+	if (is_error_noslot_pfn(hfn))
+		return -EFAULT;
+
+	/*
+	 * If logging is active then we allow writable pages only
+	 * for write faults.
+	 */
+	if (logging && !is_write)
+		writable = false;
+
+	spin_lock(&kvm->mmu_lock);
+
+	if (mmu_invalidate_retry(kvm, mmu_seq))
+		goto out_unlock;
+
+	if (writable) {
+		kvm_set_pfn_dirty(hfn);
+		mark_page_dirty(kvm, gfn);
+		ret = gstage_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+					vma_pagesize, false, true);
+	} else {
+		ret = gstage_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+					vma_pagesize, true, true);
+	}
+
+	if (ret)
+		kvm_err("Failed to map in G-stage\n");
+
+out_unlock:
+	spin_unlock(&kvm->mmu_lock);
+	kvm_set_pfn_accessed(hfn);
+	kvm_release_pfn_clean(hfn);
+	return ret;
+}
+```
+
+Next comes `gstage_map_page`, the core function of G-Stage page-fault handling:
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 177
+static int gstage_map_page(struct kvm *kvm,
+			struct kvm_mmu_memory_cache *pcache,
+			gpa_t gpa, phys_addr_t hpa,
+			unsigned long page_size,
+			bool page_rdonly, bool page_exec)
+{
+	int ret;
+	u32 level = 0;
+	pte_t new_pte;
+	pgprot_t prot;
+
+	// Map page_size to a page-table level; a nonzero return means no page of that size exists
+	ret = gstage_page_size_to_level(page_size, &level);
+	if (ret)
+		return ret;
+
+	/*
+	 * A RISC-V implementation can choose to either:
+	 * 1) Update 'A' and 'D' PTE bits in hardware
+	 * 2) Generate page fault when 'A' and/or 'D' bits are not set
+	 *    PTE so that software can update these bits.
+	 *
+	 * We support both options mentioned above. To achieve this, we
+	 * always set 'A' and 'D' PTE bits at time of creating G-stage
+	 * mapping. To support KVM dirty page logging with both options
+	 * mentioned above, we will write-protect G-stage PTEs to track
+	 * dirty pages.
+	 */
+
+	// Choose the access permissions for the page
+	if (page_exec) {
+		if (page_rdonly)
+			prot = PAGE_READ_EXEC;
+		else
+			prot = PAGE_WRITE_EXEC;
+	} else {
+		if (page_rdonly)
+			prot = PAGE_READ;
+		else
+			prot = PAGE_WRITE;
+	}
+	// Build the PTE for hpa with these permissions and mark it dirty
+	new_pte = pfn_pte(PFN_DOWN(hpa), prot);
+	new_pte = pte_mkdirty(new_pte);
+
+	// Install the new PTE for gpa at the chosen level
+	return gstage_set_pte(kvm, level, pcache, gpa, &new_pte);
+}
+```
+
+Inside `gstage_set_pte`, the code walks the hypervisor's G-Stage page table level by level until it reaches the table at the specified level, then operates on the entry there (assigns it and, if it is a leaf entry, flushes the TLB):
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 137
+static int gstage_set_pte(struct kvm *kvm, u32 level,
+			struct kvm_mmu_memory_cache *pcache,
+			gpa_t addr, const pte_t *new_pte)
+{
+	u32 current_level = gstage_pgd_levels - 1;
+	pte_t *next_ptep = (pte_t *)kvm->arch.pgd;
+	pte_t *ptep = &next_ptep[gstage_pte_index(addr, current_level)];
+
+	if (current_level < level)
+		return -EINVAL;
+
+	while (current_level != level) {
+		if (gstage_pte_leaf(ptep))
+			return -EEXIST;
+
+		// If the entry is empty, allocate a next-level table from pcache; fail with -ENOMEM if no cache was supplied or the allocation fails
+		if (!pte_val(*ptep)) {
+			if (!pcache)
+				return -ENOMEM;
+			next_ptep = kvm_mmu_memory_cache_alloc(pcache);
+			if (!next_ptep)
+				return -ENOMEM;
+			*ptep = pfn_pte(PFN_DOWN(__pa(next_ptep)),
+					__pgprot(_PAGE_TABLE));
+		} else {
+			if (gstage_pte_leaf(ptep))
+				return -EEXIST;
+			next_ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
+		}
+
+		current_level--;
+		ptep = &next_ptep[gstage_pte_index(addr, current_level)];
+	}
+
+	// Write the new PTE (carrying the permission bits and dirty flag set earlier)
+	*ptep = *new_pte;
+	// If it is a leaf entry, flush the matching TLB entries
+	if (gstage_pte_leaf(ptep))
+		gstage_remote_tlb_flush(kvm, current_level, addr);
+
+	return 0;
+}
+```
+
+The level-by-level walk works by alternately updating `next_ptep` and `ptep`: `next_ptep` starts at the page directory (the root page table), `gstage_pte_index` then yields the index of the entry for the target address within the table at the current level, and that gives the page-table entry at that level.
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 42
+static inline unsigned long gstage_pte_index(gpa_t addr, u32 level)
+{
+	unsigned long mask;
+	unsigned long shift = HGATP_PAGE_SHIFT + (gstage_index_bits * level);
+
+	if (level == (gstage_pgd_levels - 1))
+		mask = (PTRS_PER_PTE * (1UL << gstage_pgd_xbits)) - 1;
+	else
+		mask = PTRS_PER_PTE - 1;
+
+	return (addr >> shift) & mask;
+}
+```
+
+Handling a page fault for an address means updating the corresponding TLB entries. For this, KVM implements `hfence.gvma`, one of the memory-management instructions of the H extension. The instruction takes two source operands, rs1 and rs2, which specify the guest physical address and the VMID of the VM that the address belongs to. How KVM implements the H-extension instructions is analyzed in a [later subsection][1].
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 126
+static void gstage_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)
+{
+	unsigned long order = PAGE_SHIFT;
+
+	if (gstage_level_to_page_order(level, &order))
+		return;
+	addr &= ~(BIT(order) - 1);
+
+	kvm_riscv_hfence_gvma_vmid_gpa(kvm, -1UL, 0, addr, BIT(order), order);
+}
+```
+
+### Updating HGATP
+
+The layout and function of `hgatp` are covered in the corresponding section of [the memory virtualization article][2]. Whereas `satp` and `vsatp` use an `ASID` field to hold the address-space identifier in the hypervisor/supervisor and in the guest, the corresponding field in `hgatp` is `VMID`, which holds the identifier of the virtual machine.
+
+The rest of this section looks at KVM's `hgatp` update function and its call sites to examine what this CSR does.
+
+`kvm_riscv_gstage_update_hgatp` is defined in `mmu.c`:
+
+```cpp
+// arch/riscv/kvm/mmu.c: line 748
+void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu)
+{
+	// Start from the MODE field of the selected G-Stage scheme (Sv32x4, Sv39x4, etc.)
+	unsigned long hgatp = gstage_mode;
+	// Arch-specific state of the VM this vCPU belongs to
+	struct kvm_arch *k = &vcpu->kvm->arch;
+
+	// Set the hgatp.VMID field
+	hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) &
+		 HGATP_VMID_MASK;
+	// Set the hgatp.PPN field
+	hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
+
+	// Write the new value to hgatp
+	csr_write(CSR_HGATP, hgatp);
+
+	// If VMIDs are not available (zero VMID bits), flush all G-Stage TLB entries with hfence.gvma
+	if (!kvm_riscv_gstage_vmid_bits())
+		kvm_riscv_local_hfence_gvma_all();
+}
+```
+
+On 64-bit machines the default G-Stage scheme is Sv39x4; on 32-bit machines it is Sv32x4.
+
+```cpp
+#ifdef CONFIG_64BIT
+static unsigned long gstage_mode = (HGATP_MODE_SV39X4 <<
+					HGATP_MODE_SHIFT);
+static unsigned long gstage_pgd_levels = 3;
+#define gstage_index_bits	9
+#else
+static unsigned long gstage_mode = (HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
+static unsigned long gstage_pgd_levels = 2;
+#define gstage_index_bits	10
+#endif
+```
+
+`hgatp` is updated on two paths: `kvm_arch_vcpu_create` -> `kvm_riscv_reset_vcpu` -> `kvm_arch_vcpu_load` -> `kvm_riscv_gstage_update_hgatp`, and `kvm_arch_vcpu_ioctl_run` -> `kvm_riscv_check_vcpu_requests` -> `kvm_riscv_gstage_update_hgatp`. In other words, it is initialized when a vCPU is created and updated when a running vCPU processes pending requests (sleep, reset, fence, update hgatp, etc.).
+
+```cpp
+// arch/riscv/kvm/vcpu.c: line 915
+int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
+{
+	int ret;
+
+	// ...
+
+	vcpu_load(vcpu);
+
+	kvm_sigset_activate(vcpu);
+
+	ret = 1;
+	run->exit_reason = KVM_EXIT_UNKNOWN;
+	/* Loop that processes vCPU requests */
+	while (ret > 0) {
+		// ...
+
+		/* Update the VMID (this may raise requests such as "update hgatp" for the VM's vCPUs) */
+		kvm_riscv_gstage_vmid_update(vcpu);
+		/* Process the requests pending on this vCPU */
+		kvm_riscv_check_vcpu_requests(vcpu);
+
+		// ...
+
+		ret = kvm_riscv_vcpu_exit(vcpu, run, &trap);
+	}
+
+	kvm_sigset_deactivate(vcpu);
+
+	vcpu_put(vcpu);
+
+	kvm_vcpu_srcu_read_unlock(vcpu);
+
+	return ret;
+}
+
+```
+
+The VMID update function is:
+
+```cpp
+// arch/riscv/kvm/vmid.c: line 71
+void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu)
+{
+	unsigned long i;
+	struct kvm_vcpu *v;
+	struct kvm_vmid *vmid = &vcpu->kvm->arch.vmid;
+
+	/* Return early if the VMID version is unchanged; otherwise update the VMID (flushing TLBs as needed) */
+	if (!kvm_riscv_gstage_vmid_ver_changed(vmid))
+		return;
+	spin_lock(&vmid_lock);
+	// ...
+	spin_unlock(&vmid_lock);
+
+	/* Request G-stage page table update for all VCPUs */
+	kvm_for_each_vcpu(i, v, vcpu->kvm)
+		kvm_make_request(KVM_REQ_UPDATE_HGATP, v);
+}
+```
+
+After the VMID and the resulting per-vCPU requests have been updated, `kvm_riscv_check_vcpu_requests` processes them:
+
+```cpp
+// arch/riscv/kvm/vcpu.c: line 848
+static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
+{
+	struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
+
+	if (kvm_request_pending(vcpu)) {
+		/* sleep, reset requests handling */
+		// ...
+
+		if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu))
+			kvm_riscv_gstage_update_hgatp(vcpu);
+
+		/* Memory management requests (fence.i, hfence.gvma, hfence.vvma, etc.) handling */
+		// ...
+	}
+}
+```
+
+### Implementation of the HFENCE instructions
+
+The memory-management instructions related to the H extension comprise the `HINVAL` and `HFENCE` instructions; their formats and functions are described in [this article][3]. KVM does not implement the `HINVAL` instructions in the version analyzed here, while the `HFENCE` instructions are implemented in `tlb.c` by calling `make_xfence_request`. Taking `hfence.gvma` as an example:
+
+```cpp
+// arch/riscv/kvm/tlb.c: line 388
+void kvm_riscv_hfence_gvma_vmid_gpa(struct kvm *kvm,
+				unsigned long hbase, unsigned long hmask,
+				gpa_t gpa, gpa_t gpsz,
+				unsigned long order)
+{
+	struct kvm_riscv_hfence data;
+
+	data.type = KVM_RISCV_HFENCE_GVMA_VMID_GPA;
+	data.asid = 0;
+	data.addr = gpa;
+	data.size = gpsz;
+	data.order = order;
+	make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE,
+				KVM_REQ_HFENCE_GVMA_VMID_ALL, &data);
+}
+```
+
+KVM's implementation falls roughly into four variants according to the use case: `all` covers every TLB entry selected by the given VMID or ASID, while `gpa` and `gva` further narrow the operation to the entries for a specific guest physical or guest virtual address.
+
+| Function prefix        | `gvma`/`vvma` | `vmid`/`asid` | Scope (`gpa`/`gva`/`all`) |
+|------------------------|---------------|---------------|---------------------------|
+| kvm_riscv_local_hfence | gvma          | vmid          | gpa                       |
+| kvm_riscv_local_hfence | gvma          | vmid          | all                       |
+| kvm_riscv_local_hfence | vvma          | asid          | gva                       |
+| kvm_riscv_local_hfence | vvma          | asid          | all                       |
+
+`make_xfence_request` is defined as follows:
+
+```cpp
+// arch/riscv/kvm/tlb.c: line 345
+static void make_xfence_request(struct kvm *kvm,
+				unsigned long hbase, unsigned long hmask,
+				unsigned int req, unsigned int fallback_req,
+				const struct kvm_riscv_hfence *data)
+{
+	// ...
+
+	/* Enqueue the hfence data on each targeted vCPU */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		// ...
+		if (!vcpu_hfence_enqueue(vcpu, data))
+			actual_req = fallback_req;
+	}
+
+	/*
+	 * If a queue is full and the data cannot be enqueued everywhere,
+	 * fallback_req is submitted instead. For an hfence.gvma with a
+	 * specific VMID and GPA, the fallback is
+	 * KVM_REQ_HFENCE_GVMA_VMID_ALL, which flushes every TLB entry of
+	 * that VMID, so a coarse flush still achieves the required effect
+	 * when a fine-grained one is not possible.
+	 */
+	kvm_make_vcpus_request_mask(kvm, actual_req, vcpu_mask);
+}
+```
+
+### Overview
+
+The call relationships across the whole process are shown in the diagram below (each arrow points from the callee to its caller):
+
+```mermaid
+flowchart LR
+
+subgraph arch/riscv/kvm/mmu.c
+
+alloc_pgd[kvm_riscv_gstage_alloc_pgd]
+free_pgd[kvm_riscv_gstage_free_pgd]
+
+unmap[gstage_unmap_range]-->free_pgd
+
+leaf[gstage_get_leaf_entry]-->free_pgd
+l2s[gstage_level_to_page_size]-->op_pte
+op_pte[gstage_op_pte]-->free_pgd
+
+flush[gstage_remote_tlb_flush]-->op_pte
+
+gmap[kvm_riscv_gstage_map]
+
+mode_dtct[kvm_riscv_gstage_mode_detect]
+mode[kvm_riscv_gstage_mode]
+
+update_hgatp[kvm_riscv_gstage_update_hgatp]
+
+gpa_bits[kvm_riscv_gstage_gpa_bits]
+
+flush_shadow[kvm_arch_flush_shadow_all]
+
+pgva[gstage_pte_page_vaddr]-->leaf
+
+set_pte[gstage_set_pte]-->mappg[gstage_map_page]
+set_pte-->ioremap[kvm_riscv_gstage_ioremap]
+
+mappg-->set_gfn[kvm_set_spte_gfn]
+mappg-->gmap[kvm_riscv_gstage_map]
+
+end
+
+subgraph arch/riscv/kvm/vmid.c
+vmid_init[kvm_riscv_gstage_vmid_init]
+bits[kvm_riscv_gstage_vmid_bits]
+vmid_dtct[kvm_riscv_gstage_vmid_detect]
+vmid_change[kvm_riscv_gstage_vmid_ver_changed]
+update[kvm_riscv_gstage_vmid_update]
+end
+
+subgraph arch/riscv/kvm/vm.c
+ivm[kvm_arch_init_vm]
+check_ext[kvm_vm_ioctl_check_extension]
+end
+
+subgraph virt/kvm/kvm_main.c
+cvm[kvm_create_vm]
+mem[kvm_mmu_topup_memory_cache]
+dev_vm[kvm_dev_ioctl_create_vm]
+dev[kvm_dev_ioctl]
+
+init[kvm_init]
+
+set_gfn-->mn_pte[kvm_mmu_notifier_change_pte]
+flush_shadow-->flush_all[kvm_flush_shadow_all]
+
+cvm-->dev_vm
+dev_vm-->dev
+
+end
+
+dev-->external_call[kvmtool, etc.]
+
+subgraph arch/riscv/kvm/vcpu_exit.c
+pgfault[gstage_page_fault]
+end
+
+subgraph arch/riscv/kvm/main.c
+rkinit[riscv_kvm_init]
+archi[kvm_arch_init]
+end
+
+archi-->init-->rkinit
+
+subgraph arch/riscv/kvm/tlb.c
+tlb_sntz[kvm_riscv_local_tlb_sanitize]
+end
+
+subgraph arch/riscv/kvm/vcpu.c
+run[kvm_arch_vcpu_ioctl_run]
+load[kvm_arch_vcpu_load]
+check_vcpu_req[kvm_riscv_check_vcpu_requests]
+end
+
+alloc_pgd-->ivm
+vmid_init-->ivm
+free_pgd-->ivm
+
+ivm-->cvm
+
+gmap-->pgfault
+
+mode_dtct-->archi
+vmid_dtct-->archi
+mode-->archi
+
+bits-->archi
+bits-->update_hgatp
+bits-->tlb_sntz
+
+vmid_change-->run
+update-->run
+
+gpa_bits-->check_ext
+
+update_hgatp-->load
+update_hgatp-->check_vcpu_req
+
+mem-->gmap
+
+```
+
+([Download the PNG generated by Mermaid][009])
+
+## kvmtool
+
+The way kvmtool allocates memory for the guest involves few architecture-specific virtualization details. The function that initializes VM memory in kvmtool is:
+
+```cpp
+// riscv/kvm.c: line 64
+void kvm__arch_init(struct kvm *kvm)
+{
+	/*
+	 * Allocate guest memory. The buffer is aligned to 64K; if THP
+	 * (Transparent Huge Page) is used, the alignment becomes 2M.
+	 */
+
+	/* Determine the size and start address of guest memory */
+	kvm->ram_size = min(kvm->cfg.ram_size, (u64)RISCV_MAX_MEMORY(kvm));
+	kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
+	kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm,
+						kvm->cfg.hugetlbfs_path,
+						kvm->arch.ram_alloc_size);
+	if (kvm->arch.ram_alloc_start == MAP_FAILED)
+		die("Failed to map %lld bytes for guest memory (%d)",
+		    kvm->arch.ram_alloc_size, errno);
+	kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start,
+					SZ_2M);
+
+	/* Hint that the region is mergeable (MADV_MERGEABLE) and may be backed by huge pages (MADV_HUGEPAGE) */
+	madvise(kvm->arch.ram_alloc_start, kvm->arch.ram_alloc_size,
+		MADV_MERGEABLE);
+	madvise(kvm->arch.ram_alloc_start, kvm->arch.ram_alloc_size,
+		MADV_HUGEPAGE);
+}
+```
+
+## Conclusion
+
+This article analyzed how KVM implements RISC-V G-Stage address translation: allocating memory for the guest, handling G-Stage page faults, managing memory with the HFENCE instructions, updating the HGATP register, and freeing VM memory. It can serve as a reference for software implementations of RISC-V virtualization.
+
+## References
+
+- [RISC-V Privileged ISA Specification][4]
+- [RISC-V Linux][5]
+- [kvmtool][6]
+
+[1]: 20221011-riscv-kvm-mem-virt-impl.md#h-扩展特殊指令的实现
+[2]: 20220812-riscv-kvm-mem-virt-1.md#hgatp
+[3]: 20220812-riscv-kvm-mem-virt-2.md
+[4]: https://riscv.org/technical/specifications/privileged-isa/
+[5]: https://gitee.com/tinylab/riscv-linux
+[6]: https://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git
+[007]: https://mp.weixin.qq.com/s/nlMGEhuaDUYqV6r8A4cRlA
+[008]: images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-1.png
+[009]: images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-2.png
diff --git a/articles/images/riscv-kvm/vm-impl/gstage-at.png b/articles/images/riscv-kvm/vm-impl/gstage-at.png
new file mode 100644
index 0000000000000000000000000000000000000000..2d438ed97c9768e5023daa1d62f913a7fb31561a
Binary files /dev/null and b/articles/images/riscv-kvm/vm-impl/gstage-at.png differ
diff --git a/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-1.png b/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..468dfa2d388796496ce427ca9a162cb20df9eb23
Binary files /dev/null and b/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-1.png differ
diff --git a/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-2.png b/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-2.png
new file mode 100644
index 0000000000000000000000000000000000000000..6e0578448a73209d03faac04f91d3549af9fef98
Binary files /dev/null and b/articles/images/riscv-kvm/vm-impl/mermaid-riscv-kvm-mem-virt-impl-2.png differ