diff --git a/articles/20230916-qemu-system-machine-create.md b/articles/20230916-qemu-system-machine-create.md
new file mode 100644
index 0000000000000000000000000000000000000000..9830a5f7c63768f19752a9957e4f9b609bd7949a
--- /dev/null
+++ b/articles/20230916-qemu-system-machine-create.md
@@ -0,0 +1,486 @@
+> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc2 - [refs]
+> Author: jl-jiang
+> Date: 2023/09/16
+> Revisor: Bin Meng
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [【老师提案】QEMU 系统模拟模式分析](https://gitee.com/tinylab/riscv-linux/issues/I61KIY)
+> Sponsor: PLCT Lab, ISCAS
+
+# A Brief Analysis of QEMU Machine Creation and Initialization
+
+## Introduction
+
+QEMU is a general-purpose, open-source emulator that virtualizes hardware purely in software. It presents an abstract, virtual hardware environment to the user, can emulate a variety of hardware architectures, and runs many operating systems. QEMU initialization is a complex process carried out by the `qemu_init` function; it mainly consists of argument parsing, main loop initialization, machine creation, and device initialization. Taking QEMU 8.0.0 for RISC-V (qemu-system-riscv64) as the example, this article analyzes how the virt machine is created and initialized.
+
+This article uses the following command to create a virt machine in QEMU and boot a RISC-V Linux 5.18 kernel:
+
+```shell
+qemu-system-riscv64 -M virt -m 256M -nographic \
+    -kernel linux-kernel/arch/riscv/boot/Image \
+    -drive file=rootfs.img,format=raw,id=hd0 \
+    -device virtio-blk-device,drive=hd0 \
+    -append "root=/dev/vda rw console=ttyS0"
+```
+
+Machine creation is done by `qemu_create_machine`, while initialization of the machine's parameters is done by `qemu_apply_legacy_machine_options` and `qemu_apply_machine_options`. If the `--preconfig` option is not given on the command line, `qmp_x_exit_preconfig` also takes on part of the machine initialization work. Once the machine's parameters have been set, QEMU calls the init method of the machine type to initialize the machine instance.
+
+The basic flow of QEMU machine creation and initialization is shown in the figure below:
+
+
+
+## Machine Creation
+
+After parsing the command-line arguments and initializing the main loop, QEMU calls `qemu_create_machine` to create the machine of the type specified on the command line:
+
+```c
+/* softmmu/vl.c: 2011 */
+
+static void qemu_create_machine(QDict *qdict)
+{
+    MachineClass *machine_class = select_machine(qdict, &error_fatal);
+    object_set_machine_compat_props(machine_class->compat_props);
+
+    current_machine = MACHINE(object_new_with_class(OBJECT_CLASS(machine_class)));
+    object_property_add_child(object_get_root(), "machine",
+                              OBJECT(current_machine));
+    object_property_add_child(container_get(OBJECT(current_machine),
+                                            "/unattached"),
+                              "sysbus", OBJECT(sysbus_get_default()));
+
+    if (machine_class->minimum_page_bits) {
+        if (!set_preferred_target_page_bits(machine_class->minimum_page_bits)) {
+            /* This would be a board error: specifying a minimum smaller than
+             * a target's compile-time fixed setting.
+             */
+            g_assert_not_reached();
+        }
+    }
+
+    cpu_exec_init_all();
+    page_size_init();
+
+    if (machine_class->hw_version) {
+        qemu_set_hw_version(machine_class->hw_version);
+    }
+
+    /*
+     * Get the default machine options from the machine if it is not already
+     * specified either by the configuration file or by the command line.
+     */
+    if (machine_class->default_machine_opts) {
+        QDict *default_opts =
+            keyval_parse(machine_class->default_machine_opts, NULL, NULL,
+                         &error_abort);
+        qemu_apply_legacy_machine_options(default_opts);
+        object_set_properties_from_keyval(OBJECT(current_machine), default_opts,
+                                          false, &error_abort);
+        qobject_unref(default_opts);
+    }
+}
+```
+
+`qemu_create_machine` first calls `select_machine` to check whether the machine type specified on the command line is in the list of machines QEMU supports:
+
+```c
+/* softmmu/vl.c: 1577 */
+
+static MachineClass *select_machine(QDict *qdict, Error **errp)
+{
+    const char *optarg = qdict_get_try_str(qdict, "type");
+    GSList *machines = object_class_get_list(TYPE_MACHINE, false);
+    MachineClass *machine_class;
+    Error *local_err = NULL;
+
+    if (optarg) {
+        machine_class = find_machine(optarg, machines);
+        qdict_del(qdict, "type");
+        if (!machine_class) {
+            error_setg(&local_err, "unsupported machine type");
+        }
+    } else {
+        machine_class = find_default_machine(machines);
+        if (!machine_class) {
+            error_setg(&local_err, "No machine specified, and there is no default");
+        }
+    }
+
+    g_slist_free(machines);
+    if (local_err) {
+        error_append_hint(&local_err, "Use -machine help to list supported machines\n");
+        error_propagate(errp, local_err);
+    }
+    return machine_class;
+}
+```
+
+`select_machine` first calls `object_class_get_list` to collect every concrete (non-abstract) machine class derived from `TYPE_MACHINE` and stores them in the linked list `machines`:
+
+```c
+/* qom/object.c: 1155 */
+
+GSList *object_class_get_list(const char *implements_type,
+                              bool include_abstract)
+{
+    GSList *list = NULL;
+
+    object_class_foreach(object_class_get_list_tramp,
+                         implements_type, include_abstract, &list);
+    return list;
+}
+```
+
+When the command line names a machine type (`optarg` is non-NULL), `select_machine` calls `find_machine` to search `machines` for a class whose name or alias matches and returns it; if nothing matches, it reports "unsupported machine type". When no type is given, it falls back to `find_default_machine`:
+
+```c
+/* softmmu/vl.c: 799 */
+
+static MachineClass *find_machine(const char *name, GSList *machines)
+{
+    GSList *el;
+
+    for (el = machines; el; el = el->next) {
+        MachineClass *mc = el->data;
+
+        if (!strcmp(mc->name, name) || !g_strcmp0(mc->alias, name)) {
+            return mc;
+        }
+    }
+
+    return NULL;
+}
+```
+
+Next, QEMU calls `object_set_machine_compat_props` to set the virt machine's global (compat) properties, then creates an instance of the selected machine class, `current_machine`, which is a pointer of type `MachineState`:
+
+```c
+/* include/hw/boards.h: 324 */
+
+struct MachineState {
+    /* < private > */
+    Object parent_obj;
+
+    /* < public > */
+
+    void *fdt;
+    char *dtb;
+    char *dumpdtb;
+    int phandle_start;
+    char *dt_compatible;
+    bool dump_guest_core;
+    bool mem_merge;
+    bool usb;
+    bool usb_disabled;
+    char *firmware;
+    bool iommu;
+    bool suppress_vmdesc;
+    bool enable_graphics;
+    ConfidentialGuestSupport *cgs;
+    HostMemoryBackend *memdev;
+    /*
+     * convenience alias to ram_memdev_id backend memory region
+     * or to numa container memory region
+     */
+    MemoryRegion *ram;
+    DeviceMemoryState *device_memory;
+
+    ram_addr_t ram_size;
+    ram_addr_t maxram_size;
+    uint64_t ram_slots;
+    BootConfiguration boot_config;
+    char *kernel_filename;
+    char *kernel_cmdline;
+    char *initrd_filename;
+    const char *cpu_type;
+    AccelState *accelerator;
+    CPUArchIdList *possible_cpus;
+    CpuTopology smp;
+    struct NVDIMMState *nvdimms_state;
+    struct NumaState *numa_state;
+};
+```
+
+QEMU uses the `MachineState` structure to describe a machine's parameters and runtime state. Here `current_machine` describes the newly created virt machine, so initializing virt amounts to operating on the member variables of `current_machine`. After instantiating the machine, `qemu_create_machine` calls `object_property_add_child` to add the new instance to QEMU's object tree and to attach the default system bus under the machine's `/unattached` container. QEMU then uses `cpu_exec_init_all`, `page_size_init`, and related functions to initialize the runtime environment, including the I/O space, memory mappings, and page size. Finally, it applies the machine class's default machine options, if any, which completes machine creation, and `qemu_create_machine` returns.
+
+## Machine Parameter Initialization
+
+Machine parameter initialization sets the parameters of the newly created machine instance from the results of command-line parsing. The work is done by `qemu_apply_legacy_machine_options` and `qemu_apply_machine_options`. `qemu_apply_legacy_machine_options` handles the options that do not correspond to properties in the `MachineState` structure, mainly command-line options such as `accel`, `kernel-irqchip`, and `memory-backend`:
+
+```c
+/* softmmu/vl.c: 1649 */
+
+static void qemu_apply_legacy_machine_options(QDict *qdict)
+{
+    const char *value;
+    QObject *prop;
+
+    keyval_dashify(qdict, &error_fatal);
+
+    /* Legacy options do not correspond to MachineState properties.  */
+    ...
+
+    prop = qdict_get(qdict, "memory");
+    if (prop) {
+        have_custom_ram_size =
+            qobject_type(prop) == QTYPE_QDICT &&
+            qdict_haskey(qobject_to(QDict, prop), "size");
+    }
+}
+```
+
+Take the QEMU command at the start of this article as an example: it specifies the machine's memory size with `-m 256M`, so while processing the options `qemu_apply_legacy_machine_options` sets the boolean variable `have_custom_ram_size` to `true`. Later, when `qemu_resolve_machine_memdev` resolves and sets up the virtual machine's memory backend, it uses this variable to tell whether the user specified a custom RAM size.
+
+The main purpose of `qemu_apply_machine_options` is to apply the machine options produced by argument parsing to the virtual machine being created:
+
+```c
+/* softmmu/vl.c: 1845 */
+
+static void qemu_apply_machine_options(QDict *qdict)
+{
+    object_set_properties_from_keyval(OBJECT(current_machine), qdict, false, &error_fatal);
+
+    if (semihosting_enabled(false) && !semihosting_get_argc()) {
+        /* fall back to the -kernel/-append */
+        semihosting_arg_fallback(current_machine->kernel_filename, current_machine->kernel_cmdline);
+    }
+
+    if (current_machine->smp.cpus > 1) {
+        replay_add_blocker("smp");
+    }
+}
+```
+
+`qemu_apply_machine_options` calls `object_set_properties_from_keyval` to set the key-value pairs in `qdict` as properties of the `current_machine` object. Before returning, it handles two cases: if semihosting is enabled without arguments of its own, it falls back to the `-kernel`/`-append` values, and if more than one CPU is configured, it registers the replay blocker `smp`.
+
+## Machine Instance Initialization
+
+If the `--preconfig` option is enabled on the command line, QEMU pauses before it finishes creating the initial virtual machine, enters an interactive configuration state, and allows some configuration through the [QEMU Machine Protocol][003] (QMP). This article does not set `--preconfig`, so toward the end of initialization `qemu_init` calls `qmp_x_exit_preconfig` so that QEMU does not pause:
+
+```c
+/* softmmu/vl.c: 2602 */
+
+void qmp_x_exit_preconfig(Error **errp)
+{
+    if (phase_check(PHASE_MACHINE_INITIALIZED)) {
+        error_setg(errp, "The command is permitted only before machine initialization");
+        return;
+    }
+
+    qemu_init_board();
+    qemu_create_cli_devices();
+    qemu_machine_creation_done();
+
+    if (loadvm) {
+        load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
+    }
+    if (replay_mode != REPLAY_MODE_NONE) {
+        replay_vmstate_init();
+    }
+
+    if (incoming) {
+        Error *local_err = NULL;
+        if (strcmp(incoming, "defer") != 0) {
+            qmp_migrate_incoming(incoming, &local_err);
+            if (local_err) {
+                error_reportf_err(local_err, "-incoming %s: ", incoming);
+                exit(1);
+            }
+        }
+    } else if (autostart) {
+        qmp_cont(NULL);
+    }
+}
+```
+
+Here `qemu_init_board` goes on to call `machine_run_board_init`, which performs part of the initialization of the machine instance (the board): setting up memory, checking the CPU type, initializing the accelerator interfaces, and other related settings:
+
+```c
+/* hw/core/machine.c: 1307 */
+
+void machine_run_board_init(MachineState *machine, const char *mem_path, Error **errp)
+{
+    MachineClass *machine_class = MACHINE_GET_CLASS(machine);
+    ObjectClass *oc = object_class_by_name(machine->cpu_type);
+    CPUClass *cc;
+
+    /* This checkpoint is required by replay to separate prior clock
+       reading from the other reads, because timer polling functions query
+       clock values from the log.
+     */
+    replay_checkpoint(CHECKPOINT_INIT);
+
+    if (!xen_enabled()) {
+        /* On 32-bit hosts, QEMU is limited by virtual address space */
+        if (machine->ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
+            error_setg(errp, "at most 2047 MB RAM can be simulated");
+            return;
+        }
+    }
+
+    if (machine->memdev) {
+        ram_addr_t backend_size = object_property_get_uint(OBJECT(machine->memdev),
+                                                           "size", &error_abort);
+        if (backend_size != machine->ram_size) {
+            error_setg(errp, "Machine memory size does not match the size of the memory backend");
+            return;
+        }
+    } else if (machine_class->default_ram_id && machine->ram_size &&
+               numa_uses_legacy_mem()) {
+        if (!create_default_memdev(current_machine, mem_path, errp)) {
+            return;
+        }
+    }
+
+    if (machine->numa_state) {
+        numa_complete_configuration(machine);
+        if (machine->numa_state->num_nodes) {
+            machine_numa_finish_cpu_init(machine);
+        }
+    }
+
+    if (!machine->ram && machine->memdev) {
+        machine->ram = machine_consume_memdev(machine, machine->memdev);
+    }
+
+    /* If the machine supports the valid_cpu_types check and the user
+     * specified a CPU with -cpu check here that the user CPU is supported.
+     */
+    if (machine_class->valid_cpu_types && machine->cpu_type) {
+        int i;
+
+        for (i = 0; machine_class->valid_cpu_types[i]; i++) {
+            if (object_class_dynamic_cast(oc,
+                                          machine_class->valid_cpu_types[i])) {
+                /* The user specificed CPU is in the valid field, we are
+                 * good to go.
+                 */
+                break;
+            }
+        }
+
+        if (!machine_class->valid_cpu_types[i]) {
+            /* The user specified CPU is not valid */
+            error_report("Invalid CPU type: %s", machine->cpu_type);
+            error_printf("The valid types are: %s",
+                         machine_class->valid_cpu_types[0]);
+            for (i = 1; machine_class->valid_cpu_types[i]; i++) {
+                error_printf(", %s", machine_class->valid_cpu_types[i]);
+            }
+            error_printf("\n");
+
+            exit(1);
+        }
+    }
+
+    /* Check if CPU type is deprecated and warn if so */
+    cc = CPU_CLASS(oc);
+    if (cc && cc->deprecation_note) {
+        warn_report("CPU model %s is deprecated -- %s", machine->cpu_type,
+                    cc->deprecation_note);
+    }
+
+    if (machine->cgs) {
+        /*
+         * With confidential guests, the host can't see the real
+         * contents of RAM, so there's no point in it trying to merge
+         * areas.
+         */
+        machine_set_mem_merge(OBJECT(machine), false, &error_abort);
+
+        /*
+         * Virtio devices can't count on directly accessing guest
+         * memory, so they need iommu_platform=on to use normal DMA
+         * mechanisms.  That requires also disabling legacy virtio
+         * support for those virtio pci devices which allow it.
+         */
+        object_register_sugar_prop(TYPE_VIRTIO_PCI, "disable-legacy",
+                                   "on", true);
+        object_register_sugar_prop(TYPE_VIRTIO_DEVICE, "iommu_platform",
+                                   "on", false);
+    }
+
+    accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator));
+    machine_class->init(machine);
+    phase_advance(PHASE_MACHINE_INITIALIZED);
+}
+```
+
+The step to focus on in the code above is the call through `machine_class->init`, which invokes the machine type's init method on the machine instance `machine`. Dynamic debugging shows that the function actually called here is `virt_machine_init`:
+
+```c
+/* hw/riscv/virt.c: 1332 */
+
+static void virt_machine_init(MachineState *machine)
+{
+    const MemMapEntry *memmap = virt_memmap;
+    RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
+    MemoryRegion *system_memory = get_system_memory();
+    MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
+    char *soc_name;
+    DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
+    int i, base_hartid, hart_count;
+    int socket_count = riscv_socket_count(machine);
+
+    /* Check socket count limit */
+    ...
+
+    /* Initialize sockets */
+    ...
+
+    /* register system main memory (actual RAM) */
+    memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base,
+        machine->ram);
+
+    /* boot rom */
+    memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
+                           memmap[VIRT_MROM].size, &error_fatal);
+    memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base,
+                                mask_rom);
+
+    /*
+     * Init fw_cfg.  Must be done before riscv_load_fdt, otherwise the
+     * device tree cannot be altered and we get FDT_ERR_NOSPACE.
+     */
+    s->fw_cfg = create_fw_cfg(machine);
+    rom_set_fw(s->fw_cfg);
+
+    /* SiFive Test MMIO device */
+    sifive_test_create(memmap[VIRT_TEST].base);
+
+    /* VirtIO MMIO devices */
+    ...
+
+    /* load/create device tree */
+    if (machine->dtb) {
+        machine->fdt = load_device_tree(machine->dtb, &s->fdt_size);
+        if (!machine->fdt) {
+            error_report("load_device_tree() failed");
+            exit(1);
+        }
+    } else {
+        create_fdt(s, memmap);
+    }
+
+    s->machine_done.notify = virt_machine_done;
+    qemu_add_machine_init_done_notifier(&s->machine_done);
+}
+```
+
+`virt_machine_init` is the init function of the RISC-V virt machine in QEMU. It first checks the socket count, initializes the sockets, and creates an interrupt controller for each socket. It then sets the PCIe memory-map parameters depending on whether the target is 32-bit or 64-bit, registers the system main memory, and initializes the boot ROM. After that it initializes the virt machine's devices, including the VirtIO MMIO devices, the PCIe device, the platform bus, the serial device, the RTC, and the flash devices. Finally, it loads or creates the device tree and registers the `virt_machine_done` notifier, which QEMU invokes once machine (board) initialization is done.
+
+Once all machine initialization work is complete, `qemu_machine_creation_done` performs a series of follow-up checks and state synchronization, signalling that machine initialization has finished and passed its checks, so execution can move on to the next stage.
+
+## Summary
+
+This article walked through the basic flow of QEMU machine creation and initialization. Using the RISC-V virt platform as the example, it covered three stages, machine type selection, machine parameter initialization, and machine instance initialization, and traced the functions that implement each one to show how the virtual machine creation flow in QEMU fits together.
+
+## References
+
+- 《QEMU/KVM 源码解析与应用》, Li Qiang, China Machine Press
+- [QEMU 启动方式分析(3): QEMU 代码与 RISCV 'virt' 平台 ZSBL 分析][001]
+- [QEMU RISC-V virt 平台分析][002]
+- [QEMU QMP Reference Manual][003]
+- [‘virt’ Generic Virtual Platform][004]
+
+[001]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220911-qemu-riscv-zsbl.md
+[002]: https://juejin.cn/post/6891922292075397127
+[003]: https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html
+[004]: https://www.qemu.org/docs/master/system/riscv/virt.html
diff --git a/articles/images/qemu-system-machine-create/machine-init.svg b/articles/images/qemu-system-machine-create/machine-init.svg
new file mode 100644
index 0000000000000000000000000000000000000000..785ee136c206162ebc3cb629c1686e93484b82d9
--- /dev/null
+++ b/articles/images/qemu-system-machine-create/machine-init.svg
@@ -0,0 +1,4 @@
+
+
+
+[Figure: machine-init.svg, flowchart of the QEMU init process; SVG markup not recoverable. Node labels in order of appearance: qemu_create_machine, SELECT MACHINE TYPE, select_machine, object_class_get_list, find_machine, INIT MACHINE, object_set_machine_compat_props, object_property_add_child, cpu_exec_init_all, page_size_init, MACHINE, INIT BOARD, qemu_init_board, machine_run_board_init, qemu_create_cli_devices, qemu_machine_creation_done, QEMU INIT PROCESS, qmp_x_exit_preconfig, qemu_apply_legacy_machine_options, qemu_apply_machine_options, FINISH MACHINE CREATION.]
\ No newline at end of file