diff --git a/articles/20230901-riscv-isa-discovery-5-linux.md b/articles/20230901-riscv-isa-discovery-5-linux.md new file mode 100644 index 0000000000000000000000000000000000000000..745ddafd82c6f2ac1eeb75191fe847f18eacb203 --- /dev/null +++ b/articles/20230901-riscv-isa-discovery-5-linux.md @@ -0,0 +1,828 @@ +> Author: YJMSTR [jay1273062855@outlook.com](mailto:jay1273062855@outlook.com)
+> Date: 2023/09/01
+> Revisor: Bin Meng, Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
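+> Sponsor: ISCAS
+
+# Linux RISC-V ISA Extension Support
+
+## Preface
+
+This is the fifth article in the series on software and hardware support for RISC-V extensions. It describes how the Linux kernel detects and supports RISC-V extensions. Before reading it, we recommend the first article in the series, [Current RISC-V ISA Extension Categories and Detection Methods][003], which explains how RISC-V ISA extensions are currently named, classified, and detected in hardware.
+
+RISC-V originally used a CSR named misa to detect ISA extensions. misa allocates 26 bits, each indicating whether a given extension or privilege mode is present, but as the number of extensions grew those bits were no longer enough. RISC-V later introduced a CSR named mconfigptr, which holds the address of a data structure describing the hardware; firmware can use that structure to generate SMBIOS, device tree, or ACPI data.
+
+The ISA information stored in SMBIOS covers only what misa covers, so it is still insufficient. The device tree describes the combination of ISA extensions with an ISA string, while ACPI provides the RHCT (RISC-V Hart Capabilities Table), which also contains an ISA string node.
+
+PLCT carried out a related [survey][008] in 2022. At that time, user space could obtain ISA extension information from the Linux kernel through cpuinfo, environment variables, HWCAP, SIGILL, and similar means.
+
+[The previous article in this series][007] described how OpenSBI detects RISC-V ISA extensions. SBI runs in M-mode and can read the relevant CSRs directly to obtain extension information.
+
+## Getting the Linux Source
+
+The Linux kernel repository is large. Fetching into an initialized repository with `git fetch` lets an interrupted download be retried in place, instead of losing a whole `git clone` to a network failure.
+
+```sh
+$ mkdir linux-kernel
+$ cd linux-kernel
+$ git init
+$ git fetch https://gitee.com/mirrors/linux_old1.git
+$ git checkout FETCH_HEAD
+$ git remote add origin https://gitee.com/mirrors/linux_old1.git
+# HEAD is detached at this point, so fetch (rather than pull) the remaining refs and tags
+$ git fetch origin
+```
+
+`git checkout` prints `HEAD is now at 2dde18cd1d8f Linux 6.5`; this article is based on that version. To get a specific version of the kernel source, download it from [kernel.org][001], or switch to it with `git checkout` after `git fetch origin`.
+
+## Building and Booting the Kernel
+
+Building the kernel requires a RISC-V cross-compilation toolchain. Earlier articles in this series covered it, so it is not repeated here:
+
+```sh
+$ make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- defconfig
+$ make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- -j $(nproc)
+```
+
+Embedded systems commonly use busybox to build the root filesystem. First download and build busybox:
+
+```sh
+$ git clone https://gitee.com/mirrors/busyboxsource
+$ cd busyboxsource
+$ export CROSS_COMPILE=riscv64-linux-gnu-
+$ make defconfig
+$ make menuconfig
+# Enable "Build static binary (no shared libs)" under Settings-->Build Options here
+$ make -j $(nproc)
+$ make install
+```
+
+Create the filesystem image and add a startup script:
+
+```sh
+$ cd ~
+$ qemu-img create rootfs.img 1g
+$ mkfs.ext4 rootfs.img
+$ mkdir rootfs
+$ sudo mount -o loop rootfs.img rootfs
+$ cd rootfs
+$ sudo cp -r ../busyboxsource/_install/* .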
+$ sudo mkdir proc sys dev etc etc/init.d
+$ cd etc/init.d/
+$ sudo touch rcS
+$ sudo vi rcS
+```
+
+Edit the startup script rcS so that it contains:
+
+```sh
+#!/bin/sh
+mount -t proc none /proc
+mount -t sysfs none /sys
+/sbin/mdev -s
+```
+
+Then make the script executable and unmount the image:
+
+```sh
+$ sudo chmod +x rcS
+$ cd ~
+$ sudo umount rootfs
+```
+
+Next, boot the kernel directly:
+
+```sh
+$ qemu-system-riscv64 -M virt -m 256M -nographic -kernel linux-kernel/arch/riscv/boot/Image -drive file=rootfs.img,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -append "root=/dev/vda rw console=ttyS0"
+```
+
+This produces the following Linux boot log:
+
+```sh
+$ qemu-system-riscv64 -M virt -m 256M -nographic -kernel linux-kernel/arch/riscv/boot/Image -drive file=rootfs.img,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -append "root=/dev/vda rw console=ttyS0"
+
+OpenSBI v1.3.1
+   ____                    _____ ____ _____
+  / __ \                  / ____|  _ \_   _|
+ | |  | |_ __   ___ _ __ | (___ | |_) || |
+ | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
+ | |__| | |_) |  __/ | | |____) | |_) || |_
+  \____/| .__/ \___|_| |_|_____/|___/_____|
+        | |
+        |_|
+
+Platform Name : riscv-virtio,qemu
+Platform Features : medeleg
+Platform HART Count : 1
+Platform IPI Device : aclint-mswi
+Platform Timer Device : aclint-mtimer @ 10000000Hz
+Platform Console Device : uart8250
+Platform HSM Device : ---
+Platform PMU Device : ---
+Platform Reboot Device : sifive_test
+Platform Shutdown Device : sifive_test
+Platform Suspend Device : ---
+Platform CPPC Device : ---
+Firmware Base : 0x80000000
+Firmware Size : 194 KB
+Firmware RW Offset : 0x20000
+Firmware RW Size : 66 KB
+Firmware Heap Offset : 0x28000
+Firmware Heap Size : 34 KB (total), 2 KB (reserved), 9 KB (used), 22 KB (free)
+Firmware Scratch Size : 4096 B (total), 760 B (used), 3336 B (free)
+Runtime SBI Version : 1.0
+
+Domain0 Name : root
+Domain0 Boot HART : 0
+Domain0 HARTs : 0*
+Domain0 Region00 : 0x0000000002000000-0x000000000200ffff M: (I,R,W) S/U: ()
+Domain0 Region01 : 0x0000000080000000-0x000000008001ffff M: (R,X) S/U: ()
+Domain0 Region02 :
0x0000000080020000-0x000000008003ffff M: (R,W) S/U: ()
+Domain0 Region03 : 0x0000000000000000-0xffffffffffffffff M: (R,W,X) S/U: (R,W,X)
+Domain0 Next Address : 0x0000000080200000
+Domain0 Next Arg1 : 0x000000008fe00000
+Domain0 Next Mode : S-mode
+Domain0 SysReset : yes
+Domain0 SysSuspend : yes
+
+Boot HART ID : 0
+Boot HART Domain : root
+Boot HART Priv Version : v1.12
+Boot HART Base ISA : rv64imafdch
+Boot HART ISA Extensions : time,sstc
+Boot HART PMP Count : 16
+Boot HART PMP Granularity : 4
+Boot HART PMP Address Bits: 54
+Boot HART MHPM Count : 16
+Boot HART MIDELEG : 0x0000000000001666
+Boot HART MEDELEG : 0x0000000000f0b509
+[ 0.000000] Linux version 6.5.0 (mint@linux-lab-host) (riscv64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1 SMP Wed Aug 30 17:20:35 CST 2023
+[ 0.000000] random: crng init done
+[ 0.000000] Machine model: riscv-virtio,qemu
+[ 0.000000] SBI specification v1.0 detected
+[ 0.000000] SBI implementation ID=0x1 Version=0x10003
+[ 0.000000] SBI TIME extension detected
+[ 0.000000] SBI IPI extension detected
+[ 0.000000] SBI RFENCE extension detected
+[ 0.000000] SBI SRST extension detected
+[ 0.000000] efi: UEFI not found.
+[ 0.000000] OF: reserved mem: 0x0000000080000000..0x000000008001ffff (128 KiB) nomap non-reusable mmode_resv0@80000000
+[ 0.000000] OF: reserved mem: 0x0000000080020000..0x000000008003ffff (128 KiB) nomap non-reusable mmode_resv1@80020000
+[ 0.000000] Zone ranges:
+[ 0.000000] DMA32 [mem 0x0000000080000000-0x000000008fffffff]
+[ 0.000000] Normal empty
+[ 0.000000] Movable zone start for each node
+[ 0.000000] Early memory node ranges
+[ 0.000000] node 0: [mem 0x0000000080000000-0x000000008003ffff]
+[ 0.000000] node 0: [mem 0x0000000080040000-0x000000008fffffff]
+[ 0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000008fffffff]
+[ 0.000000] SBI HSM extension detected
+[ 0.000000] riscv: base ISA extensions acdfhim
+[ 0.000000] riscv: ELF capabilities acdfim
+[ 0.000000] percpu: Embedded 19 pages/cpu s40888 r8192 d28744 u77824
+[ 0.000000] Kernel command line: root=/dev/vda rw console=ttyS0
+[ 0.000000] Dentry cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
+[ 0.000000] Inode-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
+[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64512
+[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
+[ 0.000000] Virtual kernel memory layout:
+[ 0.000000] fixmap : 0xff1bfffffea00000 - 0xff1bffffff000000 (6144 kB)
+[ 0.000000] pci io : 0xff1bffffff000000 - 0xff1c000000000000 ( 16 MB)
+[ 0.000000] vmemmap : 0xff1c000000000000 - 0xff20000000000000 (1024 TB)
+[ 0.000000] vmalloc : 0xff20000000000000 - 0xff60000000000000 (16384 TB)
+[ 0.000000] modules : 0xffffffff0157b000 - 0xffffffff80000000 (2026 MB)
+[ 0.000000] lowmem : 0xff60000000000000 - 0xff60000010000000 ( 256 MB)
+[ 0.000000] kernel : 0xffffffff80000000 - 0xffffffffffffffff (2047 MB)
+[ 0.000000] Memory: 218332K/262144K available (8728K kernel code, 4974K rwdata, 4096K rodata, 2200K init, 482K bss, 43812K reserved, 0K cma-reserved)
+[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
+[ 0.000000] rcu: Hierarchical RCU implementation.
+[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=1.
+[ 0.000000] rcu: RCU debug extended QS entry/exit.
+[ 0.000000] Tracing variant of Tasks RCU enabled.
+[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
+[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
+[ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
+[ 0.000000] riscv-intc: 64 local interrupts mapped
+[ 0.000000] plic: plic@c000000: mapped 95 interrupts with 1 handlers for 2 contexts.
+[ 0.000000] riscv: providing IPIs using SBI IPI extension
+[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
+[ 0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
+[ 0.000073] sched_clock: 64 bits at 10MHz, resolution 100ns, wraps every 4398046511100ns
+[ 0.000182] riscv-timer: Timer interrupt in S-mode is available via sstc extension
+[ 0.008216] Console: colour dummy device 80x25
+[ 0.009505] Calibrating delay loop (skipped), value calculated using timer frequency.. 20.00 BogoMIPS (lpj=40000)
+[ 0.009635] pid_max: default: 32768 minimum: 301
+[ 0.010714] LSM: initializing lsm=capability,integrity
+[ 0.012870] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
+[ 0.012948] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
+[ 0.040252] RCU Tasks Trace: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
+[ 0.040655] riscv: ELF compat mode supported
+[ 0.041099] ASID allocator using 16 bits (65536 entries)
+[ 0.042073] rcu: Hierarchical SRCU implementation.
+[ 0.042107] rcu: Max phase no-delay instances is 1000.
+[ 0.044384] EFI services will not be available.
+[ 0.045645] smp: Bringing up secondary CPUs ...
+[ 0.046625] smp: Brought up 1 node, 1 CPU
+[ 0.057137] devtmpfs: initialized
+[ 0.064078] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
+[ 0.064309] futex hash table entries: 256 (order: 2, 16384 bytes, linear)
+[ 0.066112] pinctrl core: initialized pinctrl subsystem
+[ 0.072724] NET: Registered PF_NETLINK/PF_ROUTE protocol family
+[ 0.080456] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
+[ 0.080750] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
+[ 0.081016] audit: initializing netlink subsys (disabled)
+[ 0.084218] thermal_sys: Registered thermal governor 'step_wise'
+[ 0.084758] cpuidle: using governor menu
+[ 0.085806] audit: type=2000 audit(0.040:1): state=initialized audit_enabled=0 res=1
+[ 0.099344] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
+[ 0.099384] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
+[ 0.103034] ACPI: Interpreter disabled.
+[ 0.104629] iommu: Default domain type: Translated
+[ 0.104662] iommu: DMA domain TLB invalidation policy: strict mode
+[ 0.106649] SCSI subsystem initialized
+[ 0.108322] usbcore: registered new interface driver usbfs
+[ 0.108532] usbcore: registered new interface driver hub
+[ 0.108678] usbcore: registered new device driver usb
+[ 0.119431] vgaarb: loaded
+[ 0.139122] clocksource: Switched to clocksource riscv_clocksource
+[ 0.141072] pnp: PnP ACPI: disabled
+[ 0.158856] NET: Registered PF_INET protocol family
+[ 0.159756] IP idents hash table entries: 4096 (order: 3, 32768 bytes, linear)
+[ 0.164351] tcp_listen_portaddr_hash hash table entries: 128 (order: 0, 4096 bytes, linear)
+[ 0.164450] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
+[ 0.164501] TCP established hash table entries: 2048 (order: 2, 16384 bytes, linear)
+[ 0.164727] TCP bind hash table entries: 2048 (order: 5, 131072 bytes, linear)
+[ 0.164988] TCP: Hash tables configured (established 2048 bind 2048)
+[ 0.165942] UDP hash table entries: 256 (order: 2, 24576 bytes, linear)
+[ 0.166364] UDP-Lite hash table entries: 256 (order: 2, 24576 bytes, linear)
+[ 0.167395] NET: Registered PF_UNIX/PF_LOCAL protocol family
+[ 0.170224] RPC: Registered named UNIX socket transport module.
+[ 0.170280] RPC: Registered udp transport module.
+[ 0.170291] RPC: Registered tcp transport module.
+[ 0.170305] RPC: Registered tcp-with-tls transport module.
+[ 0.170315] RPC: Registered tcp NFSv4.1 backchannel transport module.
+[ 0.170463] PCI: CLS 0 bytes, default 64
+[ 0.177633] workingset: timestamp_bits=46 max_order=16 bucket_order=0
+[ 0.180895] NFS: Registering the id_resolver key type
+[ 0.181717] Key type id_resolver registered
+[ 0.181750] Key type id_legacy registered
+[ 0.181976] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
+[ 0.182049] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
+[ 0.182691] 9p: Installing v9fs 9p2000 file system support
+[ 0.184022] NET: Registered PF_ALG protocol family
+[ 0.184294] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 246)
+[ 0.184406] io scheduler mq-deadline registered
+[ 0.184462] io scheduler kyber registered
+[ 0.184546] io scheduler bfq registered
+[ 0.187615] pci-host-generic 30000000.pci: host bridge /soc/pci@30000000 ranges:
+[ 0.188285] pci-host-generic 30000000.pci: IO 0x0003000000..0x000300ffff -> 0x0000000000
+[ 0.188845] pci-host-generic 30000000.pci: MEM 0x0040000000..0x007fffffff -> 0x0040000000
+[ 0.188906] pci-host-generic 30000000.pci: MEM 0x0400000000..0x07ffffffff -> 0x0400000000
+[ 0.189388] pci-host-generic 30000000.pci: Memory resource size exceeds max for 32 bits
+[ 0.189752] pci-host-generic 30000000.pci: ECAM at [mem 0x30000000-0x3fffffff] for [bus 00-ff]
+[ 0.191528] pci-host-generic 30000000.pci: PCI host bridge to bus 0000:00
+[ 0.191752] pci_bus 0000:00: root bus resource [bus 00-ff]
+[ 0.191831] pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
+[ 0.191884] pci_bus 0000:00: root bus resource [mem 0x40000000-0x7fffffff]
+[ 0.191897] pci_bus 0000:00: root bus resource [mem 0x400000000-0x7ffffffff]
+[ 0.193325] pci 0000:00:00.0: [1b36:0008] type 00 class 0x060000
+[ 0.287202] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
+[ 0.297113] printk: console [ttyS0] disabled
+[ 0.300265] 10000000.serial: ttyS0 at MMIO 0x10000000 (irq = 12, base_baud = 230400) is a 16550A
+[ 0.301524] printk: console [ttyS0] enabled
+[ 0.323403] SuperH (H)SCI(F) driver initialized
+[ 0.343215] loop: module loaded
+[ 0.344307] virtio_blk virtio0: 1/0/0 default/read/poll queues
+[ 0.348817] virtio_blk virtio0: [vda] 2097152 512-byte logical blocks (1.07 GB/1.00 GiB)
+[ 0.384041] e1000e: Intel(R) PRO/1000 Network Driver
+[ 0.384260] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
+[ 0.387880] usbcore: registered new interface driver uas
+[ 0.388206] usbcore: registered new interface driver usb-storage
+[ 0.389412] mousedev: PS/2 mouse device common for all mice
+[ 0.393353] goldfish_rtc 101000.rtc: registered as rtc0
+[ 0.394317] goldfish_rtc 101000.rtc: setting system clock to 2023-09-02T07:35:32 UTC (1693640132)
+[ 0.398165] syscon-poweroff poweroff: pm_power_off already claimed for sbi_srst_power_off
+[ 0.399069] syscon-poweroff: probe of poweroff failed with error -16
+[ 0.401754] sdhci: Secure Digital Host Controller Interface driver
+[ 0.401947] sdhci: Copyright(c) Pierre Ossman
+[ 0.402639] sdhci-pltfm: SDHCI platform and OF driver helper
+[ 0.403484] usbcore: registered new interface driver usbhid
+[ 0.403664] usbhid: USB HID core driver
+[ 0.404356] riscv-pmu-sbi: SBI PMU extension is available
+[ 0.405010] riscv-pmu-sbi: 16 firmware and 18 hardware counters
+[ 0.405216] riscv-pmu-sbi: Perf sampling/filtering is not supported as sscof extension is not available
+[ 0.410466] NET: Registered PF_INET6 protocol family
+[ 0.418128] Segment Routing with IPv6
+[ 0.418632] In-situ OAM (IOAM) with IPv6
+[ 0.419337] sit: IPv6,
IPv4 and MPLS over IPv4 tunneling driver
+[ 0.422999] NET: Registered PF_PACKET protocol family
+[ 0.424737] 9pnet: Installing 9P2000 support
+[ 0.425341] Key type dns_resolver registered
+[ 0.466679] debug_vm_pgtable: [debug_vm_pgtable ]: Validating architecture page table helpers
+[ 0.476006] clk: Disabling unused clocks
+[ 0.569745] EXT4-fs (vda): recovery complete
+[ 0.572787] EXT4-fs (vda): mounted filesystem b1c2c62f-2f6d-4da5-af97-58c3648c79f4 r/w with ordered data mode. Quota mode: disabled.
+[ 0.573412] VFS: Mounted root (ext4 filesystem) on device 254:0.
+[ 0.576056] devtmpfs: mounted
+[ 0.618132] Freeing unused kernel image (initmem) memory: 2200K
+[ 0.618994] Run /sbin/init as init process
+
+Please press Enter to activate this console.
+
+```
+
+## Base ISA extensions & ELF capabilities
+
+The Linux kernel boot log contains the following two lines:
+
+```sh
+[ 0.000000] riscv: base ISA extensions acdfhim
+[ 0.000000] riscv: ELF capabilities acdfim
+```
+
+Searching the source for `base ISA extensions` shows that the function handling ISA extension information lives in `arch/riscv/kernel/cpufeature.c`. This code was introduced in [commit 6bcff51][006]. The riscv_isa bitmap represents the intersection of the ISA extensions supported by all CPUs on the machine, while elf_hwcap represents only the intersection of the extensions that matter to user space.
+
+The function is fairly long, so it is analyzed below through added `//` comments:
+
+```c
+/* arch/riscv/kernel/cpufeature.c:102 */
+
+void __init riscv_fill_hwcap(void)
+{
+	// hwcap stands for hardware capability
+	struct device_node *node;
+	const char *isa;
+	char print_str[NUM_ALPHA_EXTS + 1];
+	int i, j, rc;
+	unsigned long isa2hwcap[26] = {0};
+	struct acpi_table_header *rhct;
+	acpi_status status;
+	unsigned int cpu;
+	// COMPAT_HWCAP_ISA_? == (1 << (ASCII code of the letter ? - 'A'))
+	isa2hwcap['i' - 'a'] = COMPAT_HWCAP_ISA_I;
+	isa2hwcap['m' - 'a'] = COMPAT_HWCAP_ISA_M;
+	isa2hwcap['a' - 'a'] = COMPAT_HWCAP_ISA_A;
+	isa2hwcap['f' - 'a'] = COMPAT_HWCAP_ISA_F;
+	isa2hwcap['d' - 'a'] = COMPAT_HWCAP_ISA_D;
+	isa2hwcap['c' - 'a'] = COMPAT_HWCAP_ISA_C;
+	isa2hwcap['v' - 'a'] = COMPAT_HWCAP_ISA_V;
+
+	elf_hwcap = 0;
+
+	// Clear the riscv_isa bitmap
+	bitmap_zero(riscv_isa, RISCV_ISA_EXT_MAX);
+
+	// If ACPI is enabled, look up the RHCT table and return on failure
+	if (!acpi_disabled) {
+		status = acpi_get_table(ACPI_SIG_RHCT, 0, &rhct);
+		if (ACPI_FAILURE(status))
+			return;
+	}
+
+	for_each_possible_cpu(cpu) {
+		// struct riscv_isainfo contains a 64-bit bitmap named isa
+		struct riscv_isainfo *isainfo = &hart_isa[cpu];
+		// hwcap of the current cpu
+		unsigned long this_hwcap = 0;
+
+		// Without ACPI, read the ISA information from the device tree
+		if (acpi_disabled) {
+			// Get the device tree node
+			node = of_cpu_device_node_get(cpu);
+			if (!node) {
+				pr_warn("Unable to find cpu node\n");
+				continue;
+			}
+			// Read the isa string from the device tree
+			rc = of_property_read_string(node, "riscv,isa", &isa);
+			of_node_put(node);
+			if (rc) {
+				pr_warn("Unable to find \"riscv,isa\" devicetree entry\n");
+				continue;
+			}
+		} else {
+			// With ACPI, get the isa string through ACPI
+			rc = acpi_get_riscv_isa(rhct, cpu, &isa);
+			if (rc < 0) {
+				pr_warn("Unable to get ISA for the hart - %d\n", cpu);
+				continue;
+			}
+		}
+
+		/*
+		 * For all possible cpus, we have already validated in
+		 * the boot process that they at least contain "rv" and
+		 * whichever of "32"/"64" this kernel supports, and so this
+		 * section can be skipped.
+		 */
+		// The isa string always starts with "rv" plus the bit width (4 characters), so skip them
+		isa += 4;
+
+		while (*isa) {
+			const char *ext = isa++;
+			const char *ext_end = isa;
+			// ext_long marks a multi-letter extension
+			bool ext_long = false, ext_err = false;
+
+			// Walk the extensions in the isa string character by character
+			switch (*ext) {
+			case 's':
+				/*
+				 * Workaround for invalid single-letter 's' & 'u'(QEMU).
+				 * No need to set the bit in riscv_isa as 's' & 'u' are
+				 * not valid ISA extensions.
It works until multi-letter
+				 * extension starting with "Su" appears.
+				 */
+
+				if (ext[-1] != '_' && ext[1] == 'u') {
+					++isa;
+					ext_err = true;
+					break;
+				}
+				fallthrough;
+			case 'S':
+			case 'x':
+			case 'X':
+			case 'z':
+			case 'Z':
+				/*
+				 * Before attempting to parse the extension itself, we find its end.
+				 * As multi-letter extensions must be split from other multi-letter
+				 * extensions with an "_", the end of a multi-letter extension will
+				 * either be the null character or the "_" at the start of the next
+				 * multi-letter extension.
+				 *
+				 * Next, as the extensions version is currently ignored, we
+				 * eliminate that portion. This is done by parsing backwards from
+				 * the end of the extension, removing any numbers. This may be a
+				 * major or minor number however, so the process is repeated if a
+				 * minor number was found.
+				 *
+				 * ext_end is intended to represent the first character *after* the
+				 * name portion of an extension, but will be decremented to the last
+				 * character itself while eliminating the extensions version number.
+				 * A simple re-increment solves this problem.
+				 */
+				ext_long = true;
+				for (; *isa && *isa != '_'; ++isa)
+					if (unlikely(!isalnum(*isa)))
+						ext_err = true;
+
+				ext_end = isa;
+				if (unlikely(ext_err))
+					break;
+
+				if (!isdigit(ext_end[-1]))
+					break;
+
+				while (isdigit(*--ext_end))
+					;
+
+				if (tolower(ext_end[0]) != 'p' || !isdigit(ext_end[-1])) {
+					++ext_end;
+					break;
+				}
+
+				while (isdigit(*--ext_end))
+					;
+
+				++ext_end;
+				break;
+			default:
+				/*
+				 * Things are a little easier for single-letter extensions, as they
+				 * are parsed forwards.
+				 *
+				 * After checking that our starting position is valid, we need to
+				 * ensure that, when isa was incremented at the start of the loop,
+				 * that it arrived at the start of the next extension.
+				 *
+				 * If we are already on a non-digit, there is nothing to do. Either
+				 * we have a multi-letter extension's _, or the start of an
+				 * extension.
+				 *
+				 * Otherwise we have found the current extension's major version
+				 * number. Parse past it, and a subsequent p/minor version number
+				 * if present. The `p` extension must not appear immediately after
+				 * a number, so there is no fear of missing it.
+				 *
+				 */
+				if (unlikely(!isalpha(*ext))) {
+					ext_err = true;
+					break;
+				}
+
+				if (!isdigit(*isa))
+					break;
+
+				while (isdigit(*++isa))
+					;
+
+				if (tolower(*isa) != 'p')
+					break;
+
+				if (!isdigit(*++isa)) {
+					--isa;
+					break;
+				}
+
+				while (isdigit(*++isa))
+					;
+
+				break;
+			}
+
+			/*
+			 * The parser expects that at the start of an iteration isa points to the
+			 * first character of the next extension. As we stop parsing an extension
+			 * on meeting a non-alphanumeric character, an extra increment is needed
+			 * where the succeeding extension is a multi-letter prefixed with an "_".
+			 */
+			if (*isa == '_')
+				++isa;
+
+#define SET_ISA_EXT_MAP(name, bit) \
+			do { \
+				if ((ext_end - ext == sizeof(name) - 1) && \
+				    !strncasecmp(ext, name, sizeof(name) - 1) && \
+				    riscv_isa_extension_check(bit)) \
+					set_bit(bit, isainfo->isa); \
+			} while (false) \
+
+			if (unlikely(ext_err))
+				continue;
+			if (!ext_long) {
+				// Single-letter extension
+				int nr = tolower(*ext) - 'a';
+				if (riscv_isa_extension_check(nr)) {
+					// Record it in this hart's hwcap
+					this_hwcap |= isa2hwcap[nr];
+					set_bit(nr, isainfo->isa);
+				}
+			} else {
+				// Multi-letter extension: match the name and set the corresponding bit
+				// riscv_isa_extension_check additionally enforces some constraints for Zicbom and Zicboz
+				/* sorted alphabetically */
+				SET_ISA_EXT_MAP("smaia", RISCV_ISA_EXT_SMAIA);
+				SET_ISA_EXT_MAP("ssaia", RISCV_ISA_EXT_SSAIA);
+				SET_ISA_EXT_MAP("sscofpmf", RISCV_ISA_EXT_SSCOFPMF);
+				SET_ISA_EXT_MAP("sstc", RISCV_ISA_EXT_SSTC);
+				SET_ISA_EXT_MAP("svinval", RISCV_ISA_EXT_SVINVAL);
+				SET_ISA_EXT_MAP("svnapot", RISCV_ISA_EXT_SVNAPOT);
+				SET_ISA_EXT_MAP("svpbmt", RISCV_ISA_EXT_SVPBMT);
+				SET_ISA_EXT_MAP("zba", RISCV_ISA_EXT_ZBA);
+				SET_ISA_EXT_MAP("zbb", RISCV_ISA_EXT_ZBB);
+				SET_ISA_EXT_MAP("zbs", RISCV_ISA_EXT_ZBS);
+				SET_ISA_EXT_MAP("zicbom", RISCV_ISA_EXT_ZICBOM);
+				SET_ISA_EXT_MAP("zicboz", RISCV_ISA_EXT_ZICBOZ);
+				SET_ISA_EXT_MAP("zihintpause", RISCV_ISA_EXT_ZIHINTPAUSE);
+			}
+#undef SET_ISA_EXT_MAP
+		}
+
+		/*
+		 * These ones were as they were part of the base ISA when the
+		 * port & dt-bindings were upstreamed, and so can be set
+		 * unconditionally where `i` is in riscv,isa on DT systems.
+		 */
+
+		// Without ACPI, unconditionally mark the following extensions as present in the bitmap
+		if (acpi_disabled) {
+			set_bit(RISCV_ISA_EXT_ZICSR, isainfo->isa);
+			set_bit(RISCV_ISA_EXT_ZIFENCEI, isainfo->isa);
+			set_bit(RISCV_ISA_EXT_ZICNTR, isainfo->isa);
+			set_bit(RISCV_ISA_EXT_ZIHPM, isainfo->isa);
+		}
+
+		/*
+		 * All "okay" hart should have same isa. Set HWCAP based on
+		 * common capabilities of every "okay" hart, in case they don't
+		 * have.
+		 */
+		// elf_hwcap is the intersection of the per-CPU hwcaps
+		if (elf_hwcap)
+			elf_hwcap &= this_hwcap;
+		else
+			elf_hwcap = this_hwcap;
+
+		// riscv_isa is the intersection of the per-CPU ISA extension bitmaps
+		if (bitmap_empty(riscv_isa, RISCV_ISA_EXT_MAX))
+			bitmap_copy(riscv_isa, isainfo->isa, RISCV_ISA_EXT_MAX);
+		else
+			bitmap_and(riscv_isa, riscv_isa, isainfo->isa, RISCV_ISA_EXT_MAX);
+	}
+	// If ACPI is enabled and the RHCT exists, release the table
+	// RHCT is the RISC-V Hart Capabilities Table, the structure used to describe CPU capabilities to the OS
+	if (!acpi_disabled && rhct)
+		acpi_put_table((struct acpi_table_header *)rhct);
+
+	/* We don't support systems with F but without D, so mask those out
+	 * here. */
+	// Linux requires D whenever F is present; otherwise F is masked out
+	if ((elf_hwcap & COMPAT_HWCAP_ISA_F) && !(elf_hwcap & COMPAT_HWCAP_ISA_D)) {
+		pr_info("This kernel does not support systems with F but not D\n");
+		elf_hwcap &= ~COMPAT_HWCAP_ISA_F;
+	}
+
+	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
+		riscv_v_setup_vsize();
+		/*
+		 * ISA string in device tree might have 'v' flag, but
+		 * CONFIG_RISCV_ISA_V is disabled in kernel.
+		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
+		 */
+		// If the kernel config does not enable V but the ISA string contains it, clear V in elf_hwcap
+		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
+			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
+	}
+
+	memset(print_str, 0, sizeof(print_str));
+	for (i = 0, j = 0; i < NUM_ALPHA_EXTS; i++)
+		if (riscv_isa[0] & BIT_MASK(i))
+			print_str[j++] = (char)('a' + i);
+	// Print the enabled single-letter extensions in alphabetical order
+	pr_info("riscv: base ISA extensions %s\n", print_str);
+
+	memset(print_str, 0, sizeof(print_str));
+	for (i = 0, j = 0; i < NUM_ALPHA_EXTS; i++)
+		if (elf_hwcap & BIT_MASK(i))
+			print_str[j++] = (char)('a' + i);
+	// Print the extensions exposed through the ELF hwcap in alphabetical order
+	pr_info("riscv: ELF capabilities %s\n", print_str);
+}
+```
+
+To summarize the code above:
+
+- The S-mode Linux kernel obtains RISC-V ISA extension information from the ISA string in ACPI or the device tree, but it does not use the string as is. It performs validity checks and other adjustments; for example, the kernel keeps F only when D is also present.
+- The ISA that the kernel finally reports is the intersection of the ISAs supported by the individual harts, further adjusted by the ISA-related options in the kernel config.
+- Without ACPI, the kernel obtains extension information from the ISA string in the device tree, and the Zicsr, Zifencei, Zicntr, and Zihpm extensions are enabled unconditionally.
+- With ACPI, Linux tries to get the RHCT (RISC-V Hart Capabilities Table), the ACPI structure that passes CPU information to the OS. If the RHCT exists, the ISA string is read from it.
+
+The bit numbers that these extensions occupy in the bitmap are defined in `arch/riscv/include/asm/hwcap.h`, one bit per extension. The low 26 bits (0 to 25) are reserved for single-letter extensions, multi-letter extensions are numbered from 26 (`RISCV_ISA_EXT_BASE`) upward, and the bitmap holds at most 64 bits:
+
+```c
+/* arch/riscv/include/asm/hwcap.h:16 */
+
+#define RISCV_ISA_EXT_a		('a' - 'a')
+#define RISCV_ISA_EXT_c		('c' - 'a')
+#define RISCV_ISA_EXT_d		('d' - 'a')
+#define RISCV_ISA_EXT_f		('f' - 'a')
+#define RISCV_ISA_EXT_h		('h' - 'a')
+#define RISCV_ISA_EXT_i		('i' - 'a')
+#define RISCV_ISA_EXT_m		('m' - 'a')
+#define RISCV_ISA_EXT_s		('s' - 'a')
+#define RISCV_ISA_EXT_u		('u' - 'a')
+#define RISCV_ISA_EXT_v		('v' - 'a')
+
+/*
+ * These macros represent the logical IDs of each multi-letter RISC-V ISA
+ * extension and are used in the ISA bitmap. The logical IDs start from
+ * RISCV_ISA_EXT_BASE, which allows the 0-25 range to be reserved for single
+ * letter extensions. The maximum, RISCV_ISA_EXT_MAX, is defined in order
+ * to allocate the bitmap and may be increased when necessary.
+ *
+ * New extensions should just be added to the bottom, rather than added
+ * alphabetically, in order to avoid unnecessary shuffling.
+ */
+#define RISCV_ISA_EXT_BASE		26
+
+#define RISCV_ISA_EXT_SSCOFPMF		26
+#define RISCV_ISA_EXT_SSTC		27
+#define RISCV_ISA_EXT_SVINVAL		28
+#define RISCV_ISA_EXT_SVPBMT		29
+#define RISCV_ISA_EXT_ZBB		30
+#define RISCV_ISA_EXT_ZICBOM		31
+#define RISCV_ISA_EXT_ZIHINTPAUSE	32
+#define RISCV_ISA_EXT_SVNAPOT		33
+#define RISCV_ISA_EXT_ZICBOZ		34
+#define RISCV_ISA_EXT_SMAIA		35
+#define RISCV_ISA_EXT_SSAIA		36
+#define RISCV_ISA_EXT_ZBA		37
+#define RISCV_ISA_EXT_ZBS		38
+#define RISCV_ISA_EXT_ZICNTR		39
+#define RISCV_ISA_EXT_ZICSR		40
+#define RISCV_ISA_EXT_ZIFENCEI		41
+#define RISCV_ISA_EXT_ZIHPM		42
+
+#define RISCV_ISA_EXT_MAX		64
+#define RISCV_ISA_EXT_NAME_LEN_MAX	32
+
+#ifdef CONFIG_RISCV_M_MODE
+#define RISCV_ISA_EXT_SxAIA		RISCV_ISA_EXT_SMAIA
+#else
+#define RISCV_ISA_EXT_SxAIA		RISCV_ISA_EXT_SSAIA
+#endif
+```
+
+Parsing ISA strings, however, runs into many problems. As the code above shows, the parsing is tedious and error-prone. See the related discussion on the [mailing list][005]:
+
+> There's been a bunch of off-list discussions about this, including at
+> Plumbers. The original plan was to do something involving providing an
+> ISA string to userspace, but ISA strings just aren't sufficient for a
+> stable ABI any more: in order to parse an ISA string users need the
+> version of the specifications that the string is written to, the version
+> of each extension (sometimes at a finer granularity than the RISC-V
+> releases/versions encode), and the expected use case for the ISA string
+> (ie, is it a U-mode or M-mode string). That's a lot of complexity to
+> try and keep ABI compatible and it's probably going to continue to grow,
+> as even if there's no more complexity in the specifications we'll have
+> to deal with the various ISA string parsing oddities that end up all
+> over userspace.
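
To make the version-handling part of this parsing concrete, the following is a minimal user-space sketch, not kernel code. It assumes a single multi-letter token such as `zba1p0` has already been split out of the ISA string at the `_` separators. The helper names `isa_ext_name_len` and `isa_ext_is` are hypothetical and exist only for this illustration. The sketch follows the same idea as the backward scan in `riscv_fill_hwcap`: strip a trailing `<major>` or `<major>p<minor>` version, then compare the remaining name case-insensitively.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper (not a kernel function): return the length of the
 * name part of one extension token, with a trailing "<major>" or
 * "<major>p<minor>" version removed.
 */
static size_t isa_ext_name_len(const char *tok, size_t len)
{
	size_t end = len;

	/* No trailing digit means the token carries no version. */
	if (end == 0 || !isdigit((unsigned char)tok[end - 1]))
		return end;

	/* Strip the last run of digits (the minor number, or the major one). */
	while (end > 0 && isdigit((unsigned char)tok[end - 1]))
		end--;

	/* A 'p' preceded by a digit means the digits just removed were the minor. */
	if (end >= 2 && tolower((unsigned char)tok[end - 1]) == 'p' &&
	    isdigit((unsigned char)tok[end - 2])) {
		end--;	/* drop the 'p' */
		while (end > 0 && isdigit((unsigned char)tok[end - 1]))
			end--;	/* drop the major number */
	}
	return end;
}

/* Hypothetical helper: does the token name the given extension, ignoring case and version? */
static int isa_ext_is(const char *tok, size_t len, const char *name)
{
	size_t n = isa_ext_name_len(tok, len);

	return n == strlen(name) && strncasecmp(tok, name, n) == 0;
}
```

For example, `isa_ext_is("Zba1p0", 6, "zba")` is true while `isa_ext_is("zbb", 3, "zba")` is false. Even this small piece needs care around names that themselves end in `p` or contain digits, which hints at why pushing such parsing into every user-space program was considered fragile.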
+
+The Linux kernel therefore introduced a new system call for user space, described in the next section.
+
+## RISC-V Hardware Probing Interface
+
+The elf_hwcap described in the previous section offers at most 64 bits, but user space may need to detect more than 64 extensions, so that mechanism also runs out of bits.
+
+To address these problems, the Linux kernel introduced, in [commit ea3de9c][004], a system call for probing hardware from user space. It is documented in `Documentation/riscv/hwprobe.rst`. Its parameters are an array of key-value pairs, the number of pairs, the number of CPUs, a CPU set, and a flags argument. It currently supports probing `m{arch,imp,vendor}id` and a small number of ISA extensions, and the key-value design allows more to be added in the future:
+
+```c
+struct riscv_hwprobe {
+	__s64 key;
+	__u64 value;
+};
+
+long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t pair_count, size_t cpu_count, cpu_set_t *cpus, unsigned int flags);
+```
+
+The C code related to extension detection is:
+
+```c
+/* arch/riscv/kernel/sys_riscv.c:125 */
+
+static void hwprobe_isa_ext0(struct riscv_hwprobe *pair,
+			     const struct cpumask *cpus)
+{
+	int cpu;
+	u64 missing = 0;
+
+	pair->value = 0;
+	if (has_fpu())
+		pair->value |= RISCV_HWPROBE_IMA_FD;
+
+	if (riscv_isa_extension_available(NULL, c))
+		pair->value |= RISCV_HWPROBE_IMA_C;
+
+	if (has_vector())
+		pair->value |= RISCV_HWPROBE_IMA_V;
+
+	/*
+	 * Loop through and record extensions that 1) anyone has, and 2) anyone
+	 * doesn't have.
+	 */
+	for_each_cpu(cpu, cpus) {
+		struct riscv_isainfo *isainfo = &hart_isa[cpu];
+
+		if (riscv_isa_extension_available(isainfo->isa, ZBA))
+			pair->value |= RISCV_HWPROBE_EXT_ZBA;
+		else
+			missing |= RISCV_HWPROBE_EXT_ZBA;
+
+		if (riscv_isa_extension_available(isainfo->isa, ZBB))
+			pair->value |= RISCV_HWPROBE_EXT_ZBB;
+		else
+			missing |= RISCV_HWPROBE_EXT_ZBB;
+
+		if (riscv_isa_extension_available(isainfo->isa, ZBS))
+			pair->value |= RISCV_HWPROBE_EXT_ZBS;
+		else
+			missing |= RISCV_HWPROBE_EXT_ZBS;
+	}
+
+	/* Now turn off reporting features if any CPU is missing it.
*/
+	pair->value &= ~missing;
+}
+```
+
+This code checks whether the extensions in `IMAFDCV_Zba_Zbb_Zbs` are supported. F and D are tied together: the Linux kernel either enables both or disables both.
+
+## Summary
+
+By analyzing the `riscv_fill_hwcap` function, this article described how Linux uses the device tree or ACPI to detect RISC-V ISA extensions and generate HWCAP, and introduced the hardware probing system call `riscv_hwprobe` that Linux recently added.
+
+In 2022, PLCT carried out a [survey][008] of RISC-V ISA extension detection mechanisms. At that time, the Linux kernel detected ISA extensions from the information in the device tree, SMBIOS, or ACPI passed up from M-mode, and user space could detect the RISC-V ISA through HWCAP, cpuinfo, environment variables, SIGILL, and similar means. Most of these methods, however, are ultimately based on the ISA string, whose parsing is error-prone, and user space lacked a system call for hardware probing.
+
+Today the user-space RISC-V ISA detection mechanism has changed somewhat: Linux v6.4 introduced the new system call `riscv_hwprobe` for probing hardware. It currently supports only a few features, but it addresses both the shortage of HWCAP bits and the lack of a hardware probing system call for user space.
+
+## References
+
+- [kernel.org][001]
+- [RISC-V Linux Boot Process Analysis][002]
+- [Current RISC-V ISA Extension Categories and Detection Methods][003]
+- [commit ea3de9c: RISC-V: Add a syscall for HW probing][004]
+- [Mailing list discussion introducing the RISC-V Hardware Probing User Interface][005]
+- [commit 6bcff51: RISC-V: Add bitmap reprensenting ISA features common across CPUs][006]
+- [Analysis of OpenSBI RISC-V ISA Extension Detection and Support][007]
+- [PLCT 2022 survey of RISC-V ISA extension detection methods][008]
+
+[001]: https://mirrors.edge.kernel.org/pub/linux/kernel/
+[002]: https://tinylab.org/riscv-linux-startup/
+[003]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20230715-riscv-isa-extensions-discovery-1.md
+[004]: https://github.com/torvalds/linux/commit/ea3de9ce8aa280c5175c835bd3e94a3a9b814b74#diff-24372ab3ad2d22486b15d8a8f7e9e53a04e16efe0a392cec83786e24cb767bdd
+[005]: https://lore.kernel.org/all/20230411-primate-rice-a5c102f90c6c@wendy/
+[006]: https://github.com/torvalds/linux/commit/6bcff51539ccae5431a01f60293419dbae21100f
+[007]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20230816-riscv-isa-discovery-4-opensbi.md
+[008]: https://github.com/plctlab/PLCT-Open-Reports/blob/master/20220706-%E9%83%91%E9%88%9C%E5%A3%AC-discovery.pdf