diff --git a/PRODUCT_DOCS/assets/datop1.png b/PRODUCT_DOCS/assets/datop1.png new file mode 100644 index 0000000000000000000000000000000000000000..a798ff256c3f6abc0b1dfaa200bab76962adbb3b Binary files /dev/null and b/PRODUCT_DOCS/assets/datop1.png differ diff --git a/PRODUCT_DOCS/assets/datop2.png b/PRODUCT_DOCS/assets/datop2.png new file mode 100644 index 0000000000000000000000000000000000000000..bb5c5e8ec2ee954701e0c49a2b7209462b5fe610 Binary files /dev/null and b/PRODUCT_DOCS/assets/datop2.png differ diff --git a/PRODUCT_DOCS/assets/datop3.png b/PRODUCT_DOCS/assets/datop3.png new file mode 100644 index 0000000000000000000000000000000000000000..f863b7c1a111f35db9fc16742807cbf611520b29 Binary files /dev/null and b/PRODUCT_DOCS/assets/datop3.png differ diff --git a/PRODUCT_DOCS/maintainers.yaml b/PRODUCT_DOCS/maintainers.yaml index a89acef6796ca0d1f577fd24803268ff4a4caeae..4e79c033fd4bd64a18ad6ae25826133f3facea0b 100644 --- a/PRODUCT_DOCS/maintainers.yaml +++ b/PRODUCT_DOCS/maintainers.yaml @@ -7,7 +7,7 @@ maintainers: - openanolis_id: suli0002 gitee_id: suli01 - other_group: &other_group - - openanolis_id: + - openanolis_id: ~ gitee_id: yutting123 # 指定文档目录对应的用户组 paths: diff --git "a/PRODUCT_DOCS/test/1-HML\347\224\250\346\210\267\346\214\207\345\215\227\350\257\264\346\230\216.md" "b/PRODUCT_DOCS/test/1-HML\347\224\250\346\210\267\346\214\207\345\215\227\350\257\264\346\230\216.md" new file mode 100644 index 0000000000000000000000000000000000000000..497e8eb98e4a0f4aaaecd8b3e0c41e17a9f6c4ad --- /dev/null +++ "b/PRODUCT_DOCS/test/1-HML\347\224\250\346\210\267\346\214\207\345\215\227\350\257\264\346\230\216.md" @@ -0,0 +1,123 @@ +## HML + +### 简介 + +HML是基于海光CPU平台构建一套高性能的数学函数库,基于海光CPU架构特点极致优化,能够充分发挥海光CPU计算性能。 + +官方发布的 HML数学库位于 gitee 的 `hygon-devkit` 仓库,地址: + +```sh +git clone https://gitee.com/anolis/hygon-devkit.git +``` + +HML高性能数学库在hygon-devkit中的目录结构示意图如下: + +``` +hygon-devkit/ + ├─ hml + ├──pkg + │ └── hml_1.0.0 + │ + └── README.md + +``` + +* pkg目录:内含各版本hml高性能数学库。 +* README.md文件:有关HML的简单情况。 + + +### 安装与使用 + +#### 环境配置 + +确保gcc版本号>=8.5.0 +```sh +可以通过如下命令查看gcc版本号 +gcc -v +``` + +#### 安装 + +* 安装包命令规则 + +包命名规则如下: +hml----.. +如: +hml-1.0.0-2024-0320-rc.x86_64.rpm + +* RPM安装 +```sh +# 1. 安装 + sudo rpm -Uvh hml-1.0.0-2024-0320-rc.x86_64.rpm + +# 2. 检查是否安装成功 + rpm -qi hml + +# 3. 显示安装路径 + rpm -ql hml + +# 4. 卸载 + sudo rpm --verbose --erase hml +``` + +* DEB安装 +```sh +# 1. 检查DEB信息 + dpkg -I hml-1.0.0-2024-0320-rc.x86_64.deb + +# 2. 安装 + sudo dpkg -i hml-1.0.0-2024-0320-rc.x86_64.deb + +# 3. 检查安装是否成功 + dpkg -l hml + +# 4. 卸载 + sudo dpkg -r hml +``` + +* 安装路径 + HML库会被安装到/opt/hygon/目录下,包含blas、fft、smm、sparse、vml、vsip库。 + +#### 使用 + +* 添加HML库路径到环境变量 +```sh + export LD_LIBRARY_PATH=/opt/hygon:$LD_LIBRARY_PATH +``` + +* 链接HML库 +```sh +# 1. 链接BLAS库 + -L /opt/hygon/blas/lib -lblis-hg + +# 2. 链接FFT库 + -L /opt/hygon/fft/fftw_double/lib -lfftw3 + -L /opt/hygon/fft/fftw_single/lib -lfftw3 + +# 3. 链接SMM库 + -L /opt/hygon/smm/lib -lsmm-hg + +# 4. 链接SPARSE库 + -L /opt/hygon/sparse/lib -lhml_sparse-hg + +# 5. 链接VML库 + -L /opt/hygon/vml/lib -lvml-hg + +# 6. 
链接VSIP库 + -L /opt/hygon/vsip/lib -lvsip-hg +``` + +### 使用指南 + +参考HML用户指南.pdf(https://gitee.com/anolis/hygon-devkit/blob/master/hml/HML%E7%94%A8%E6%88%B7%E6%8C%87%E5%8D%97.pdf) + +指南包含: + +* HML库介绍 +* HML库安装 +* HML_BLAS库接口和使用说明以及代码示例。 +* HML_SMM库接口和使用说明以及代码示例。 +* HML_SPARSE接口和使用说明以及代码示例。 +* HML_VML接口和使用说明以及代码示例。 +* HML_VSIP接口和使用说明以及代码示例。 + diff --git a/PRODUCT_DOCS/test/test1.md b/PRODUCT_DOCS/test/test1.md index 9427013e688ae858d83ca72a3c431b3df484ba56..232454fb4a03b845b99477fe90090937a4e489f5 100644 --- a/PRODUCT_DOCS/test/test1.md +++ b/PRODUCT_DOCS/test/test1.md @@ -1,3 +1,982 @@ -# test -__test2__eeeeehdwuieowiwjoiqw -看看这个test在不在 \ No newline at end of file +# Anolis OS Cloud Kernel: RAS White Paper + +## REVISION HISTORY + +| DATE | VERSION | DESCRIPTION | AUTHOR | APPROVER | +| ---------- | ------- | --------------- | ----------------------- | ----------- | +| 2022/12/31 | 1.0 | Initial version | Shuai Xue, Ruidong Tian | Baolin Wang | + +## Terms and Abbreviations + +| Abbreviation | Definition | +| ------------ | ----------------------------------------------------------------------------- | +| RAS | Reliability, Availability and Serviceability | +| SLA | Service Level Agreement | +| CE | Correctable Error | +| UCE | Uncorrected Correctable Error | +| MCA | Machine-Check Architecture | +| CMCI | Corrected Machine Check Interrupt | +| MCE | Machine-Check Exception | +| SEA | Synchronous External Abort | +| ELx | Exception levels are referred to as EL, with x as a number between 0 and 3 | +| ECC | Error Correction Code | +| SECDED | Single-bit Error Correction and Double-bit Error Detection | +| TF-A | Trusted Firmware-A | +| HEST | Hardware Error Source Table | +| GHES | Generic Hardware Error Source | + +## Abstract + +Reliability, availability and serviceability (RAS) is a computer hardware engineering term referring to the elimination of hardware failures to ensure maximum system uptime. + +This document describes the memory RAS features in detail, explaining how server availability is enhanced with the memory RAS features on Yitian 710 servers running Anolis OS Cloud Kernel. + +## Introduction + +The server is one of the key components of any modern data center infrastructure. It consists of a variety of hardware parts, including processors, storage devices, PCIe devices, power supplies, and fans. + +In today’s hyper scale Cloud Data centers, correct server operation and data integrity are critical to ensure service continuity. In other words, we must avoid data corruption no matter data is stored in any server component (memory, cache, or processor registers) or transmitted through any platform links (Intel®UPI, PCI Express, or DMI). + +Server reliability, availability, and serviceability (RAS) are crucial issues for modern enterprise IT shops that deliver mission-critical applications and services, as application delivery failures can be extremely costly per hour of system downtime. Although hardware failures are rare, they are inevitable but random events, especially for large scale data centers. If such incidents are not efficiently diagnosed, the consequences may be very serious and sometimes even catastrophic, such as data corruption or server crash. which are top concerns to meet SLAs (Service Level Agreement) for cloud end users. The likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments. 
Furthermore, considering today’s server system with more and more CPU cores shipped on hundreds of Virtual Machines (VM) and DDR DIMMs operating on it, the impact of server crash caused by hardware failures is much bigger than before. + +Modern CPU offers an extensive and robust set of RAS features in silicon to provide error detection, correction, containment, and recovery in all processors, memory, and I/O data paths based on Intel Machine Check Architecture (MCA) Recovery mechanism or ARM v8.2 RAS Extension. When a server component fails, OS with such RAS features is capable of recovery from hardware error, maximizing service availability and maintaining data integrity. + +## RAS Mechanism Overview + +### Error categories + +One of the most popular RAS schemes used in the memory subsystem is Error Correction Code (ECC) SECDED (Single-bit Error Correction and Double-bit Error Detection), which as its name indicates, the DDR controller can correct single-bit errors and detect double-bit errors on the received data from the DRAMs. + +Talking about detected hardware errors, we can classify memory errors as either corrected errors (CE) or uncorrected errors (UCE). + +- **Correctable Error (CE)** - the hardware error detection mechanism detected and automatically corrected the error. +- **Uncorrected errors (UCE)** - are severe enough, hardware detects but cannot correct. + +![MCA categories 2.png](../../assets/MCA_categories_2.png) + +Typically, uncorrectable errors further fall into three categories: + +- **Uncorrected Recoverable Errors (UCR)** - are uncorrected errors that have been detected and signaled but have not corrupted the processor context. For certain UCR errors, this means that once system software has performed a certain recovery action, it is possible to continue execution on this processor. UCR error reporting provides an error containment mechanism for data poisoning. It can be further divided into: + - **Action Required (AR)**: The error occurs in execution context. If such an error is detected, and the memory access has been architecturally executed, that error is considered “consumed”. CPU will signal a synchronous exception when an error is detected and the processor already consumes the memory. OS requires to take action (for example, offline failure page/kill failure thread) to recover this uncorrectable error. + - **Action Optional (AO)**: The error is detected out of processor execution context, e.g. when detected by a background scrubber or accessed by prefetch instruction. In this scenario, the data in the memory are corrupted, but OS is optional to take action to recover this uncorrectable error. +- **Uncorrected Error (UC)** - 2 bit (uncorrectable) error occurs and can not be corrected by hardware. The processor context is corrupted and cannot continue to operate the system. OS requires to panic immediately. + +OS will take specific actions based on the above failures. Handling CEs is done in silicon, e.g. using ECCs and can be made transparent to system. Handling DUEs, however, can require collaboration from higher layers in the hardware-software stack, from silicon to virtual memory manager, to the operating system (OS), and sometimes even the application layer. + +### X86 MCA Recovery + +The new Intel Xeon Scalable Family processors support recovery from some memory errors based on the Machine Check Architecture (MCA) Recovery mechanism. The figure shows a basic workflow with legacy MCA or EMCA. 
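+
+As a quick way to observe the MCA machinery from the OS side, the per-CPU MCA banks and the polling interval used for corrected errors are exposed through sysfs on x86. The sketch below is illustrative only; the exact attribute names and paths may vary between kernel versions.
+
+```bash
+# List the MCA bank attributes visible to CPU 0 (one entry per MC bank).
+ls /sys/devices/system/machinecheck/machinecheck0/
+
+# Polling interval (in seconds) used for corrected errors when CMCI is not available.
+cat /sys/devices/system/machinecheck/machinecheck0/check_interval
+```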
+ +Prior to enhanced machine check architecture (EMCA), IA32-legacy version of Machine Check Architecture (MCA) implemented error handling where all the errors were logged in architected registers (MC banks) and signaled to OS/hypervisor. CMCI is signaled only when CE is over threshold and OS CMCI handler, aka, `threshold_interrupt` read MC Banks and other HW registers for further error handling. MCE is signaled when uncorrected or fatal errors are detected and its handler `do_machine_check` will poison the page and then kill current thread in memory failure. + +EMCA enables BIOS-based recovery from errors which redirects MCE and CMCI to firmware first (via SMI) before sending it to the OS error handler. It allows firmware first to handle, collect, and build enhanced error logs then report to system software. + +![ras_x86.png](../../assets/ras_x86.png) + +### ARM v8.2 RAS Extension + +The RAS Extension is a mandatory extension to the Armv8.2 architecture, and it is an optional extension to the Armv8.0 and Armv8.1 architectures. The figure shows a basic workflow with Firmware First mode. +![m1_ras_flow.png](../../assets/m1_ras_flow.png) + +- Prerequisite: System boot and init + + - Platform RAS driver init: BL31 initializes SPM (includes MM dispatcher) and SDEI dispatcher, UEFI query and update error source info in HEST + - OS RAS driver init: HEST driver scans HEST table and registers error handlers by APEI notification, e.g. SDEI, SEA, GPIO, etc. + +1. RAS event (UE or CE) occurred, the event will be routed to EL3 (SPM). +2. SPM routes the event to RAS error handler in S-EL0 (MM Foundation). +3. MM Foundation creates the CPER blobs by the info from RAS Extension. +4. SPM notifies RAS event through APEI notification, e.g. SDEI, SEA, etc. to call the corresponding OS registered handler. +5. OS gets the CPER blobs by Error Status Address block, processes the error, and tries to recover. +6. OS reports the error event by RAS tracepoints. +7. rasdaemon log error info from RAS event to recorder. + +For example, the platform specifies SDEI as an APEI notification to handle RAS events. As part of initialization, the kernel registers a handler for a platform event, enables the event, and unmasks the current PE. At a later point in time, a critical event, e.g. DDR UE interrupt is trapped into EL3. EL3 performs a first-level triage of the event, and a RAS component assumes further handling. The dispatch completes, but intends to involve Non-secure world UEFI in further handling, and therefore decides to explicitly dispatch an event (which the kernel had already registered for). + +## RAS Solution on ANCK + +Modern CPU offers an extensive and robust set of RAS features in silicon to provide error detection, correction, containment, and recovery in all processors, memory, and I/O data paths based on Intel Machine Check Architecture (MCA) Recovery mechanism or ARM v8.2 RAS Extension. The RAS mechanism is intended to assist CPU designers and CPU debuggers in diagnosing, isolating, and understanding processor failures. It is also intended to help system administrators detect transient and age-related failures, suffered during long-term operation of the server. 
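+
+Whether a given platform actually wires up this firmware-first path can be checked on a running system: the ACPI tables published by the firmware and the GHES error sources registered by the kernel are both visible after boot. The commands below are a minimal sketch assuming an ACPI/APEI-capable platform; the exact boot messages and config file location differ between firmware and kernel versions.
+
+```bash
+# ACPI tables exported by the firmware: HEST describes the hardware error sources,
+# EINJ (if present) provides the error injection interface used later in this guide.
+ls /sys/firmware/acpi/tables/ | grep -E 'HEST|EINJ'
+
+# GHES/APEI initialization messages from the kernel RAS drivers.
+dmesg | grep -iE 'GHES|HEST|APEI'
+
+# Kernel options backing the flow described above (names may vary by kernel version).
+grep -E 'CONFIG_ACPI_APEI_GHES|CONFIG_ACPI_APEI_SEA|CONFIG_ARM_SDE_INTERFACE' /boot/config-$(uname -r)
+```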
+ +To reduce systems downtime, the OS recovery process for ensuring reliable hardware performance is to detect and correct errors where possible, recover from uncorrectable errors through either physical or logical replacement of a failing component or data path, and prevent future errors by replacing in timely fashion components most likely to fail. + +The figure shows the system error handling flow with Anolis OS. + +![RAS_OS_Error_Flow.png](../../assets/RAS_OS_Error_Flow.png) + +### Memory Failure Recovery + +The RAS mechanism is used to detect, signal, and record machine fault information. Some of these faults are correctable, whereas others are uncorrectable. The Memory Failure Recovery capabilities of RAS mechanism allow systems to continue to operate when an uncorrected error is detected in the system. If not for these capabilities, the system would crash and might require hardware replacement or a system reboot. + +When an uncorrectable error is detected on a requested memory address, data poisoning is used to inform the CPU that the data requested has an uncorrectable error. When the hardware detects an uncorrectable memory error, it routes a poison bit along with the data to the CPU. For the Intel architecture, when the CPU detects this poison bit, it sends a processor interrupt signal to the operating system to notify it of this error. The operating system can then examine the uncorrectable memory error, determine if the software can recover, and perform recovery actions via an interrupt handler. + +Memory Failure Recovery handles UCR errors including: + +- AR are synchronous Errors. There are two types of such errors signaled as data abort or instruction abort. For example, data abort is detected by Data Cache Unit (DCU) and instruction abort is detected by Instruction Fetch Unit (IFU) which are both signaled as Machine Check Exception. The analogy exception is Synchronous External Abort in Arm64 platform. + +- AO are asynchronous Errors. Such errors are detected by memory patrol scrub, prefetch, Last Level Cache (LLC) explicit writeback transaction for X86 platform or store less than ECC protection granularity, e.g. per 64 bit on Neoverse N1 and N2. + +The kernel will attempt to hard-offline the page, by trying to unmap the page or killing any owner, or triggering IO errors if needed. This may kill any processes accessing the page. The kernel will avoid to access this page assuming it's poisoned by the hardware. +Let's dive into more details about Anolis OS Cloud Kernel running on Severs capable of Intel MCA Recovery or ARM v8.2 RAS Extension. + +#### User Space Action Required Recovery + +In Linux, user memory and kernel memory are independent and implemented in separate address spaces. The address spaces are virtualized, meaning that the addresses are abstracted from physical memory (through a process detailed shortly). In fact, the kernel itself resides in one address space, and each process resides in its own address space, so each process can be isolated completely and protected by the paging mechanism. These address spaces consist of virtual memory addresses, permitting many processes with independent address spaces to refer to a considerably smaller physical address space (the physical memory in the machine). Not only is this convenient, but it's also secure, because each address space is independent and isolated and therefore secure. One isolated address space per process is the basis of preventing the fault from being propagated to the enclosing scope or process. 
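+
+This clear ownership between a physical page and the process mappings that reference it is also what the validation tools used later in this document (`victim`, `vtop`) rely on when they translate a virtual address into the physical address to inject. A rough shell equivalent of that translation, assuming 4KB pages, root privileges, and a little-endian host, is sketched below; the address used is purely hypothetical.
+
+```bash
+# Translate a virtual address of process $PID into a physical address via /proc/<pid>/pagemap.
+# Each pagemap entry is a native 64-bit word; bits 0-54 hold the PFN, bit 63 means "page present".
+PID=$$
+VADDR=$((0x400000))        # hypothetical address, replace with a real mapping of the process
+ENTRY=$(dd if=/proc/$PID/pagemap bs=8 skip=$((VADDR / 4096)) count=1 2>/dev/null | od -An -tx8 | tr -d ' ')
+PFN=$(( 0x$ENTRY & ((1 << 55) - 1) ))
+printf 'paddr = 0x%x\n' $(( PFN * 4096 + VADDR % 4096 ))
+```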
+
+Without OS memory failure recovery and hardware data poisoning support, a process consuming poison is regarded as a fatal event and the kernel crashes immediately. With these features, when the OS kernel receives a UCE event, the `memory_failure` function (the HWPoison handler) analyzes the log to verify whether recovery is feasible. It then takes action to offline the affected memory page and logs the event in mcelog or the RAS tracepoints; the possible results of the action are ignored, recovered, delayed, or failed.
+
+The HWPoison handler starts the recovery action by isolating the affected page and marking it with a “poisoned” tag to disallow any reuse of the page. In the case of an AR-instruction abort event, the HWPoison handler then reloads the 4KB page containing the instruction into a new physical page and resumes normal operation. In the case of an AR-data abort event, the HWPoison handler triggers a “SIGBUS” event to take further recovery action by notifying only the accessing process, or any owner process configured by a hwpoison-aware technique such as prctl or early kill. The application then has a choice to either reload the data and resume normal execution, or terminate itself to avoid crashing the entire system.
+
+![EL0_Recovery.png](../../assets/EL0_Recovery.png)
+
+#### Kernel Space Action Required Recovery
+
+The kernel itself resides in one address space and contains a process scheduler, networking stack, virtual file system, and device drivers for hardware support, to name just a few, shared by all user space processes. When a user space application requires the services provided by the kernel, it signals the kernel to execute a syscall and switches to kernel mode for the duration of the syscall execution. In principle, if a UCE error is triggered while executing OS kernel code, the error is fatal.
+The kernel also provides user space memory access APIs for cross-space data movement from or to user memory. Cross-space data movements are restricted in Linux to special functions, defined in ``. Such a movement is performed either by a generic (memcpy-like) function or by functions optimized for a specific data size (char, short, int, long). The role of the data-movement functions is shown in the following figure as it relates to the types involved in a copy (simple vs. aggregate); note that not all user access APIs are shown.
+
+![uaccess.png](../../assets/uaccess.png)
+
+For example, when a user process tries to write a buffer to a file, the kernel copies the data from userspace and then writes it to disk. If a UCE error occurs in the userspace buffer, the kernel will consume the poisoned data while copying from userspace. In such a case, a system-wide reboot is not necessary. The point behind Kernel Space Action Required Recovery is that the poisoned data manipulated by the kernel is owned by the user process. If the application that initiated the copy and owns the corrupt data can be easily identified by the kernel, it is possible to isolate the corrupt data by marking the affected page with the ‘poison’ tag and terminating the initiating/impacted applications to stop the corrupt data from spreading.
+
+The mechanism is to record uaccess locations in the exception table (extable) in advance and, while handling the synchronous error, change the PC to the fixup handler, so that the uaccess jumps to the fixup handler which then ends the uaccess gracefully. If the exception is fixed up correctly, the kernel can avoid a panic. In the copy-from-user case, e.g.
initiated by write(2), it is not even necessary to send a SIGBUS. System calls should return -EFAULT or a short count for write(2). The Figure shows the basic workflow for Arm64 platform and the implementation of the X86 platform is similar. + +![EL2_Recovery_x86.png](../../assets/EL2_Recovery_x86.png) + +#### Action Optional Recovery: Patrol Scrub + +ECC Patrol Scrubber is a common block in DDR Controller (DDRC) capable of generating initialization write commands, periodic read commands, periodic RMW commands, and correction RMW commands. It proactively searches the system memory, repairing correctable errors. Periodic scrubbing is performed by the ECC Patrol Scrubber to prevent the accumulation of single-bit errors and increase the reliability of the system by correcting single-bit ECC errors in time, before they turn into uncorrectable 2-bit errors. +When an uncorrectable 2-bit error is detected by Patrol Scrubber, an interrupt will be signaled. In such case, kernel will just unmap the poisoned page because no process is accessing the poison data by default. + +On X86 old generation platform, after the patrol scrub detects memory uncorrected data errors, it will report the OS by MCE. The new generation like Intel® Xeon® Processor-based platforms have an `UCE_TO_CE_DOWNGRAGE` mode where the users can request the memory controller to report UCE found by the patrol scrubber as a corrected type. It is also called ‘downgrading patrol scrub CE/SRAO to CE’. Those errors are signaled by using CMCI, a process less disruptive than a machine check and thus helps avoid double MCE interrupts to crash the system. We recommend setting it on. + +![scrub_recovery](../../assets/Scrub_Recovery.png) + +#### Action Optional Recovery: Prefetch + +Many modern processors implement implicit hardware prefetching and support software prefetching. With software prefetching the programmer or compiler inserts prefetch instructions into the program. For example, Prefetch from Memory (`PRFM`) enables code to provide a hint to the memory system that data from a particular address will be used by the program soon. While for implicit hardware prefetching, the processor monitors the memory access pattern of the running program and tries to predict what data the program will access next and prefetches that data. + +If a prefetch request accesses to poison data, an asynchronous error will be detected and an interrupt will be signaled, e.g. CMCI on Intel Icelake and SPI on Yitian 710. In such case, kernel will just unmap the poison page like Patrol Scrub error. + +Another prefetch scenario we observed is that the poisoned page may still be accessed even though all its owned user processes are killed. After a page is poisoned, it will never be reused, e.g. reallocated to other processes. The problem is that the poisoned page is only unmapped from the page table of user-space process, the kernel page table of the linear mapping range is not considered. It requires dynamically splitting the target linear mapping into PTE granularity and then clearing the PTE valid attribute of the related virtual address while processing memory failure. As a result, the poisoned page will be marked as not-present, which avoids speculative and prefetch access. + +#### Action Optional Recovery: Store + +Write is another type of request which may read the poison data from DDR controller. On Yitian 710, L2 cache is protected by a per 64-bit ECC scheme, a write less than 64bit will trigger asynchronous External Aborts, signaled as SErrors. 
Similarly, an asynchronous interrupt (CMCI) is signaled on the X86 platform. In such a case, the firmware must take extra care not to report the error to the kernel as fatal, so that a system-wide reboot is avoided.
+
+Unlike a read access, a write access does not propagate the error. When such an error is detected, the kernel regards it as an AO asynchronous error and only unmaps the poisoned page. However, the write does not take effect, resulting in data loss; a subsequent full 64-bit write has the opportunity to correct the error. When a process later tries to consume the poisoned page, the HWPoison handler triggers a “SIGBUS” event to take further recovery action by notifying only the accessing process, or any owner process configured by a hwpoison-aware technique such as prctl or early kill.
+
+#### HWPoison-aware Strategy
+
+There are in principle two hwpoison-aware strategies to kill processes on poison:
+
+- just unmap the data and wait for an actual reference before killing;
+- kill all processes that have the corrupted, non-reloadable page mapped as soon as the corruption is detected.
+
+Both have advantages and disadvantages and should be used in different situations. **Right now both are implemented and can be switched** with the sysctl vm.memory_failure_early_kill. The default is late kill. Applications can override this setting individually with the PR_MCE_KILL prctl. For example, if early kill is set by `sysctl -w vm.memory_failure_early_kill=1`, the kernel will kill any process that has mapped the poisoned page when an uncorrectable 2-bit error is detected by the Patrol Scrubber.
+
+Note, the kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can handle this if they want to, while for AR synchronous errors the kill is done using a catchable SIGBUS with BUS_MCEERR_AR.
+
+### Memory Predictive Failure Analysis with Rasdaemon
+
+When a 1-bit error is detected, it is transparently corrected by the hardware ECC mechanism, and internal counters are updated. If a correctable fault occurs in memory, we do not need to perform any recovery action in the OS. However, if we keep seeing correctable errors, the memory is perhaps failing. To avoid the possibility of future uncorrectable faults on the same page, we can copy the data to a different page and mark the page as offline. This is the mechanism used by Memory Predictive Failure Analysis (PFA).
+
+PFA is powered by the userspace rasdaemon package. Rasdaemon, written by Mauro Carvalho Chehab, is one of the tools to gather MCE information. Previously, the task was performed by the mcelog package; however, the driver it depends on has been deprecated since kernel 4.12, so we recommend switching to the new-generation rasdaemon solution.
+
+If a memory error is detected and signaled, the related OS handler reports it to userspace through RAS tracepoints, together with EDAC-decoded DIMM statistics, for accounting and predictive failure analysis. Rasdaemon runs as a daemon that monitors the platform RAS reports from the Linux kernel trace events, and it can optionally record RAS events via SQLite, which has the benefit of keeping a persistent record of the events. Based on the statistical results, actions can be configured and taken to prevent corrected errors from evolving into uncorrected errors, for example a soft offline or hard offline action when a page exceeds an error threshold within the refresh cycles, e.g. 50 CEs per 24 hours.
When a soft action is specified, the kernel will then attempt to soft-offline it, by moving the contents elsewhere or dropping it if possible. The kernel will then be placed on the bad page list and never be reused. The page is still accessible, not poisoned. The kernel will never kill anything for this, but rather fail the offline. + +Note, the RAS feature is only covered but not limited to memory, the processor, PCIe, and Platform(e.g. CMN, GIC, SMMU, etc) RAS are also supported on Anolis OS Cloud Kernel. + +## RAS Validation Guide + +EINJ provides a hardware error injection mechanism. It is very useful for debugging and testing APEI and RAS features in general. In this white paper, we take Yitian 710 running Anolis OS as an example. Note that this guide is also suitable for other platforms with advanced RAS features. + +### Prerequisite + +#### BIOS Requirement + +You need to check whether your BIOS supports EINJ first. For Panjiu M Series equipped with Yitian 710, make ensure to set the following configuration properly. + +```bash +[Platform Configuration][Processor Configuration][CPU Poison] +[Platform Configuration][Memory RAS Configuration][Poison] +[Platform Configuration][Memory RAS Configuration][CE threshold ]<0> +[Platform Configuration][Memory RAS Configuration][Ecc] +[Platform Configuration][PCI-E Configuration][PCIe RAS Support] +[Platform Configuration][PCI-E Configuration][AER CE] +[Platform Configuration][Advance Configuration][Global RAS Enable] +[Platform Configuration][Advance Configuration][EINJ Enable] +[Platform Configuration][Advance Configuration][Route EA to El3] +``` + +#### OS Requirement + +Then, you need to check whether your BIOS supports EINJ. For that, look for early boot messages similar to this one, e.g. on Yitian 710 : + +```bash +#dmesg | grep EINJ +[ 0.000000] ACPI: EINJ 0x00000000F8FAFE18 000150 (v01 PTG PTG01 00000000 PTG 20200717) +``` + +which shows that the BIOS is exposing an EINJ table - it is the mechanism through which the injection is done. + +By default, the EINJ driver is built-in on Anolis OS. If you build kernel from scratch, make sure the following are options enabled in your kernel configuration: + +```shell +CONFIG_DEBUG_FS +CONFIG_ACPI_APEI +CONFIG_ACPI_APEI_EINJ +``` + +Check if the einj module is loaded: + +```shell +$ lsmod | grep einj +einj 16384 0 +``` + +If not, load the einj modules by yourself + +```shell +modprobe einj +``` + +### EINJ Interface + +The EINJ user interface is in \/apei/einj, by default, `/sys/kernel/debug/apei/einj`. + +```bash +#ls /sys/kernel/debug/apei/einj/ +available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags +``` + +The standard error types for the EINJ interface include Processor, Memory, PCIe, and Platform. The file `available_error_type`displays the supported standard error types and their severities, e.g. + +```bash +#cat /sys/kernel/debug/apei/einj/available_error_type +0x00000001 Processor Correctable +0x00000002 Processor Uncorrectable non-fatal +0x00000004 Processor Uncorrectable fatal +0x00000008 Memory Correctable +0x00000010 Memory Uncorrectable non-fatal +0x00000020 Memory Uncorrectable fatal +0x00000040 PCI Express Correctable +0x00000080 PCI Express Uncorrectable non-fatal +0x00000100 PCI Express Uncorrectable fatal +0x00000200 Platform Correctable +0x00000400 Platform Uncorrectable non-fatal +0x00000800 Platform Uncorrectable fatal +``` + +The error injection mechanism is a two-step process. 
+ +- First select an error specified all necessary error parameters including`error_type`,`flags`,`param{1-4}`and `notrigger`,then write any integer to `error_inject` to inject the error. +- The second step performs some actions to trigger it. Setting `notrigger` to 1 skips the trigger phase, which may allow the user to cause the error in some other context by a simple access to the CPU, memory location, or device that is the target of the error injection. Setting `notrigger` to 0, the BIOS should trigger the error internally, e.g. by kicking the patrol scrubber. Whether this actually works depends on what operations the BIOS actually includes in the trigger phase. + +Please refer to the kernel document for more details about EINJ user interface format. + +#### Error Injection Examples with APEI Debugfs + +In this section, we show examples to inject errors with APEI Debugfs on Yitian 710. + +##### Processor Uncorrectable non-fatal + +```bash +APEI_IF=/sys/kernel/debug/apei/einj +echo 33 > $APEI_IF/param3 # APIC ID +echo 0x1 > $APEI_IF/flags +echo 0x00000002 > $APEI_IF/error_type +echo 1 > $APEI_IF/error_inject +``` + +The dmesg log: + +```bash +[ 1820.578688] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 +[ 1820.589434] {3}[Hardware Error]: event severity: recoverable +[ 1820.595078] {3}[Hardware Error]: precise tstamp: 2023-01-02 17:23:02 +[ 1820.601503] {3}[Hardware Error]: Error 0, type: recoverable +[ 1820.607147] {3}[Hardware Error]: section_type: ARM processor error +[ 1820.613485] {3}[Hardware Error]: MIDR: 0x00000000410fd490 +[ 1820.619041] {3}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081210000 +[ 1820.627723] {3}[Hardware Error]: running state: 0x1 +[ 1820.632759] {3}[Hardware Error]: Power State Coordination Interface state: 0 +[ 1820.639965] {3}[Hardware Error]: Error info structure 0: +[ 1820.645435] {3}[Hardware Error]: num errors: 1 +[ 1820.650037] {3}[Hardware Error]: error_type: 0, cache error +[ 1820.655854] {3}[Hardware Error]: error_info: 0x0000000000800015 +[ 1820.662019] {3}[Hardware Error]: transaction type: Instruction +[ 1820.668183] {3}[Hardware Error]: cache level: 2 +[ 1820.673045] {3}[Hardware Error]: the error has not been corrected +[ 1820.679470] {3}[Hardware Error]: type: CORE (0x41), ras_count:1 +[ 1820.685461] {3}[Hardware Error]: sub_type: 0x0 +[ 1820.689977] {3}[Hardware Error]: fr: 0x10a9a2, ctrl: 0x0, status: 0x44800007, addr: 0x800e9f716acea53d +[ 1820.699352] {3}[Hardware Error]: misc0: 0x4, misc1: 0x0, misc2: 0x0, misc3: 0x0 +``` + +##### Processor Uncorrectable fatal + +Script to inject and trigger processor uncorrectable fatal error. Note, a fatal error will cause the kernel to panic. 
+ +```bash +APEI_IF=/sys/kernel/debug/apei/einj +echo 33 > $APEI_IF/param3 # APIC ID +echo 0x1 > $APEI_IF/flags +echo 0x00000004 > $APEI_IF/error_type +echo 1 > $APEI_IF/error_inject +``` + +The dmesg log: + +```bash +[10862.838686] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3 +[10862.838687] {10}[Hardware Error]: event severity: fatal +[10862.838688] {10}[Hardware Error]: precise tstamp: 2023-01-02 19:53:43 +[10862.838688] {10}[Hardware Error]: Error 0, type: fatal +[10862.838688] {10}[Hardware Error]: section_type: ARM processor error +[10862.838689] {10}[Hardware Error]: MIDR: 0x00000000410fd490 +[10862.838689] {10}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081210000 +[10862.838689] {10}[Hardware Error]: running state: 0x1 +[10862.838690] {10}[Hardware Error]: Power State Coordination Interface state: 0 +[10862.838690] {10}[Hardware Error]: Error info structure 0: +[10862.838691] {10}[Hardware Error]: num errors: 1 +[10862.838691] {10}[Hardware Error]: error_type: 0, cache error +[10862.838691] {10}[Hardware Error]: error_info: 0x0000000000800015 +[10862.838692] {10}[Hardware Error]: transaction type: Instruction +[10862.838692] {10}[Hardware Error]: cache level: 2 +[10862.838693] {10}[Hardware Error]: the error has not been corrected +[10862.838693] {10}[Hardware Error]: type: CORE (0x41), ras_count:1 +[10862.838693] {10}[Hardware Error]: sub_type: 0x0 +[10862.838694] {10}[Hardware Error]: fr: 0x10a9a2, ctrl: 0x0, status: 0x74000007, addr: 0x800e9f716acea53d +[10862.838694] {10}[Hardware Error]: misc0: 0x4, misc1: 0x0, misc2: 0x0, misc3: 0x0 +[10862.838695] Kernel panic - not syncing: Fatal hardware error! +``` + +#### Memory + +##### Correctable + +Firstly, run a `victim` program in the background. The `victim` is one of the ras-tools which allocates a page in userspace and dumps the virtual and physical address of the page, and then holds on to trigger. + +```bash +#victim -d & +[1] 12472 +physical address of (0xffff87fb2000) = 0x89a0f8000 +Hit any key to trigger error: +[1]+ Stopped victim -d +``` + +Then run the bellow script to inject and trigger memory correct error. Note, the CE recovery is usually implemented as a threshold based error reporting mechanism. The default threshold for CE is 5000, in other words, the hardware only signal interrupt per 5000 CE errors. To test the feature, we configure CE threshold as 0. 
+ +```bash +echo 0x89a0f8000 > $APEI_IF/param1 +echo 0xfffffffffffff000 > $APEI_IF/param2 +echo 0x1 > $APEI_IF/flags +echo 0x00000008 > $APEI_IF/error_type +echo 1 > $APEI_IF/error_inject +``` + +The dmesg log: + +```bash +[ 1555.991595] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank_group:4 bank_address:2 device:0 row:616 column:1024 chip_id:0 page:0x89a0f8 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank_group:4 bank_address:2 device:0 row:616 column:1024 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 1555.991600] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 1555.991602] {1}[Hardware Error]: It has been corrected by h/w and requires no further action +[ 1555.991602] {1}[Hardware Error]: event severity: corrected +[ 1555.991604] {1}[Hardware Error]: precise tstamp: 2023-01-02 17:18:38 +[ 1555.991604] {1}[Hardware Error]: Error 0, type: corrected +[ 1555.991606] {1}[Hardware Error]: section_type: memory error +[ 1555.991606] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 1555.991607] {1}[Hardware Error]: physical_address: 0x000000089a0f8000 +[ 1555.991608] {1}[Hardware Error]: node:0 card:0 module:0 rank:0 bank_group:4 bank_address:2 device:0 row:616 column:1024 chip_id:0 +[ 1555.991609] {1}[Hardware Error]: error_type: 4, single-symbol chipkill ECC +[ 1555.991610] {1}[Hardware Error]: type: DDR (0x50), ras_count:1 +[ 1555.991611] {1}[Hardware Error]: sub_type: 0x0 +[ 1555.991612] {1}[Hardware Error]: fr: 0x1000200000022, ctrl: 0x0, status: 0x0, addr: 0x0 +[ 1555.991612] {1}[Hardware Error]: misc0: 0x0, misc1: 0x0, misc2: 0x200000000000000, misc3: 0x900000000000000 +``` + +##### Memory UnCorrectable Non-fatal + +Firstly, run a `victim` program in the background as the last section described. + +```bash +#victim -d & +physical address of (0xffff962d0000) = 0x9f8acb000 +Hit any key to trigger error: +[1]+ Stopped victim -d +``` + +Then run the bellow script to inject and trigger memory correct error. Here, we specify `notrigger` to 0 to let the firmware kick the DDRC scrubber to trigger the error. 
+ +```bash +APEI_IF=/sys/kernel/debug/apei/einj +echo 0x400a4919000 > $APEI_IF/param1 +echo 0xfffffffffffff000 > $APEI_IF/param2 +echo 0x1 > $APEI_IF/flags +echo 0x00000010 > $APEI_IF/error_type +echo 0x0 > $APEI_IF/notrigger +echo 1 > $APEI_IF/error_inject +``` + +The dmesg log: + +```bash +[ 211.121855] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 211.132646] {1}[Hardware Error]: event severity: recoverable +[ 211.138292] {1}[Hardware Error]: precise tstamp: 2022-12-30 15:26:40 +[ 211.144717] {1}[Hardware Error]: Error 0, type: recoverable +[ 211.150362] {1}[Hardware Error]: section_type: memory error +[ 211.156096] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 211.165125] {1}[Hardware Error]: physical_address: 0x00000400a4919000 +[ 211.171725] {1}[Hardware Error]: node:0 card:7 module:0 rank:0 bank_group:7 bank_address:0 device:0 row:146 column:1152 chip_id:0 +[ 211.183619] {1}[Hardware Error]: error_type: 14, scrub uncorrected error +[ 211.190479] {1}[Hardware Error]: type: DDR (0x50), ras_count:1 +[ 211.196383] {1}[Hardware Error]: sub_type: 0x0 +[ 211.200899] {1}[Hardware Error]: fr: 0x1000200000353, ctrl: 0x0, status: 0x0, addr: 0x0 +[ 211.208974] {1}[Hardware Error]: misc0: 0x0, misc1: 0x0, misc2: 0x0, misc3: 0x200000000000500 +[ 211.218375] Memory failure: 0x400a4919: recovery action for dirty LRU page: Recovered +``` + +At this point, the allocated physical page is unmapped and poisoned, any read access will trigger a page fault. +If we move the background process `victim`on current Linux shell to the foreground and hit any key, the victim will trigger a page fault and receive a SIGBUS signal due to the poisoned PTE entry. Because the `victim`process does not register the SIGBUS handler, it will be killed. + +```bash +#fg +victim -d + +Access time at Fri Dec 30 15:38:14 2022 + +Bus error +``` + +We can also specify `notrigger` to 1 to let the firmware skip the trigger phase and allow the `victim` process to access the target of the error injection so that the error will be detected in execution context. + +Firstly, select a page and inject an error to it, while explicitly skipping the firmware trigger phase. + +```bash +#victim -d & +[1] 9522 +physical address of (0xffffaed6d000) = 0x400aa6dd000 +Hit any key to trigger error: +[1]+ Stopped victim -d + +APEI_IF=/sys/kernel/debug/apei/einj + +echo 0x400aa6dd000 > $APEI_IF/param1 +echo 0xfffffffffffff000 > $APEI_IF/param2 +echo 0x1 > $APEI_IF/flags +echo 0x00000010 > $APEI_IF/error_type +echo 0x1 > $APEI_IF/notrigger +echo 1 > $APEI_IF/error_inject +``` + +Then move the background process `victim` on current Linux shell to the foreground and hit any key, so that the error will be triggered in execution context. The kernel will poison the page and unmap it, then send SIGBUS to the process which accesses the page. 
+ +```bash +#fg +victim -d + +Access time at Fri Dec 30 15:39:26 2022 + +Bus error +``` + +The dmesg log: + +```bash +[ 799.958832] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 799.969533] {3}[Hardware Error]: event severity: recoverable +[ 799.975179] {3}[Hardware Error]: precise tstamp: 2022-12-30 15:36:29 +[ 799.981603] {3}[Hardware Error]: Error 0, type: recoverable +[ 799.987248] {3}[Hardware Error]: section_type: memory error +[ 799.992978] {3}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 800.002007] {3}[Hardware Error]: physical_address: 0x00000400aa6dd000 +[ 800.008607] {3}[Hardware Error]: node:0 card:5 module:0 rank:1 bank_group:1 bank_address:0 device:0 row:169 column:1664 chip_id:0 +[ 800.020500] {3}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 800.027446] {3}[Hardware Error]: type: DDR (0x50), ras_count:1 +[ 800.033351] {3}[Hardware Error]: sub_type: 0x0 +[ 800.037866] {3}[Hardware Error]: fr: 0x1001000100000000, ctrl: 0xf000000000920004, status: 0xd800000Cor0040](0xadd040000d0receiveaecntr=526(d1.subch3), cnt=0x1 +[ 800.060436] {3}[Hardware Error]: misc0: 0x3f00000000040307, misc1: 0xd00000000030cd18, misc2: 0x4015, misc3: 0x200000000000100 +[ 800.072366] Memory failure: 0x400aa6dd: recovery action for dirty LRU page: Recovered +``` + +### RAS-tools + +We can also test and validate RAS features of whole system stack across hardware, firmware and OS via ras-tools. Ras-tools are an excellent set of tools to inject and test RAS ability on X86 and Arm64 platforms based on the APEI EINJ interface. + +| tools | fatal | arch | Description | Usage | +| ----------- | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------ | +| einj_mem_uc | See help | x86、Arm | inject an error and then trigger it in one of a variety of ways. | ./einj_mem_uc # See help for testname | +| cmcistorm | No | x86 | use EINJ to inject a bunch of soft errors, then consume them all as fast as possible. | ./cmcistorm # e.g./cmcistorm 20 1 | +| hornet | No | x86、Arm | Start a process (or point to an existing one) and inject an uncorrectable memory error to a targeted or randomly chosen memory address | ./hornet -p PID | +| lmce | No | x86 | local mce | ./lmce | +| mca-recover | No | x86、Arm | Set up to get zapped by a machine check (injected elsewhere) recovery function reports physical address of new page - so we can inject to that and repeat over and over. | ./mca-recover | +| rep_ce_page | No | x86、Arm | loop using EINJ to inject a soft error, consuming after each until the page is taken offline. | ./rep_ce_page | +| vtop | No | x86、Arm | Given a process if and virtual address, dig around in /proc/id/pagemap to find the physical address (if present) behind the virtual one. | ./vtop | +| memattr | No | Arm | Example of the Linux kernel driver that allows a user-space program to mmap a buffer of contiguous physical memory with specific memory attribute. | cd pgprot-drv
make
insmod pgprot_drv.ko pgprot=4
../memattr|
+| ras-tolerance | No | Arm | This driver allows overriding the reported error severity with a lower level at runtime (recoverable by default). It is useful for testing. | cd ras-tolerance
make
insmod ras_tolerance.ko| + +#### Install + +On servers running Anolis OS, you can install ras-tools through `yum`. On other OSes, you could build it from scratch. + +``` bash +yum install ras-tools +``` + +#### Memory Failure Recovery Validation + +The `einj_mem_uc` tool allocates pages, injects an error and then triggers it in one of a variety of ways. It intends to do a coverage test for testing the Linux RAS related features, including CPU/Memory error containment and recovery. + +##### AR Validation + +###### User Space AR-data Recovery + +In the case of an AR-data abort event e.g. `single`, `doube`,`split`,`hugetlb`,etc, the kernel will attempt to hard-offline the page, by poisoning the page and killing accessing process. For example, `single` case, it injects an uncorrected error and triggers the error by reading a byte. + +```bash +# einj_mem_uc single +0: single vaddr = 0xffff857a3400 paddr = 8e6157400 +injecting ... +triggering ... +signal 7 code 4 addr 0xffff857a3400 +page not present +Test passed +``` + +`einj_mem_uc` will print the received signal and its code, in the above case, + +- signal 7: SIGBUS +- code 4: BUS_MCEERR_AR 4 + +The dmesg log: + +```bash +[ 1785.908893] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank_group:1 bank_address:2 device:0 row:920 column:896 chip_id:0 page:0x8e6157 offset:0x400 grain:1 - APEI location: node:0 card:0 module:0 rank:0 bank_group:1 bank_address:2 device:0 row:920 column:896 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 1785.908900] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 1785.919531] {1}[Hardware Error]: event severity: recoverable +[ 1785.925176] {1}[Hardware Error]: precise tstamp: 2023-01-17 18:05:09 +[ 1785.931600] {1}[Hardware Error]: Error 0, type: recoverable +[ 1785.937244] {1}[Hardware Error]: section_type: memory error +[ 1785.942975] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 1785.952004] {1}[Hardware Error]: physical_address: 0x00000008e6157400 +[ 1785.958603] {1}[Hardware Error]: node:0 card:0 module:0 rank:0 bank_group:1 bank_address:2 device:0 row:920 column:896 chip_id:0 +[ 1785.970409] {1}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 1785.977355] {1}[Hardware Error]: type: DDR (0x50), common_reg_nr:1 +[ 1785.983606] {1}[Hardware Error]: Synchronous Exception taken in EL0 +[ 1785.989944] {1}[Hardware Error]: ESR: 0x92000410, ELR: 0x403abc, FAR: 0xfa00a88, SCR: 0x403073d, SCTLR: 0x30cd183f, LR: 0x403abc +[ 1786.001578] {1}[Hardware Error]: ECCERRCNT: 0x10000, ECCSTAT: 0x0, ADVECCSTAT: 0x8000002, ECCSYMBOL: 0x170000, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0x398, ECCUADDR1: 0x1020380, ECCUDATA0: 0x1ff, ECCUDATA1: 0x0 +[ 1786.036640] Memory failure: 0x8e6157: recovery action for dirty LRU page: Recovered + +``` + +###### User Space AR-instruction Recovery + +In the case of an AR-instruction abort event, e.g. `instr`, it injects an uncorrected error and triggers the error by reading a byte. The kernel will attempt to hard-offline the page. It unmaps the corrupted page, reloads the 4KB page containing the instruction to a new physical page and resumes normal operation. + +```bash +# einj_mem_uc instr +0: instr vaddr = 0x403000 paddr = 8bba93000 +injecting ... +triggering ... 
+Test passed +``` + +The dmesg log: + +```bash +[ 1945.804589] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:7 module:0 rank:1 bank_group:1 bank_address:3 device:0 row:527 column:640 chip_id:0 page:0x40883e65 offset:0x0 grain:1 - APEI location: node:0 card:7 module:0 rank:1 bank_group:1 bank_address:3 device:0 row:527 column:640 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 1945.804596] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 1945.815209] {3}[Hardware Error]: event severity: recoverable +[ 1945.820854] {3}[Hardware Error]: precise tstamp: 2023-01-17 18:07:49 +[ 1945.827280] {3}[Hardware Error]: Error 0, type: recoverable +[ 1945.832924] {3}[Hardware Error]: section_type: memory error +[ 1945.838654] {3}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 1945.847683] {3}[Hardware Error]: physical_address: 0x0000040883e65000 +[ 1945.854283] {3}[Hardware Error]: node:0 card:7 module:0 rank:1 bank_group:1 bank_address:3 device:0 row:527 column:640 chip_id:0 +[ 1945.866089] {3}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 1945.873035] {3}[Hardware Error]: type: DDR (0x50), common_reg_nr:1 +[ 1945.879286] {3}[Hardware Error]: Synchronous Exception taken in EL0 +[ 1945.885625] {3}[Hardware Error]: ESR: 0x82000010, ELR: 0x403000, FAR: 0x403000, SCR: 0x403073d, SCTLR: 0x30cd183f, LR: 0x403f94 +[ 1945.906459] {3}[Hardware Error]: ECCERRCNT: 0x10000, ECCSTAT: 0x0, ADVECCSTAT: 0x8000002, ECCSYMBOL: 0x140000, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0x100020f, ECCUADDR1: 0x1030280, ECCUDATA0: 0x1ff, ECCUDATA1: 0x0 +[ 1945.934071] Memory failure: 0x40883e65: corrupted page was clean: dropped without side effects +[ 1945.934084] Memory failure: 0x40883e65: recovery action for clean LRU page: Recovered +``` + +###### Kernel Space AR Recovery + +Kernel Space AR Recovery is only supported on X86 platform and we are still working on it on Arm64 platform. The recovery is evaluated on X86 icelake processor. + +First, inject an uncorrected error and trigger it by writing a buffer to a file. Kernel will copy data from user space and then write to disk. + +```bash +# einj_mem_uc copyin -f +0: copyin vaddr = 0x7f8f873e2400 paddr = 2869c1400 +injecting ... +triggering ... +einj_mem_uc: couldn't write temp file (errno=14) +Big surprise ... still running. Thought that would be fatal +Saw local machine check +Test passed +``` + +As we can see, the process is still running and the return errno for the write(2) is EFAULT(14). + +The dmesg log: + +```bash +SetMemoryDeviceStatus UCE error. 
Data = 00 4C A5 01 02 00 06 01 05 00 00 00 00 00 Status = Success +[15322.535921] mce: Kernel accessed poison in user space at 2869c1400 +[15322.536023] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 +[15322.542117] Memory failure: 0x2869c1: recovery action for dirty LRU page: Recovered +[15322.550382] {2}[Hardware Error]: event severity: recoverable +[15322.550385] {2}[Hardware Error]: Error 0, type: recoverable +[15322.558042] Memory failure: 0x2869c1: already hardware poisoned +[15322.563710] {2}[Hardware Error]: fru_text: Card02, ChnF, DIMM0 +[15322.563712] {2}[Hardware Error]: section_type: memory error +[15322.586981] {2}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[15322.596027] {2}[Hardware Error]: physical_address: 0x00000002869c1400 +[15322.602650] {2}[Hardware Error]: node:1 card:5 module:0 rank:0 bank:13 device:0 row:2075 column:8 +[15322.611783] {2}[Hardware Error]: error_type: 3, multi-bit ECC +[15322.617710] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 +[15322.625304] Memory failure: 0x2869c1: already hardware poisoned +[15322.631827] EDAC MC6: 1 UE memory read error on CPU_SrcID#1_MC#2_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2869c1 offset:0x400 grain:32 - err_code:0x00a0:0x0091 SystemAddress:0x2869c1400 ProcessorSocketId:0x1 MemoryControllerId:0x2 ChannelAddress:0x2069c000 ChannelId:0x1 RankAddress:0x1034e000 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x81b Column:0x8 Bank:0x1 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) +[15322.667403] EDAC MC6: 1 UE memory read error on CPU_SrcID#1_MC#2_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2869c1 offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x2869c1400 ProcessorSocketId:0x1 MemoryControllerId:0x2 ChannelAddress:0x2069c000 ChannelId:0x1 RankAddress:0x1034e000 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x81b Column:0x8 Bank:0x1 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) +``` + +futex(2) is another system call in which kernel copies data from user space. Inject an uncorrected error and trigger it by issuing `FUTEX_WAIT` operation. + +```bash +# einj_mem_uc futex -f +0: futex vaddr = 0x7f8a1da83400 paddr = 25751d400 +injecting ... +triggering ... +futex returned with errno=14 +Big surprise ... still running. Thought that would be fatal +Unusual number of MCEs seen: 2 +Test passed +``` + +There are many retries in futex(2) mechanism, so it is possible to see many MCEs. + +The dmesg log: + +```bash +SetMemoryDeviceStatus UCE error. 
Data = 00 4C A5 01 02 00 06 01 05 00 00 00 00 00 Status = Success +[15521.242381] mce: Kernel accessed poison in user space at 25751d400 +[15521.242437] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 +[15521.248581] Memory failure: 0x25751d: recovery action for dirty LRU page: Recovered +[15521.256842] {4}[Hardware Error]: event severity: recoverable +[15521.256845] {4}[Hardware Error]: Error 0, type: recoverable +[15521.256847] {4}[Hardware Error]: fru_text: Card02, ChnF, DIMM0 +[15521.264506] Memory failure: 0x25751d: already hardware poisoned +[15521.270172] {4}[Hardware Error]: section_type: memory error +[15521.270173] {4}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[15521.270174] {4}[Hardware Error]: physical_address: 0x000000025751d400 +[15521.309103] {4}[Hardware Error]: node:1 card:5 module:0 rank:0 bank:4 device:0 row:1882 column:896 +[15521.318322] {4}[Hardware Error]: error_type: 3, multi-bit ECC +[15521.324252] {4}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 +[15521.331824] {4}[Hardware Error]: Error 1, type: recoverable +[15521.337484] {4}[Hardware Error]: section_type: memory error +[15521.343240] {4}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[15521.352286] {4}[Hardware Error]: physical_address: 0x000000025751d400 +[15521.358910] {4}[Hardware Error]: node:1 +[15521.363017] {4}[Hardware Error]: error_type: 3, multi-bit ECC +[15521.369040] Memory failure: 0x25751d: already hardware poisoned +[15521.374974] Memory failure: 0x25751d: already hardware poisoned +[15521.381515] EDAC MC6: 1 UE memory read error on CPU_SrcID#1_MC#2_Chan#1_DIMM#0 (channel:1 slot:0 page:0x25751d offset:0x400 grain:32 - err_code:0x00a0:0x0091 SystemAddress:0x25751d400 ProcessorSocketId:0x1 MemoryControllerId:0x2 ChannelAddress:0x1d751c00 ChannelId:0x1 RankAddress:0xeba9c00 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x75a Column:0x380 Bank:0x0 BankGroup:0x1 ChipSelect:0x0 ChipId:0x0) +[15521.417060] EDAC MC6: 1 UE memory read error on CPU_SrcID#1_MC#2_Chan#1_DIMM#0 (channel:1 slot:0 page:0x25751d offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x25751d400 ProcessorSocketId:0x1 MemoryControllerId:0x2 ChannelAddress:0x1d751c00 ChannelId:0x1 RankAddress:0xeba9c00 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x75a Column:0x380 Bank:0x0 BankGroup:0x1 ChipSelect:0x0 ChipId:0x0) +[15521.452740] EDAC MC6: 1 UE memory read error on CPU_SrcID#1_MC#2_Chan#1_DIMM#0 (channel:1 slot:0 page:0x25751d offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x25751d400 ProcessorSocketId:0x1 MemoryControllerId:0x2 ChannelAddress:0x1d751c00 ChannelId:0x1 RankAddress:0xeba9c00 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x75a Column:0x380 Bank:0x0 BankGroup:0x1 ChipSelect:0x0 ChipId:0x0) +``` + +##### AO Validation + +###### AO Patrol Recovery + +In the case of an AO event e.g. `patrol`, the kernel will attempt to hard-offline the page, by just poisoning and unmapping the page. Inject and trigger patrol error. Note, in this section, the HWPoison-aware strategy is default late kill. + +```bash +# einj_mem_uc patrol +0: patrol vaddr = 0xffff9d523400 paddr = 400a2575400 +injecting ... +triggering ... 
+page not present +Test passed +``` + +The dmesg log: + +```bash +[ 2026.290450] EDAC MC0: 1 UE scrub uncorrected error on unknown memory (node:0 card:6 module:0 rank:0 bank_group:2 bank_address:3 device:0 row:137 column:640 chip_id:0 page:0x400a2575 offset:0x400 grain:1 - APEI location: node:0 card:6 module:0 rank:0 bank_group:2 bank_address:3 device:0 row:137 column:640 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 2026.290460] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 2026.301258] {4}[Hardware Error]: event severity: recoverable +[ 2026.306903] {4}[Hardware Error]: precise tstamp: 2023-01-17 18:09:10 +[ 2026.313328] {4}[Hardware Error]: Error 0, type: recoverable +[ 2026.318972] {4}[Hardware Error]: section_type: memory error +[ 2026.324703] {4}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 2026.333732] {4}[Hardware Error]: physical_address: 0x00000400a2575400 +[ 2026.340331] {4}[Hardware Error]: node:0 card:6 module:0 rank:0 bank_group:2 bank_address:3 device:0 row:137 column:640 chip_id:0 +[ 2026.352138] {4}[Hardware Error]: error_type: 14, scrub uncorrected error +[ 2026.358998] {4}[Hardware Error]: type: DDR (0x50), common_reg_nr:1 +[ 2026.365249] {4}[Hardware Error]: Interrupt: 843 +[ 2026.369852] {4}[Hardware Error]: ECCERRCNT: 0x40000, ECCSTAT: 0x0, ADVECCSTAT: 0x88000002, ECCSYMBOL: 0xec0000, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0x89, ECCUADDR1: 0x2030280, ECCUDATA0: 0x1ff, ECCUDATA1: 0x0 +[ 2026.397264] Memory failure: 0x400a2575: recovery action for dirty LRU page: Recovered +``` + +###### AO Prefetch Recovery + +First, inject an uncorrected error and trigger it by explicitly performing a `prfm`. The platform will signal an interrupt. + +```bash +#einj_mem_uc prefetch +0: prefetch vaddr = 0xffffbe03f400 paddr = 8c17eb400 +injecting ... +triggering ... +page not present +Test passed +``` + +The dmesg log: + +```bash +[ 7616.802823] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 7616.813922] {1}[Hardware Error]: event severity: recoverable +[ 7616.819566] {1}[Hardware Error]: Error 0, type: recoverable +[ 7616.825210] {1}[Hardware Error]: section_type: memory error +[ 7616.830940] {1}[Hardware Error]: error_status: 0x0000000000000400 +[ 7616.837191] {1}[Hardware Error]: physical_address: 0x00000008c17eb400 +[ 7616.843791] {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 1 bank_group: 3 bank_address: 0 device: 0 row: 773 column: 1408 +[ 7616.855597] {1}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 7616.862543] {1}[Hardware Error]: type: DDR (0x50), ras_count:1 +[ 7616.868447] {1}[Hardware Error]: sub_type: 0x0 +[ 7616.872962] {1}[Hardware Error]: fr: 0x1000200000026, ctrl: 0x0, status: 0x0, addr: 0x0 +[ 7616.881036] {1}[Hardware Error]: misc0: 0x0, misc1: 0x0, misc2: 0x0, misc3: 0x200000000000100 +[ 7616.889888] Memory failure: 0x8c17eb: recovery action for dirty LRU page: Recovered +``` + +###### AO Store Recovery + +First, inject an uncorrected error and trigger it by writing a byte. The write size is less than 64 bits and the platform will signal a SError. + +```bash +# einj_mem_uc strbyte +0: strbyte vaddr = 0xffffa3651400 paddr = 400afd01400 +injecting ... +triggering ... 
+page not present +Test passed +``` + +The dmesg log: + +```bash +[ 2378.241939] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:5 module:0 rank:0 bank_group:2 bank_address:1 device:0 row:191 column:128 chip_id:0 page:0x400afd01 offset:0x400 grain:1 - APEI location: node:0 card:5 module:0 rank:0 bank_group:2 bank_address:1 device:0 row:191 column:128 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 2378.241945] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 2378.252573] {5}[Hardware Error]: event severity: recoverable +[ 2378.258217] {5}[Hardware Error]: precise tstamp: 2023-01-17 18:15:02 +[ 2378.264642] {5}[Hardware Error]: Error 0, type: recoverable +[ 2378.270286] {5}[Hardware Error]: section_type: memory error +[ 2378.276017] {5}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 2378.285045] {5}[Hardware Error]: physical_address: 0x00000400afd01400 +[ 2378.291644] {5}[Hardware Error]: node:0 card:5 module:0 rank:0 bank_group:2 bank_address:1 device:0 row:191 column:128 chip_id:0 +[ 2378.303451] {5}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 2378.310398] {5}[Hardware Error]: type: DDR (0x50), common_reg_nr:1 +[ 2378.316649] {5}[Hardware Error]: SError +[ 2378.320558] {5}[Hardware Error]: ECCERRCNT: 0x10000, ECCSTAT: 0x0, ADVECCSTAT: 0x8000002, ECCSYMBOL: 0x6f0000, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0xbf, ECCUADDR1: 0x2010080, ECCUDATA0: 0x1ff, ECCUDATA1: 0x0 +[ 2378.360399] Memory failure: 0x400afd01: recovery action for dirty LRU page: Recovered +``` + +In contrast, inject an uncorrected error and trigger it by writing a quad word. The write size is 64 bits and the platform will not signal SErrors. + +```bash +# einj_mem_uc strqword +0: strqword vaddr = 0xffff991b5400 paddr = 92b73c400 +injecting ... +triggering ... +Manually take page offline +Test passed +``` + +The dmesg log: + +```bash +[270286.564242] Memory failure: 0x92b73c: recovery action for dirty LRU page: Recovered +``` + +##### QEMU Validation + +First, start a VM with a stdio monitor which allows giving complex commands to the QEMU emulator. + +```bash +qemu-system-aarch64 -enable-kvm \ + -cpu host \ + -M virt,gic-version=3 \ + -m 8G \ + -d guest_errors \ + -rtc base=localtime,clock=host \ + -smp cores=2,threads=2,sockets=2 \ + -object memory-backend-ram,id=mem0,size=4G \ + -object memory-backend-ram,id=mem1,size=4G \ + -numa node,memdev=mem0,cpus=0-3,nodeid=0 \ + -numa node,memdev=mem1,cpus=4-7,nodeid=1 \ + -bios /usr/share/AAVMF/AAVMF_CODE.fd \ + -drive driver=qcow2,media=disk,cache=writeback,if=virtio,id=alinu1_rootfs,file=/media/nvme/shawn.xs/qemu/aliyun_3_arm64_20G_alpha_alibase_20210425.qcow2 \ + -netdev user,id=n1,hostfwd=tcp::5555-:22 \ + -serial telnet:localhost:4321,server,nowait \ + -device virtio-net-pci,netdev=n1 \ + -monitor stdio +QEMU 7.2.0 monitor - type 'help' for more information +(qemu) VNC server running on 127.0.0.1:5900 +``` + +Login guest and install ras-tools, then run `einj_mem_uc` to allocate a page in userspace, dumps the virtual and physical address of the page. The `-j` is to skip error injection and `-k` is to wait for a kick. + +``` bash +$ einj_mem_uc single -j -k +0: single vaddr = 0xffffb2f27000 paddr = 154aba000 +``` + +Run command `gpa2hpa` in QEMU monitor and it will print the host physical address at which the guest’s physical address addr is mapped. 
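+The guest physical address to query is the `paddr` that `einj_mem_uc` printed inside the guest (the lookup below is for `0x154aba000`). As a side note, the following is a minimal, illustrative sketch of how a userspace tool can derive such a physical address itself by reading `/proc/self/pagemap`; the file name `pagemap_paddr.c` and the helper `vaddr_to_paddr` are ours for illustration (not taken from ras-tools), and the program must run as root, otherwise the kernel reports the PFN as 0.
+
+```c
+/*
+ * pagemap_paddr.c - translate a virtual address of the calling process
+ * to a physical address via /proc/self/pagemap (illustrative sketch).
+ *
+ * Build: gcc -o pagemap_paddr pagemap_paddr.c ; run as root.
+ */
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static uint64_t vaddr_to_paddr(void *vaddr)
+{
+    long pagesize = sysconf(_SC_PAGESIZE);
+    uint64_t entry = 0;
+    /* each page has one 8-byte pagemap entry, indexed by virtual page number */
+    off_t offset = ((uintptr_t)vaddr / pagesize) * sizeof(entry);
+    int fd = open("/proc/self/pagemap", O_RDONLY);
+
+    if (fd < 0 || pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
+        perror("pagemap");
+        exit(1);
+    }
+    close(fd);
+
+    if (!(entry & (1ULL << 63)))        /* bit 63: page present */
+        return 0;
+    /* bits 0-54: page frame number */
+    return (entry & ((1ULL << 55) - 1)) * pagesize + (uintptr_t)vaddr % pagesize;
+}
+
+int main(void)
+{
+    long pagesize = sysconf(_SC_PAGESIZE);
+    char *buf = aligned_alloc(pagesize, pagesize);
+
+    memset(buf, 0x5a, pagesize);        /* touch the page so it is mapped */
+    printf("vaddr = %p paddr = %llx\n", buf,
+           (unsigned long long)vaddr_to_paddr(buf));
+    return 0;
+}
+```
+
+The `gpa2hpa` monitor command then maps that guest physical address to the host physical address used for error injection.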
+ +``` bash +(qemu) gpa2hpa 0x151f21400 +Host physical address for 0x154aba000 (mem1) is 0x92b3c5000 +``` + +Inject an uncorrected error via the APEI interface to the finally translated host physical address on host. + +``` bash +echo 0x92b3c5000 > /sys/kernel/debug/apei/einj/param1 +echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2 +echo 0x0 > /sys/kernel/debug/apei/einj/flags +echo 0x10 > /sys/kernel/debug/apei/einj/error_type +echo 1 > /sys/kernel/debug/apei/einj/notrigger +echo 1 > /sys/kernel/debug/apei/einj/error_inject +``` + +Then kick `einj_mem_uc` to trigger the error by writing "trigger_start". In this example, the kick is done on host. + +``` bash +#ssh -p 5555 root@localhost "echo trigger > ~/trigger_start" +``` + +We will observe that the QEMU process exit. + +``` bash +(qemu) qemu-system-aarch64: Hardware memory error! +``` + +The dmesg log: + +``` bash +[ 2705.654424] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0 page:0x92b3c5 offset:0x0 grain:1 - APEI location: node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 2705.654432] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 2705.665047] {6}[Hardware Error]: event severity: recoverable +[ 2705.670692] {6}[Hardware Error]: precise tstamp: 2023-01-17 18:20:29 +[ 2705.677118] {6}[Hardware Error]: Error 0, type: recoverable +[ 2705.682762] {6}[Hardware Error]: section_type: memory error +[ 2705.688492] {6}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 2705.697521] {6}[Hardware Error]: physical_address: 0x000000092b3c5000 +[ 2705.704121] {6}[Hardware Error]: node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0 +[ 2705.716014] {6}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC +[ 2705.722960] {6}[Hardware Error]: type: DDR (0x50), common_reg_nr:1 +[ 2705.729212] {6}[Hardware Error]: Synchronous Exception taken in EL0 +[ 2705.735551] {6}[Hardware Error]: ESR: 0x92000410, ELR: 0x401880, FAR: 0xffffb2e8c1d8, SCR: 0x403073d, SCTLR: 0x30cd183f, LR: 0x401840 +[ 2705.747619] {6}[Hardware Error]: ECCERRCNT: 0x10000, ECCSTAT: 0x0, ADVECCSTAT: 0x8000002, ECCSYMBOL: 0x60000, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0x10004ac, ECCUADDR1: 0x4020280, ECCUDATA0: 0x1ff, ECCUDATA1: 0x0 +[ 2705.887179] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0 page:0x92b3c5 offset:0x0 grain:1 - APEI location: node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory) +[ 2705.887181] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 +[ 2705.897824] {7}[Hardware Error]: event severity: recoverable +[ 2705.903468] {7}[Hardware Error]: precise tstamp: 2023-01-17 18:20:29 +[ 2705.909893] {7}[Hardware Error]: Error 0, type: recoverable +[ 2705.915537] {7}[Hardware Error]: section_type: memory error +[ 2705.921267] {7}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400) +[ 2705.930296] {7}[Hardware Error]: physical_address: 0x000000092b3c5000 +[ 2705.936895] {7}[Hardware 
Error]: node:0 card:0 module:0 rank:1 bank_group:4 bank_address:2 device:0 row:1196 column:640 chip_id:0
+[ 2705.948790] {7}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC
+[ 2705.955736] {7}[Hardware Error]: type: DDR (0x50), common_reg_nr:1
+[ 2705.961988] {7}[Hardware Error]: Synchronous Exception taken in EL0
+[ 2705.968326] {7}[Hardware Error]: ESR: 0x92000410, ELR: 0x401880, FAR: 0xffffb2e8c1d8, SCR: 0x403073d, SCTLR: 0x30cd183f, LR: 0x401840
+[ 2705.980394] {7}[Hardware Error]: ECCERRCNT: 0x0, ECCSTAT: 0x0, ADVECCSTAT: 0x0, ECCSYMBOL: 0x0, ECCERRCNTSTAT: 0x0, ECCERRCNT0: 0x0, ECCERRCNT1: 0x0, ECCCADDR0: 0x0, ECCCADDR1: 0x0, ECCCDATA0: 0x0, ECCCDATA1: 0x0, ECCUADDR0: 0x10004ac, ECCUADDR1: 0x4020280, ECCUDATA0: 0x0, ECCUDATA1: 0x0
+[ 2706.006235] Memory failure: 0x92b3c5: Sending SIGBUS to qemu-system-aar:32293 due to hardware memory corruption
+[ 2706.078549] Memory failure: 0x92b3c5: recovery action for dirty LRU page: Recovered
+[ 2706.092539] Memory failure: 0x92b3c5: already hardware poisoned
+[ 2706.118501] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank_group:1 bank_address:2 device:0 row:920 column:896 chip_id:0 page:0x0 offset:0x0 grain:1 - APEI location: node:0 card:0 module:0 rank:0 bank_group:1 bank_address:2 device:0 row:920 column:896 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
+```
+
+Note that QEMU registers a SIGBUS handler and sets `PR_MCE_KILL_EARLY` via `prctl`. When an AO error occurs, e.g. one detected by the scrubber, the kernel also sends SIGBUS, but with si_code `BUS_MCEERR_AO 5`.
+
+##### HWPoison-aware Strategy
+
+First, check the strategy currently configured on your system.
+
+```bash
+#sysctl vm.memory_failure_early_kill
+vm.memory_failure_early_kill = 0
+```
+
+Change to early kill mode:
+
+```bash
+#sysctl -w vm.memory_failure_early_kill=1
+vm.memory_failure_early_kill = 1
+```
+
+Then inject a `patrol` error to observe the kernel behavior.
+
+```bash
+#./einj_mem_uc patrol
+0: patrol vaddr = 0xffffbe4b8400 paddr = 901656400
+injecting ...
+triggering ...
+signal 7 code 5 addr 0xffffbe4b8000
+Unexpected SIGBUS
+page not present
+Test passed
+```
+
+As expected, the kernel sends SIGBUS to kill the process even though the process never accesses the poisoned data. The `code 5` here means `BUS_MCEERR_AO 5`.
+
+#### Memory Predictive Failure Analysis Validation
+
+First of all, install **rasdaemon**; it is packaged for most Linux distributions:
+
+```bash
+yum install rasdaemon
+```
+
+Then set up **rasdaemon** to launch at startup and to record events to an on-disk SQLite database.
+
+```bash
+# systemctl enable rasdaemon
+# systemctl start rasdaemon
+```
+
+Here, we manually set `PAGE_CE_THRESHOLD="5"` in the config file `/etc/sysconfig/rasdaemon` so that we can inject errors and exceed the page error threshold more easily. Note that run-time configuration is not supported; the service must be restarted.
+
+```bash
+# systemctl restart rasdaemon
+```
+
+Run `victim` with the `-p` option to test the PFA function. `victim` allocates a page in userspace, dumps the virtual and physical address of the page, and checks the physical address in a loop. Then inject errors at that physical address 5 times; once the page error threshold is exceeded, the kernel soft-offlines the old page by migrating its contents to a new page.
+
+```bash
+#victim -d -p
+physical address of (0xffffa5a66000) = 0x967cf1000
+Page was replaced. 
New physical address = 0x8bce3e000 +``` + +## Acknowledgment + +Thanks to the developers who contributed to the Linux and Anolis communities. + +## Reference + +1. [https://www.intel.com/content/www/us/en/developer/articles/technical/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html](https://www.intel.com/content/www/us/en/developer/articles/technical/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html) +2. Reliability, Availability and Serviceability (RAS) Integration and Validation Guide for the Intel® Xeon® Processor E7- v3 Family: [https://www.intel.com/content/dam/develop/external/us/en/documents/emca2-integration-validation-guide-556978.pdf](https://www.intel.com/content/dam/develop/external/us/en/documents/emca2-integration-validation-guide-556978.pdf) +3. [https://docs.kernel.org/admin-guide/ras.html](https://docs.kernel.org/admin-guide/ras.html) +4. [https://static.linaro.org/connect/sfo17/Presentations/SFO17-203%20-%20Reliability%2C%20Availability%2C%20and%20Serviceability%20%28RAS%29%20on%20ARM64%20status.pdf](https://static.linaro.org/connect/sfo17/Presentations/SFO17-203%20-%20Reliability%2C%20Availability%2C%20and%20Serviceability%20%28RAS%29%20on%20ARM64%20status.pdf) +5. Intel® 64 and IA-32 Architectures Software Developer’s Manual +6. [https://developer.ibm.com/articles/l-kernel-memory-access/](https://developer.ibm.com/articles/l-kernel-memory-access/) +7. [https://docs.kernel.org/admin-guide/sysctl/vm.html#memory-failure-early-kill](https://docs.kernel.org/admin-guide/sysctl/vm.html#memory-failure-early-kill) +8. Programming persistent memory: A comprehensive guide for developers +9. [https://trustedfirmware-a.readthedocs.io/en/latest/components/sdei.html](https://trustedfirmware-a.readthedocs.io/en/latest/components/sdei.html#id2) \ No newline at end of file diff --git a/PRODUCT_DOCS/test/test2.md b/PRODUCT_DOCS/test/test2.md new file mode 100644 index 0000000000000000000000000000000000000000..1732b23fb1d40f1ea16918b93c3b8ce29d5cbbbe --- /dev/null +++ b/PRODUCT_DOCS/test/test2.md @@ -0,0 +1,34 @@ +## 与会人 +王云志,何佳,黄睿,王宝林,云孟,荆石,刘长生(搏元), +疏明,何求,章新豪
+费斐,帅家坤,Joyce(Linaro),Chase Qi(Linaro) +- 本次轮值主持:贺军 +- 下次轮值主持:王宝林 + +## 主题 +### 1. Arm architectural features support in kernel and toolchain +[Arm A-profile 架构手册](https://developer.arm.com/documentation/ddi0487/latest/) + +Arm架构的各个版本特性的软件支持情况链接: +- Linux Kernel: https://developer.arm.com/Tools%20and%20Software/Linux%20Kernel#Components +- GCC: https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain#Supported-Devices +- LLVM: https://developer.arm.com/Tools%20and%20Software/LLVM%20Toolchain#Supported-Devices + +### 2. Kernel test - LKFT introduction +- 来自Linaro的Chase介绍了[Linaro LKFT平台](https://lkft.linaro.org/tests/)的测试流程、用例和支持的硬件 +- 来自阿里的疏明介绍了目前龙蜥社区的测试情况 + - 社区使用T-One(https://tone.openanolis.cn/)做为测试平台,默认覆盖x86和Arm。 测试硬件是Ali ECS的实例。 + - 对CloudKernel的测试构建以build和boot为主;也有通用的regression测试。常用测试方式包括ltp,kselftest和xfstest等 + - 社区需要针对Arm处理器features的专门测试集(类似于lkvs) + - Arm方面会对此调研,并作后续沟通讨论 +### 3. Live patch on Arm +- 来自Arm的何佳介绍了Arm平台上live patch功能在Linux kernel上游和国内distro社区的支持情况,探讨了不同实现方式的局限性和未来的主流做法。 +- 龙蜥社区对目前5.10中的对Live patch的支持方式没有计划进行改动以保持对上游kernel的一致。 + - Arm建议给出相应文档以方便下游OSV和开发者制作正确的patch +- 对LTS选定的6.6内核,Arm SIG同意以kernel上游社区的实现(patch原型)为基础,进行backport。测试覆盖以kselftest和hotfix为标准。 +### 4. Arm SIG例会形式 +参会者同意以双周会的方式举办例会,具体形式为每两周的周二上午10点。此前议题的收集以邮件列表和钉钉群为主,到达率不高,需要考虑其他更公开方便的途径。 + + +## 遗留问题/跟进任务 +1. 王宝林:调研收集议题的途径 \ No newline at end of file diff --git a/PRODUCT_DOCS/test/test3.md b/PRODUCT_DOCS/test/test3.md new file mode 100644 index 0000000000000000000000000000000000000000..91235427fb96e341882aeb15ce3c6e126553a4d1 --- /dev/null +++ b/PRODUCT_DOCS/test/test3.md @@ -0,0 +1,117 @@ +# Anolis OS Cloud Kernel: datop技术设计细则 +## 概述 +在本文中,我们基于社区DAMON设计用于跟踪实时内存热点数据的工具DATOP, 采用划分内存区域采样的方式,并自适应区域构建技术来获取极低的开销损耗,在此基础上,还增加了numa 仿真功能,用于收集任务跨numa访存情况,为了评估其准确性和低开销能力,我们选取和测试了多个benchmark,并与基线进行比较, 结果表明:DATOP工具运行时开销非常小(保持在不到4%左右),此外开销不受负载大小的影响,仍然保持出色的监控质量,通过多方面测试,我们得出结论:DATOP工具在识别冷热内存以及跨numa访存方面具备优秀的表现能力。 + +## 背景:云计算大规模复杂场景下内存面临的挑战 +云计算领域场景下,海量用户数据的增长,对计算机软件和硬件设计都带来了巨大的挑战,尤其是内存设备如DRAM的速率提升并没有跟上这种高速增长趋势,在数据中心中,海量数据的处理,常常让服务器饱受内存不足之苦。 + +为克服低内存容量,混合内存使用如DRAM搭配PMEM成为未来数据中心的主流方案,但是如何快速识别热点数据并让其准确保持在DRAM中运行是影响性能的关键因素,这就要求系统具备快速识别热点数据的能力,并能动态跟踪捕获热点数据的变化,让其处于高性能的DRAM中,但是不幸的是,现有工具为达到一定的准确度,通常会耗费大量的时间,并引入额外的overhead,造成性能回退。 + +此外服务器硬件架构的快速迭代,cpu核数越来越多,numa节点也越来越多,例如amd服务器numa node数目达到8个,arm服务器飞腾s2500 numa节点达到16个,跨numa节点访存带来的性能影响日益突出,如何低开销高效的识别出跨numa热点数据,并优化之,对于提升系统服务质量,有着重要的意义。 + +## datop:轻量级靶向热点内存扫描工具 +### 热点扫描原理及策略 +在内存领域,对内存进行优化其实依靠预测内存的行为而做的决策,但是能够高效准确的预测内存的走势,其实是非常困难的,此外内存的策略优化对于用户来说,是不透明的,因此现有内存领域的各种策略机制,在实际生产环境中,并未取得很好的效果,也正是基于这些原因,社区推出了一种新的内存调控机制DAMON(Data Access Monitor),它试图让向用户展示内存的动作行为,让用户可根据这些行为相应的调整内存管理策略。 + +### 三大chunks +服务器现有硬件能支持非常巨大的地址空间,内存容量动不动达到几个T的大小,工作负载耗用内存几个G也是很普遍的事,随机毫无规律的划分地址空间可肯定是不可取的,并且在实际使用中,只有小部分区域被实际映射到内存并被访问,DAMON通过分析和论证,先将地址分为三大chunks,这三个chunks区域之间的空隙是给定地址空间中两个最大的未映射区域。 + +图1: +![](../assets/datop1.png) +在大多数情况下,两个最大的未映射区域是堆与最上层mmap() ed区域之间的空隙,以及最下层mmap() ed区域与堆栈之间的空隙, 因为这些间隙在通常的地址空间中是非常大的,排除这些就足够了,所以针对一个工作负载而言,DAMON只需要监控一下这些区域就足够了, 此外,随着工作负载的运行,映射区域发生变化,例如部分最大ummaped区域易主,所以damon设置了一个update周期,去周期性检测这些三大chunks的有效性,重新查找有效的三大chunks,并和现有监测得区域进行对比,删减或者增加。 + +### region拆分与合并 +在获取三大chunks后,damon会按照设定规则,去将这三个chuns划分为不均等的若干分regions, 如下图2所示: + +图2: +![](../assets/datop2.png) + +这些regions后面会随着热点频率去动态调整,进行拆分或者合并操作,其算法原理大致如下: + +拆分原则: + +- 大于等于2倍 DAMON_MIN_REGION才能拆分; +- 拆分region的size不能小于DAMON_MIN_REGION; +- region可以拆分为3个,当regions个数大于max_nr_regions的2/3后,降低拆分个数,由3变2; +- region个数必须保持在min_nr_regions和max_nr_regions范围内; +合并原则: + +- 两合并的regions必须收尾地址相等; +- 两者热点统计值之差必须小于设定阈值范围内; +### region采样与热点统计 +“trace 
page access bit”技术作为跟踪热点内存通用那就的技术手段,被业界广泛使用,但是该技术有一个固有的缺陷,那就是随着工作负载耗用内存增加,自身带来的开销和跟踪质量都会变糟,而damon通过region划分, 再结合空间采样方式良好的解决了该缺陷:假设一个region中,所有page都有相同的访问模式,这样的话,只需要监控一个page就够了。这样一来,在每个region里面,会随机选择一个page,在bitmap中把它对应的“accessed(访问过)”bit先清零,然后时不时地检查一下,如果这个page被访问过了,那么就算作这个region都被访问过了,它不是监视所有的页面,而是只监视reigon里面随机抽取的一个页面,该页面在某个采样时刻代表着这个region, 并将该值记录下来, 随着采样的不断进行,该region的热点就被统计下来了。 +``` +void check_access(region ∗ r) { + if (!r−>sampled_page) + goto next ; + if (accessed(r−>sampled_page) ) + r−>nr_accesses++; + next: + r−>sampled_page = rand_page(r->sampled_page); + clear_accessed(r−>sampled_page); +} +``` +上述伪代码只是介绍其采样和统计实现方法,其更复杂的逻辑处理关系,例如region合并和拆分后nr_access的值处理,以及nr_access周期性清零等本文不再介绍,有兴趣的读者可以依据damon实现源码自行分析。 + +### numa仿真实现 +社区damon能有有效的识别工作负载内存的冷热情况,但是对于工作负载内存跨numa访存这块,是无法做出判断的, 然而在实际业务中,跨numa访问造成的性能衰退问题真实存在,尤其是现今服务硬件多numa架构numa数目越来越多,正式基于以上原因,我们丰富了damon kernel部分代码,增加了内存numa访存情况。 + +和热点统计一样,numa仿真不会统计region中所有page的跨numa访问情况,而是利用damon空间采样方式,将damon获取的page在clear了access bit后,将其设置为pte none. +``` +void check_numa(region ∗ r) { + if (!r−>sampled_page) + goto next ; + if (local_numa_node(r−>sampled_page) ) + r−>nr_local++; + else + r->nr_remote++; + next: + r−>sampled_page = rand_page(r->sampled_page); + set_page_none(r−>sampled_page); +} +``` + +同样该部分伪代码只是介绍其numa基本实现,在实际中我们需要考虑pte处于swap,和page属于大页的情况,此外在pte设置为none后,会造成再次访问该page时发生page_fault和tlb miss的情况,我们测试发现,在某些频繁访问某块内存的工作负载中,造成一定的性能损耗,所以在实际使用中,我们增加了numa仿真开关,需要的时候去开启该功能。 + +### 小结 +基于上述几小节对damon以及numa仿真在kernel部分的实现机制的介绍,让我们对datop工具的实现原理有了清楚的认识,datop包括内核态部分和用户态部分,用户态可以通过perf调用功能,将内核态通过trace统计的热点信息捕获,并排序显示出来,详细调用流程入下图3显示。 + +图3: +![](../assets/datop3.png) + +在用户态,DATOP通过trace_event、damon dbgfs、以及numa switch接口和内核进行交互: + +蓝色绘制线部分:该部分是和用户态显示的核心, 通过内核kdamond线程将采样统计的相关值传递给trace接口, 用户态通过trace_event方式获取region区域热点信息,包括区域大小,access统计,进程信息以及跨numa统计等,最终通过窗口向用户展示。 + +黑色绘制线部分:该部分用于控制内核态线程kdamond的相关行为,通过damon dbfs接口,用于设置采样频率,更新周期,region个数划分,监控进程配置,kdamond线程的开启和关闭等。 + +绿色绘制线部分用于设定numa仿真功能的开启和关闭,此功能针对支持多numa的场景。 + +红色绘制线部分是热点工具的核心执行单元,用户态通过dbgfs接口开启监控后,kdamond线程被创建, 首先会查找被监控进程的三大chunks, 找到后,按照damon dbgfs接口设定的region个数方范围,对其进行拆分,此后按照设定好的采样频率进入周期性循环工作,直到被监控进程停止运行或用户操作dbgfs接口,停止监控,在周期性循环中,会对region热点进行随机采样并统计,此外还判定用户是否开启numa仿真功能,若开启还会对region跨numa情况进行统计,处理完成后,会更新采样结果,并通过trace event传递给用户,以上操作完成后,会依据kdamond线程运行时间,并在指定周期内,通过热点统计值对region进行拆分和合并操作,此外在更长的周期到来后,还会从新检查chunks的准确性,并按多加少减原则,修改region,以此保证热点内存跟踪的实时性和准确性。 + +至此DATOP技术实现原理介绍完毕,后面会进入介绍使用和数据测试方面的介绍。 + +## 使用 +在龙蜥社区[clouk-kernel](https://gitee.com/anolis/cloud-kernel) 5.10版本内核支持damon代码以及自研的numa仿真部分代码,结合开源的用户态工具[datop工具源码](https://gitee.com/anolis/data-profile-tools.git) 就可以运行起来了,datop工具支持单、多进程以及cgroup粒度内存热点监控以及跨numa访存监控 两种方式,详细使用可以参考源码readme部分。 +此外也可以参考龙蜥社区关于datop介绍的视频[轻量级靶向内存热点扫描工具介绍与入门](https://openanolis.cn/video/528538652696158417) + +目前龙蜥社区已整合datop工具到镜像中,你可以通过[datop rpm anolis a23](https://gitee.com/src-anolis-sig/data-profile-tools/tree/a23/) 在龙蜥社区5.10上构建完成的rpm包,此外也可以通过yum方式对datop工具进行安装。 +``` c +yum install datop +``` + +## 测试说明 +关于测试情况,可以参考先前文档介绍: +[datop测试情况介绍](https://openanolis.cn/sig/Cloud-Kernel/doc/721476494878572689) + +## 参考 +https://sjp38.github.io/post/damon/ +[Memory-management optimization with DAMON](https://lwn.net/Articles/812707/) + +[Using DAMON for proactive reclaim](https://lwn.net/Articles/863753/) + +[DAMON Extended To Offer Physical Memory Address Space Monitoring](https://www.phoronix.com/news/DAMON-Physical-Monitoring) + +[Proactively reclaiming idle memory](https://lwn.net/Articles/787611/) + +[damon用户态工具damo](https://github.com/awslabs/damo)