From 04803546854062b9b7f52349f856d661fbc2641b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=8B=8F=E7=AB=8B?= <404325854@qq.com> Date: Mon, 1 Jul 2024 09:54:27 +0800 Subject: [PATCH] =?UTF-8?q?=E4=BF=AE=E6=94=B9=E4=BA=A7=E5=93=81docs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- PRODUCT_DOCS/test/test1.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/PRODUCT_DOCS/test/test1.md b/PRODUCT_DOCS/test/test1.md index 232454f..9769d1a 100644 --- a/PRODUCT_DOCS/test/test1.md +++ b/PRODUCT_DOCS/test/test1.md @@ -52,7 +52,7 @@ Talking about detected hardware errors, we can classify memory errors as either - **Correctable Error (CE)** - the hardware error detection mechanism detected and automatically corrected the error. - **Uncorrected errors (UCE)** - are severe enough, hardware detects but cannot correct. -![MCA categories 2.png](../../assets/MCA_categories_2.png) +![MCA categories 2.png](../assets/datop1.png) Typically, uncorrectable errors further fall into three categories: @@ -71,12 +71,12 @@ Prior to enhanced machine check architecture (EMCA), IA32-legacy version of Mach EMCA enables BIOS-based recovery from errors which redirects MCE and CMCI to firmware first (via SMI) before sending it to the OS error handler. It allows firmware first to handle, collect, and build enhanced error logs then report to system software. -![ras_x86.png](../../assets/ras_x86.png) +![ras_x86.png](../assets/datop1.png) ### ARM v8.2 RAS Extension The RAS Extension is a mandatory extension to the Armv8.2 architecture, and it is an optional extension to the Armv8.0 and Armv8.1 architectures. The figure shows a basic workflow with Firmware First mode. 
-![m1_ras_flow.png](../../assets/m1_ras_flow.png) +![m1_ras_flow.png](../assets/datop2.png) - Prerequisite: System boot and init @@ -101,7 +101,7 @@ To reduce systems downtime, the OS recovery process for ensuring reliable hardwa The figure shows the system error handling flow with Anolis OS. -![RAS_OS_Error_Flow.png](../../assets/RAS_OS_Error_Flow.png) +![RAS_OS_Error_Flow.png](../assets/datop3.png) ### Memory Failure Recovery @@ -127,20 +127,18 @@ mcelog or RAS tracepoint, and the possible results of the actions appear to be i The HWPoison handler starts the recovery action by isolating the affected page and declaring it with a “poisoned” tag to disallow any reuse of the page. In the case of an AR-instruction abort event, the HWPoison handler then reloads the 4KB page containing the instruction to a new physical page and resumes normal operation. In the case of an AR-data abort event, the HWPoison handler triggers a “SIGBUS” event to take further recovery action by notifying only the accessing process or any owner process which is configured by hwpoison-aware technique like prctl or early kill. The application has a choice to either reload the data and resume normal execution, or terminate the application to avoid crashing the entire system. -![EL0_Recovery.png](../../assets/EL0_Recovery.png) + #### Kernel Space Action Required Recovery The kernel itself resides in one address space, and contains a process scheduler, networking stack, virtual file system, and device drivers for hardware support, to name just a few, shared by all user space processes. When a user space application requires the services provided by the kernel, it will signal the kernel to execute a syscall, and switch to kernel mode for the duration of the syscall execution. In principle, if any UCE error was triggered while executing OS kernel code, then the UCE error will be fatal. Kernel also provides user space memory access APIs for cross-space data movement from or to user memory. 
In Linux, cross-space data movements can only be performed by special functions, defined in ``. Such a movement is performed either by a generic (memcpy-like) function or by functions optimized for a specific data size (char, short, int, long). The role of the data-movement functions is shown in the following figure as it relates to the types involved in the copy (simple vs. aggregate); note that not all user access APIs are shown. -![uaccess.png](../../assets/uaccess.png) For example, when a user process tries to write a buffer to a file, the kernel copies the data from userspace and then writes it to disk. If a UCE error occurs in the userspace buffer, the kernel consumes the poisoned data while copying from userspace. In such a case, a system-wide reboot is unnecessary. The point behind Kernel Space Action Required Recovery is that the poisoned data manipulated by the kernel is owned by the user process. If the application that initiated the copy and owns the corrupt data can be easily identified by the kernel, it is possible to isolate the corrupt data by marking the affected page with the ‘poison’ tag and terminating the initiating or impacted applications to stop the corrupt data from spreading. The mechanism is to track uaccess in the extable in advance and change the pc to the fixup handler while handling synchronous errors. The uaccess then jumps to the fixup handler, which ends the uaccess process. If the exception is fixed up correctly, the kernel can avoid a panic. In the copy-from-user case, e.g. initiated by write(2), it is not even necessary to send a SIGBUS. System calls should return -EFAULT or a short count for write(2). The figure shows the basic workflow for the Arm64 platform; the implementation on the X86 platform is similar. -![EL2_Recovery_x86.png](../../assets/EL2_Recovery_x86.png) #### Action Optional Recovery: Patrol Scrub -- Gitee