diff --git "a/app/zh/blogs/chenchao245/\346\225\205\351\232\234\346\263\250\345\205\245\346\241\206\346\236\266\344\275\277\347\224\250\346\214\207\345\215\227.md" "b/app/zh/blogs/chenchao245/\346\225\205\351\232\234\346\263\250\345\205\245\346\241\206\346\236\266\344\275\277\347\224\250\346\214\207\345\215\227.md"
new file mode 100644
index 0000000000000000000000000000000000000000..ff0833cb89580c828490614414727b44498963bd
--- /dev/null
+++ "b/app/zh/blogs/chenchao245/\346\225\205\351\232\234\346\263\250\345\205\245\346\241\206\346\236\266\344\275\277\347\224\250\346\214\207\345\215\227.md"
@@ -0,0 +1,177 @@
+---
+title: 'Fault Injection Framework'
+
+date: '2024-5-29'
+
+category: 'blog'
+tags: 'Fault Injection Framework'
+
+archives: '2024-5-29'
+
+author: 'chenchao'
+
+summary: 'Fault Injection Framework'
+
+times: '15:30'
+---
+
+## 1 Overview
+
+This article introduces the fault injection framework. It lets developers inject a chosen fault at a chosen point, simulating intermittent problems caused by external failures so that issues can be reproduced and fixes verified.
+
+
+## 2 Usage
+
+### 2.1 Identify the fault type
+
+The following five fault types currently exist:
+| Fault name                | Meaning         |
+| ------------------------- | --------------- |
+| DMS_FI_TYPE_PACKET_LOSS   | Packet loss     |
+| DMS_FI_TYPE_NET_LATENCY   | Network latency |
+| DMS_FI_TYPE_CPU_LATENCY   | CPU latency     |
+| DMS_FI_TYPE_PROCESS_FAULT | Process exit    |
+| DMS_FI_TYPE_CUSTOM_FAULT  | Custom fault    |
+
+Each fault type is described along two dimensions: a fault point sequence and a fault value.
+| Fault type      | Fault point sequence        | Fault value              | Meaning of the fault value |
+| :-------------: | :-------------------------: | :----------------------: | :------------------------: |
+| CPU latency     | SS_FI_CPU_LATENCY_ENTRIES   | SS_FI_CPU_LATENCY_MS     | Delay in milliseconds      |
+| Network latency | SS_FI_NET_LATENCY_ENTRIES   | SS_FI_NET_LATENCY_MS     | Delay in milliseconds      |
+| Process exit    | SS_FI_PROCESS_FAULT_ENTRIES | SS_FI_PROCESS_FAULT_PROB | Exit probability           |
+| Packet loss     | SS_FI_PACKET_LOSS_ENTRIES   | SS_FI_PACKET_LOSS_PROB   | Packet loss probability    |
+| Custom          | SS_FI_CUSTOM_FAULT_ENTRIES  | SS_FI_CUSTOM_FAULT_PARAM | User-defined               |
+
+The fault point sequence is a string of the form 1,2,3,4, meaning that the fault is enabled at points 1, 2, 3, and 4;
+each fault also has a fault value, whose meaning is given in the table above.
+
+First identify which kind of fault needs to be injected and pick the matching fault type for the steps that follow.
+
+### 2.2 Identify the fault point
+#### Existing fault points
+DMS side:
+
+| Request messages                            | Response messages               |
+| ------------------------------------------- | ------------------------------- |
+| DMS_FI_REQ_ASK_MASTER_FOR_PAGE = 1          | DMS_FI_ACK_CHECK_VISIBLE = 35   |
+| DMS_FI_REQ_ASK_OWNER_FOR_PAGE               | DMS_FI_ACK_PAGE_OWNER_ID        |
+| DMS_FI_REQ_INVALIDATE_SHARE_COPY            | DMS_FI_ACK_BROADCAST            |
+| DMS_FI_CLAIM_OWNER                          | DMS_FI_ACK_BROADCAST_WITH_MSG   |
+| DMS_FI_REQ_CR_PAGE                          | DMS_FI_ACK_PAGE_READY           |
+| DMS_FI_REQ_ASK_MASTER_FOR_CR_PAGE           | DMS_FI_ACK_GRANT_OWNER          |
+| DMS_FI_REQ_ASK_OWNER_FOR_CR_PAGE            | DMS_FI_ACK_ALREADY_OWNER        |
+| DMS_FI_REQ_CHECK_VISIBLE                    | DMS_FI_ACK_CR_PAGE              |
+| DMS_FI_REQ_TRY_ASK_MASTER_FOR_PAGE_OWNER_ID | DMS_FI_ACK_TXN_WAIT             |
+| DMS_FI_REQ_BROADCAST                        | DMS_FI_ACK_LOCK                 |
+| DMS_FI_REQ_TXN_INFO                         | DMS_FI_ACK_TXN_INFO             |
+| DMS_FI_REQ_TXN_SNAPSHOT                     | DMS_FI_ACK_TXN_SNAPSHOT         |
+| DMS_FI_REQ_WAIT_TXN                         | DMS_FI_ACK_WAIT_TXN             |
+| DMS_FI_REQ_AWAKE_TXN                        | DMS_FI_ACK_AWAKE_TXN            |
+| DMS_FI_REQ_MASTER_CKPT_EDP                  | DMS_FI_ACK_MASTER_CKPT_EDP      |
+| DMS_FI_REQ_OWNER_CKPT_EDP                   | DMS_FI_ACK_OWNER_CKPT_EDP       |
+| DMS_FI_REQ_MASTER_CLEAN_EDP                 | DMS_FI_ACK_MASTER_CLEAN_EDP     |
+| DMS_FI_REQ_OWNER_CLEAN_EDP                  | DMS_FI_ACK_OWNER_CLEAN_EDP      |
+| DMS_FI_REQ_MGRT_MASTER_DATA                 | DMS_FI_ACK_ERROR                |
+| DMS_FI_REQ_RELEASE_OWNER                    | DMS_FI_ACK_RELEASE_PAGE_OWNER   |
+| DMS_FI_REQ_BOC                              | DMS_FI_ACK_INVLDT_SHARE_COPY    |
+| DMS_FI_REQ_CONFIRM_CVT                      | DMS_FI_ACK_BOC                  |
+| DMS_FI_REQ_DDL_SYNC                         | DMS_FI_ACK_EDP_LOCAL            |
+| DMS_FI_REQ_INVALID_OWNER                    | DMS_FI_ACK_EDP_READY            |
+| DMS_FI_REQ_ASK_RES_OWNER_ID                 | DMS_FI_ACK_INVLD_OWNER          |
+| DMS_FI_REQ_PROTOCOL_MAINTAIN_VERSION        | DMS_FI_ACK_ASK_RES_OWNER_ID     |
+| DMS_FI_REQ_CREATE_GLOBAL_XA_RES             | DMS_FI_ACK_CREATE_GLOBAL_XA_RES |
+| DMS_FI_REQ_DELETE_GLOBAL_XA_RES             | DMS_FI_ACK_DELETE_GLOBAL_XA_RES |
+| DMS_FI_REQ_ASK_XA_OWNER_ID                  | DMS_FI_ACK_ASK_XA_OWNER_ID      |
+| DMS_FI_REQ_END_XA                           | DMS_FI_ACK_END_XA               |
+| DMS_FI_REQ_ASK_XA_IN_USE                    | DMS_FI_ACK_XA_IN_USE = 65       |
+| DMS_FI_REQ_MERGE_XA_OWNERS                  |                                 |
+| DMS_FI_REQ_XA_REBUILD                       |                                 |
+| DMS_FI_REQ_XA_OWNERS                        |                                 |
+| DMS_FI_REQ_RECYCLE = 34                     |                                 |
+
+openGauss side:
+| DB_FI_CHANGE_BUFFERTAG_BLOCKNUM = **10000** |
+| ------------------------------------------- |
+
+Inject the fault at whichever point fits the scenario: either one of the existing points listed above or a newly added custom point.
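+Note: to add a new point, add an enumerator to the dms_fi_point_name_e enum in DMS or the db_fi_point_name enum in openGauss, then call DMS_FAULT_INJECTION_CALL or SS_FAULT_INJECTION_CALL at the corresponding code location, passing the new enumerator as an argument.
+
+### 2.3 Inject faults at a new point
+DMS-side macros:
+
+DMS_FAULT_INJECTION_CALL(point,...): activates the fault; CPU latency, network latency, and process exit faults are executed immediately
+
+FAULT_INJECTION_ACTION_TRIGGER(action): executes the packet loss fault
+
+FAULT_INJECTION_ACTION_TRIGGER_CUSTOM(action): executes the custom fault
+
+openGauss-side macros:
+
+SS_FAULT_INJECTION_CALL(point,...): activates the fault; CPU latency, network latency, and process exit faults are executed immediately
+
+FAULT_INJECTION_ACTION_TRIGGER_CUSTOM(action): executes the custom fault
+
+|                | CPU latency<br>Network latency<br>Process exit | Packet loss                                                 | Custom fault                                                       |
+| -------------- | ---------------------------------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------ |
+| DMS side       | DMS_FAULT_INJECTION_CALL                       | DMS_FAULT_INJECTION_CALL<br>FAULT_INJECTION_ACTION_TRIGGER | DMS_FAULT_INJECTION_CALL<br>FAULT_INJECTION_ACTION_TRIGGER_CUSTOM |
+| openGauss side | SS_FAULT_INJECTION_CALL                        |                                                             | SS_FAULT_INJECTION_CALL<br>FAULT_INJECTION_ACTION_TRIGGER_CUSTOM  |
+
+A custom fault requires a new custom fault function, which is passed as an argument to DMS_FAULT_INJECTION_CALL or SS_FAULT_INJECTION_CALL.
+***Note***: when CALL and TRIGGER are used together, CALL performs the preparatory step and TRIGGER actually fires the fault.
+
+
+### 2.4 Enable the fault
+1. Latency: set multiple fault points for network latency and CPU latency, set the delay, and check the log to confirm that the delay takes effect;
+
+Example: alter system set ss_fi_net_latency_entries='1,2,3,4'; // enable network latency at points 1, 2, 3, and 4
+
+alter system set ss_fi_net_latency_ms=30; // ss_fi_net_latency_ms ranges from 0 to 4294967295
+
+Log line when the fault takes effect: [DMS_FI]entry:%d triggers network latency
+
+alter system set ss_fi_cpu_latency_entries='1,2,3,4'; // enable CPU latency at points 1, 2, 3, and 4
+
+alter system set ss_fi_cpu_latency_ms=30; // ss_fi_cpu_latency_ms ranges from 0 to 4294967295
+
+Log line when the fault takes effect: [DMS_FI]entry:%d triggers cpu latency
+
+2. Process exit: set multiple fault points for process exit, set the exit probability, and check the log to confirm that the process exits
+
+Example: alter system set ss_fi_process_fault_entries='1,2,3,4'; // enable process exit at points 1, 2, 3, and 4
+
+alter system set ss_fi_process_fault_prob=30; // ss_fi_process_fault_prob ranges from 0 to 100
+
+Log line when the fault takes effect: [DMS_FI]entry:%d triggers proc fault exit, %d in %d
+
+3. Packet loss: set multiple fault points for packet loss, set the loss probability, and check the log to confirm that request or response messages fail to be sent
+
+Example: alter system set ss_fi_packet_loss_entries='1,2,3,4'; // enable packet loss at points 1, 2, 3, and 4
+
+alter system set ss_fi_packet_loss_prob=30; // ss_fi_packet_loss_prob ranges from 0 to 100
+
+Log line when the fault takes effect: [DMS_FI]triggers packloss cmd:%u, %d in %d
+
+4. Custom: as a worked case, an intermittent problem was taken from a recent issue ticket and the code was rolled back to the affected version. A custom fault at the relevant point simulated the failure from the ticket, and after loading data with tpcc the same core dump as in the ticket was produced. With the code updated to the latest version, the same procedure produced no core dump.
+
+Example: alter system set ss_fi_custom_fault_entries='10000'; // set the custom fault point
+
+alter system set ss_fi_custom_fault_param=30; // ss_fi_custom_fault_param ranges from 0 to 4294967295
+
+Log line when the fault takes effect: [DMS_FI]entry:%d triggers cust fault
+
+## 3 Summary
+### 3.1 Inject a fault at an existing point
+Use alter system set ss_fi_XXX_entries='x,x,x,x' to enable fault XXX at the four points x,x,x,x.
+Then use alter system set ss_fi_xxxxx=XX to set the fault value for the fault being injected.
+
+### 3.2 Inject a fault at a new point
+To add a new point, add an enumerator to the dms_fi_point_name_e enum in DMS or the db_fi_point_name enum in openGauss, then call DMS_FAULT_INJECTION_CALL or SS_FAULT_INJECTION_CALL at the corresponding code location, passing the new enumerator as the first argument.
+Use alter system set ss_fi_XXX_entries='x', where x is the value of the new enumerator, to enable fault XXX at point x.
+Then use alter system set ss_fi_xxxxx=XX to set the fault value for the fault being injected.
+
+### 3.3 Inject a custom fault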
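+To add a custom fault, first add a new point as described in 3.2, then implement a custom fault function that suits your needs,
+and call DMS_FAULT_INJECTION_CALL or SS_FAULT_INJECTION_CALL at the corresponding code location, passing the custom fault function as the second argument (an empty implementation is allowed).
+Finally, call FAULT_INJECTION_ACTION_TRIGGER_CUSTOM and pass in the actual fault action.
+Use alter system set ss_fi_custom_fault_entries='x', where x is the value of the new enumerator, to enable the custom fault at point x.
+Then use alter system set ss_fi_custom_fault_param=XX to set the fault value (optional; the custom fault may ignore it).
\ No newline at end of file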
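The way a fault point sequence and a fault value combine can be illustrated with a small toy model. The real framework is implemented as C macros in DMS and openGauss, so the Python below is only a sketch of the idea, not its implementation: `parse_entries` and `FaultModel` are hypothetical names, and reading a probability value as "percent out of 100" is an assumption based on the documented 0-100 range.

```python
import random
from typing import Optional, Set


def parse_entries(entries: str) -> Set[int]:
    """Parse a point sequence such as '1,2,3,4' into a set of point numbers."""
    return {int(tok) for tok in entries.split(",") if tok.strip()}


class FaultModel:
    """Toy model of one fault type: enabled points plus a single fault value."""

    def __init__(self, entries: str = "", value: int = 0,
                 rng: Optional[random.Random] = None):
        self.enabled = parse_entries(entries)
        self.value = value
        self.rng = rng if rng is not None else random.Random()

    def latency_ms(self, point: int) -> int:
        """Latency-type fault: delay to apply at this point, 0 if not enabled."""
        return self.value if point in self.enabled else 0

    def should_trigger(self, point: int) -> bool:
        """Probability-type fault: fire with `value` percent chance (assumed)."""
        if point not in self.enabled:
            return False
        return self.rng.randrange(100) < self.value


# Mirrors ss_fi_net_latency_entries='1,2,3,4' and ss_fi_net_latency_ms=30.
net_latency = FaultModel("1,2,3,4", 30)
print(net_latency.latency_ms(2))   # 30: point 2 is enabled
print(net_latency.latency_ms(35))  # 0: point 35 is not in the sequence
```

A point outside the sequence is never affected, whatever the fault value is, which is why both settings have to be applied before a fault becomes observable in the log.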