From 22b6377e12113b07093e44735cdb3a60c259bfc6 Mon Sep 17 00:00:00 2001
From: dowzyx <zhaoyuxing2@huawei.com>
Date: Mon, 19 Sep 2022 21:31:13 +0800
Subject: [PATCH] docs(gopher): modify README and gopher_tech_abnormal docs

---
 README.md               |  22 +++----
 gopher_tech_abnormal.md | 127 ++++++++++++++++++++++++----------------
 2 files changed, 89 insertions(+), 60 deletions(-)
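
Note: the threshold check described in the new introduction of gopher_tech_abnormal.md can be sketched as follows. This is a minimal illustration under stated assumptions, not gala-gopher's implementation. Only the `-l WARN`, `-U` and `-L` start parameters come from the document; the function names `parse_thresholds` and `check_threshold`, the strict comparisons, and the return values are hypothetical.

```python
import argparse


def parse_thresholds(argv):
    """Parse probe-style start parameters (hypothetical helper).

    -l : event switch, e.g. WARN (from the document)
    -U : upper limit in percent (from the document)
    -L : lower limit in percent (from the document)
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-l", dest="level", default=None)
    parser.add_argument("-U", dest="upper", type=float, default=None)
    parser.add_argument("-L", dest="lower", type=float, default=None)
    return parser.parse_args(argv)


def check_threshold(value, opts):
    """Classify a metric value against the configured limits.

    Returns "above_upper", "below_lower", or None when the value is in
    range or event reporting is not enabled. Strict comparison is an
    assumption; the real probes may treat the boundary differently.
    """
    if opts.level != "WARN":
        return None  # abnormal event reporting is switched off
    if opts.upper is not None and value > opts.upper:
        return "above_upper"
    if opts.lower is not None and value < opts.lower:
        return "below_lower"
    return None


if __name__ == "__main__":
    opts = parse_thresholds(["-l", "WARN", "-U", "80", "-L", "5"])
    for sample in (91.5, 50.0, 2.0):
        print(sample, check_threshold(sample, opts))
```

Events listed with an input parameter of NA would skip this comparison, since the document states they are decided by whether the metric value is 0.
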
diff --git a/README.md b/README.md
index f8db51e..a24ea37 100644
--- a/README.md
+++ b/README.md
@@ -68,9 +68,9 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
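 **Terminology**
 
 - **Probe**: a program in gala-gopher that performs a specific data collection task. There are two types, native and extend: a native probe runs its collection task in a separate thread, while an extend probe runs it in a child process. gala-gopher can start some or all probes depending on its configuration.
-- **Observed entity (entity_name)**: defines an observable object in the system; all data collected by probes belongs to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link entity includes the process ID, the IP 5-tuple, and the protocol family, while its metrics include runtime indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](https://gitee.com/openeuler/gala-docs#%E8%A7%82%E6%B5%8B%E5%AE%9E%E4%BD%93), gala-gopher can also be extended with new ones.
+- **Observed entity (entity_name)**: defines an observable object in the system; all data collected by probes belongs to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link entity includes the process ID, the IP 5-tuple, and the protocol family, while its metrics include runtime indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](#观测实体), gala-gopher can also be extended with new ones.
 - **Data table (table_name)**: an observed entity is composed of one or more data tables. A data table is usually produced by a single collection task, so one observed entity may be populated by several collection tasks.
-- **meta file**: a file that defines an observed entity (including its data tables). Each meta file in the system must be unique, and definitions must not conflict. See the specification [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6).
+- **meta file**: a file that defines an observed entity (including its data tables). Each meta file in the system must be unique, and definitions must not conflict. See the specification [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#meta%E6%96%87%E4%BB%B6%E5%AE%9A%E4%B9%89%E8%A7%84%E8%8C%83).
 
 ### Supported technologies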
 
@@ -86,21 +86,21 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
 
 - **Metrics integration**
 
-  **prometheus exporter mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set metrics to be reported via web and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#webserver%E9%85%8D%E7%BD%AE); gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.
+  **prometheus exporter mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set metrics to be reported via web and edit the <u>web_server</u> section of the configuration file; gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.
 
-  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set metrics to be reported via kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports metrics periodically. Users need to transfer the metrics data into prometheus themselves.
+  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set metrics to be reported via kafka and configure <u>kafka_topic</u>; gala-gopher then works as a kafka client and reports metrics periodically. Users need to transfer the metrics data into prometheus themselves.
 
 - **Event integration**
 
-  **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set event to be reported via logs and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE); gala-gopher then works in logs mode and writes events as logs to the configured directory. Users can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.
+  **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set event to be reported via logs and configure the log path in the <u>logs</u> section; gala-gopher then works in logs mode and writes events as logs to the configured directory. Users can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.
 
-  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set event to be reported via kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports events periodically.
+  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set event to be reported via kafka and configure <u>kafka_topic</u>; gala-gopher then works as a kafka client and reports events periodically.
 
 - **Meta file integration**
 
-  **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set meta to be reported via logs and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE); gala-gopher then works in logs mode and writes all meta files integrated into gala-gopher as logs to the configured directory. Users need to forward the meta information to a kafka channel.
+  **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set meta to be reported via logs and configure the log path in the <u>logs</u> section; gala-gopher then works in logs mode and writes all meta files integrated into gala-gopher as logs to the configured directory. Users need to forward the meta information to a kafka channel.
 
-  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set event to be reported via kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports meta information periodically.
+  **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set meta to be reported via kafka and configure <u>kafka_topic</u>; gala-gopher then works as a kafka client and reports meta information periodically.
 
 ### Extending the scope of data collection
 
@@ -108,13 +108,13 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
 
 - **Define an observed entity**
 
-Define a new observed entity (or update an existing one) to carry the newly collected metrics data. Use a meta file (see the specification [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6)) to define the key, labels (optional), and metrics of the observed entity, then place the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#23-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9B%AE%E5%BD%95).
+Define a new observed entity (or update an existing one) to carry the newly collected metrics data. Use a meta file (see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#2-%E5%AE%9A%E4%B9%89meta%E6%96%87%E4%BB%B6)) to define the key, labels (optional), and metrics of the observed entity, then place the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%BC%80%E5%8F%91%E8%A7%86%E5%9B%BE).
 
 - **Integrate a data probe**
 
-Users can wrap data collection software in any programming language (shell, python, java, and so on) and have the script output the collected data through a Linux pipe in the [format](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#123-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87) defined by the meta file.
+Users can wrap data collection software in any programming language (shell, python, java, and so on) and have the script output the collected data through a Linux pipe in the format defined by the meta file; see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#3-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87-1).
 
-See the [cAdvisor](https://gitee.com/openeuler/gala-gopher/tree/master/src/probes/extends/python.probe/cadvisor.probe) third-party probe integration example.
+See the [cAdvisor third-party probe integration example](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%A6%82%E4%BD%95%E6%96%B0%E5%A2%9Eextends%E6%8E%A2%E9%92%88).
 
 ## gala-spider
 
diff --git a/gopher_tech_abnormal.md b/gopher_tech_abnormal.md
index 1e8abe7..6578764 100644
--- a/gopher_tech_abnormal.md
+++ b/gopher_tech_abnormal.md
@@ -1,66 +1,95 @@
-# TCP（entity_name：tcp_link）
+# gala-gopher system abnormal events
 
-| metrics_name  | description                                | param             | level |
-| ------------- | ------------------------------------------ | ----------------- | ----- |
-| tcp_oom       | TCP out of memory(%u).                     | P1: error count   | WARN  |
-| backlog_drops | TCP backlog queue drops(%u).               | P1: drops count   | WARN  |
-| filter_drops  | TCP filter drops(%u).                      | P1: drops count   | WARN  |
-| syn_srtt      | TCP connection establish timed out(%u us). | P1: syn rtt times | WARN  |
+## Introduction
 
-# ENDPOINT
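+gala-gopher provides system anomaly detection. When starting each probe, users can define the abnormal range with thresholds (upper and lower limits). The probe uses these thresholds to decide whether a metric is abnormal and, if so, reports an abnormal event.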
 
-| metrics_name        | description                     | param              | level |
-| ------------------- | ------------------------------- | ------------------ | ----- |
-| listendrop          | TCP listen drops(%lu).          | P1: drops count    | WARN  |
-| accept_overflow     | TCP accept queue overflow(%lu). | P1: overflow count | WARN  |
-| syn_overflow        | TCP syn queue overflow(%lu).    | P1: overflow count | WARN  |
-| passive_open_failed | TCP passive open failed(%lu).   | P1: failed count   | WARN  |
-| active_open_failed  | TCP active open failed(%lu).    | P1: failed count   | WARN  |
-| bind_rcv_drops      | UDP(S) queue drops(%lu).        | P1: drops count    | WARN  |
-| udp_rcv_drops       | UDP(C) queue drops(%lu).        | P1: drops count    | WARN  |
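+## How to enable abnormal events
 
+- For the probes that support abnormal events, see [Supported abnormal events](#supported-abnormal-events).
+- Enable abnormal event reporting with the probe start parameter `-l WARN`.
+- Set thresholds, for example `-U 80` for a resource utilization upper limit of 80% and `-L 5` for a lower limit of 5%.
 
+> Note: the abnormal event switch and the thresholds are passed as probe start parameters; see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E5%90%AF%E5%8A%A8%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) for the available parameters.
 
-# THREAD（entity_name：task）
+## Supported abnormal events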
 
-| metrics_name  | description                                                  | param                                                        | level |
-| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| off_cpu_ns    | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | WARN  |
-| iowait_us     | Process(COMM:%s TID:%d) iowait %llu us.                      | P1: process name P2: process id P3: io-wait times            | WARN  |
-| hang_count    | Process(COMM:%s TID:%d) io hang %u.                          | P1: process name P2: process id P3: error count              | WARN  |
-| bio_err_count | Process(COMM:%s TID:%d) bio error %u.                        | P1: process name P2: process id P3: error count              | WARN  |
+This chapter lists the supported abnormal events per observed entity (`entity_name`).
 
-# Process
+### TCP_LINK
 
-| metrics_name       | description                                                  | param                                                        | level |
-| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| syscall_failed     | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5 failed count | WARN  |
-| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u).        | P1: process name P2: process id P3 failed count              | WARN  |
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------- | ------------------------------------------ | ----------------- | -------- | -------- |
+| tcp_oom       | TCP out of memory(%u).                     | P1: error count   | NA       | WARN     |
+| backlog_drops | TCP backlog queue drops(%u).               | P1: drops count   | [-D <>]  | WARN     |
+| filter_drops  | TCP filter drops(%u).                      | P1: drops count   | [-D <>]  | WARN     |
+| syn_srtt      | TCP connection establish timed out(%u us). | P1: syn rtt times | [-T <>]  | WARN     |
 
-# BLOCK
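+> Note: an input parameter of NA means that no external threshold is required; internally, whether the metric is abnormal is decided by whether its value is 0.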
 
-| metrics_name         | description                                                  | param                                                        | level |
-| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| count_iscsi_err      | Iscsi errors(%llu) occured on Block(%s, disk %s).            | P1: block name P2: disk name                                 | WARN  |
-| count_iscsi_tmout    | Iscsi timeout(%llu) occured on Block(%s, disk %s).           | P1: block name P2: disk name                                 | WARN  |
-| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1：flush jitter latency, unit is us P2: block name P3: disk name | WARN  |
-| latency_flush_max    | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1：flush latency, unit is us P2: block name P3: disk name   | WARN  |
-| latency_req_jitter   | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1：request jitter latency, unit is us P2: block name P3: disk name | WARN  |
-| latency_req_max      | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1：request latency, unit is us P2: block name P3: disk name | WARN  |
+### ENDPOINT
 
-# DISK
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------------- | ------------------------------- | ------------------ | -------- | -------- |
+| listendrop          | TCP listen drops(%lu).          | P1: drops count    | NA       | WARN     |
+| accept_overflow     | TCP accept queue overflow(%lu). | P1: overflow count | NA       | WARN     |
+| syn_overflow        | TCP syn queue overflow(%lu).    | P1: overflow count | NA       | WARN     |
+| passive_open_failed | TCP passive open failed(%lu).   | P1: failed count   | NA       | WARN     |
+| active_open_failed  | TCP active open failed(%lu).    | P1: failed count   | NA       | WARN     |
+| bind_rcv_drops      | UDP(S) queue drops(%lu).        | P1: drops count    | NA       | WARN     |
+| udp_rcv_drops       | UDP(C) queue drops(%lu).        | P1: drops count    | NA       | WARN     |
 
-| metrics_name    | description                     | param          | level |
-| --------------- | ------------------------------- | -------------- | ----- |
-| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | WARN  |
-| block_userd_per | Too many Blocks used(%d%%).     | P1: Percentage | WARN  |
-| iostat_util     | Disk device saturated(%.2f%%).  | P1: Percentage | WARN  |
+### THREAD
 
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| off_cpu_ns    | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | NA       | WARN     |
+| iowait_us     | Process(COMM:%s TID:%d) iowait %llu us.                      | P1: process name P2: process id P3: io-wait times            | [-T <>]  | WARN     |
+| hang_count    | Process(COMM:%s TID:%d) io hang %u.                          | P1: process name P2: process id P3: error count              | NA       | WARN     |
+| bio_err_count | Process(COMM:%s TID:%d) bio error %u.                        | P1: process name P2: process id P3: error count              | NA       | WARN     |
 
+### PROC
 
-# NET
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| syscall_failed     | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5: failed count | NA       | WARN     |
+| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u).        | P1: process name P2: process id P3: failed count             | NA       | WARN     |
 
-| metrics_name        | description                      | param           | level |
-| ------------------- | -------------------------------- | --------------- | ----- |
-| net_device_tx_drops | net device tx queue drops(%llu). | P1: drops count | WARN  |
-| net_device_rx_drops | net device rx queue drops(%llu). | P1: drops count | WARN  |
\ No newline at end of file
+### BLOCK
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | :------- |
+| count_iscsi_err      | Iscsi errors(%llu) occured on Block(%s, disk %s).            | P1: error count P2: block name P3: disk name                 | NA       | WARN     |
+| count_iscsi_tmout    | Iscsi timeout(%llu) occured on Block(%s, disk %s).           | P1: timeout count P2: block name P3: disk name               | NA       | WARN     |
+| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1: flush jitter latency, unit is us P2: block name P3: disk name | [-J <>]  | WARN     |
+| latency_flush_max    | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1: flush latency, unit is us P2: block name P3: disk name   | [-T <>]  | WARN     |
+| latency_req_jitter   | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1: request jitter latency, unit is us P2: block name P3: disk name | [-J <>]  | WARN     |
+| latency_req_max      | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1: request latency, unit is us P2: block name P3: disk name | [-T <>]  | WARN     |
+
+### DISK
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ----------- | ------------------------------ | -------------- | -------- | -------- |
+| iostat_util | Disk device saturated(%.2f%%). | P1: Percentage | [-U <>]  | WARN     |
+
+### DF
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| --------------- | ------------------------------- | -------------- | -------- | -------- |
+| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | [-U <>]  | WARN     |
+| block_userd_per | Too many Blocks used(%d%%).     | P1: Percentage | [-U <>]  | WARN     |
+
+### NIC
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| -------------------- | --------------------------------- | ---------------- | -------- | -------- |
+| net_device_tx_drops  | net device tx queue drops(%llu).  | P1: drops count  | [-D <>]  | WARN     |
+| net_device_rx_drops  | net device rx queue drops(%llu).  | P1: drops count  | [-D <>]  | WARN     |
+| net_device_tx_errors | net device tx queue errors(%llu). | P1: errors count | [-D <>]  | WARN     |
+| net_device_rx_errs   | net device rx queue errors(%llu). | P1: errors count | [-D <>]  | WARN     |
+
+### CPU
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ---------- | --------------------------------- | -------------- | -------- | -------- |
+| used_per   | Too high cpu utilization(%.2f%%). | P1: Percentage | [-U <>]  | WARN     |
\ No newline at end of file
-- 
Gitee