diff --git a/README.md b/README.md
index f8db51ed371e86a82e616e901bf87331a83c937c..a24ea37597e0ec6e433d3dee2e86dd6eb345867d 100644
--- a/README.md
+++ b/README.md
@@ -68,9 +68,9 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
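**Terminology**
- **Probe**: a program inside gala-gopher that performs a specific data collection task. There are two types of probe, native and extend: a native probe starts its collection task as a separate thread, while an extend probe starts it as a child process. Which probes gala-gopher starts (some or all) can be changed through configuration.
-- **Observed entity (entity_name)**: defines an observed object in the system. All data collected by probes is attributed to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link entity includes the process ID, the IP 5-tuple, and the protocol family, while its metrics include runtime status indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](https://gitee.com/openeuler/gala-docs#%E8%A7%82%E6%B5%8B%E5%AE%9E%E4%BD%93), gala-gopher can also be extended with new observed entities.
+- **Observed entity (entity_name)**: defines an observed object in the system. All data collected by probes is attributed to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link entity includes the process ID, the IP 5-tuple, and the protocol family, while its metrics include runtime status indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](#观测实体), gala-gopher can also be extended with new observed entities.
- **Data table (table_name)**: an observed entity is composed of one or more data tables. A data table is usually produced by one collection task, so a single observed entity may be produced jointly by several collection tasks.
-- **meta file**: a file that defines an observed entity (including its data tables). Meta files must be unique within the system, and their definitions must not conflict. For the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6).
+- **meta file**: a file that defines an observed entity (including its data tables). Meta files must be unique within the system, and their definitions must not conflict. For the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#meta%E6%96%87%E4%BB%B6%E5%AE%9A%E4%B9%89%E8%A7%84%E8%8C%83).
### Supported technologies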
@@ -86,21 +86,21 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
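- **metrics integration**
- **prometheus exporter mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), the user sets metrics to be reported over web and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#webserver%E9%85%8D%E7%BD%AE). gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.
+ **prometheus exporter mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets metrics to be reported over web and edits the web_server section of the configuration file. gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.
- **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), the user sets metrics to be reported to kafka and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports metrics periodically. The user needs to transfer the metrics data into prometheus.
+ **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets metrics to be reported to kafka and configures kafka_topic. gala-gopher then works as a kafka client and reports metrics periodically. The user needs to transfer the metrics data into prometheus.
- **event integration**
- **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), the user sets event to be reported as logs and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE). gala-gopher then works in logs mode and writes events as logs to the configured directory. The user can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.
+ **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets event to be reported as logs and configures the log path in the logs section. gala-gopher then works in logs mode and writes events as logs to the configured directory. The user can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.
- **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), the user sets event to be reported to kafka and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports events periodically.
+ **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets event to be reported to kafka and configures kafka_topic. gala-gopher then works as a kafka client and reports events periodically.
- **meta file integration**
- **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), the user sets meta to be reported as logs and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE). gala-gopher then works in logs mode and writes all meta files integrated in gala-gopher as logs to the configured directory. The user needs to forward the meta information to a kafka channel.
+ **logs mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets meta to be reported as logs and configures the log path in the logs section. gala-gopher then works in logs mode and writes all meta files integrated in gala-gopher as logs to the configured directory. The user needs to forward the meta information to a kafka channel.
- **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), the user sets event to be reported to kafka and configures the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports meta information periodically.
+ **kafka client mode**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), the user sets meta to be reported to kafka and configures kafka_topic. gala-gopher then works as a kafka client and reports meta information periodically.
### Extending the data collection scope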
@@ -108,13 +108,13 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr
- **Define the observed entity**
-Define a new observed entity (or update an existing one) to carry the newly collected metrics data. The user defines the key, labels (optional), and metrics of the observed entity in a meta file (for the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6)) and then places the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#23-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9B%AE%E5%BD%95).
+Define a new observed entity (or update an existing one) to carry the newly collected metrics data. The user defines the key, labels (optional), and metrics of the observed entity in a meta file (see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#2-%E5%AE%9A%E4%B9%89meta%E6%96%87%E4%BB%B6)) and then places the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%BC%80%E5%8F%91%E8%A7%86%E5%9B%BE).
- **Integrate the data probe**
-The user can wrap data collection software in any programming language (shell, python, java, etc.) and, in the script, output the collected data through a Linux pipe in the [format](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#123-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87) defined by the meta file.
+The user can wrap data collection software in any programming language (shell, python, java, etc.) and, in the script, output the collected data through a Linux pipe in the format defined by the meta file; see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#3-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87-1).
-Reference: the [cAdvisor](https://gitee.com/openeuler/gala-gopher/tree/master/src/probes/extends/python.probe/cadvisor.probe) third-party probe integration example.
+See the [cAdvisor third-party probe integration example](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%A6%82%E4%BD%95%E6%96%B0%E5%A2%9Eextends%E6%8E%A2%E9%92%88).
## gala-spider
diff --git a/gopher_tech_abnormal.md b/gopher_tech_abnormal.md
index 1e8abe75f4cff272b1a21319e2f05a42089a3ffd..6578764f2334e78abaccda338399e37a908adedd 100644
--- a/gopher_tech_abnormal.md
+++ b/gopher_tech_abnormal.md
@@ -1,66 +1,95 @@
-# TCP(entity_name:tcp_link)
+# gala-gopher system abnormal events
-| metrics_name | description | param | level |
-| ------------- | ------------------------------------------ | ----------------- | ----- |
-| tcp_oom | TCP out of memory(%u). | P1: error count | WARN |
-| backlog_drops | TCP backlog queue drops(%u). | P1: drops count | WARN |
-| filter_drops | TCP filter drops(%u). | P1: drops count | WARN |
-| syn_srtt | TCP connection establish timed out(%u us). | P1: syn rtt times | WARN |
+## Introduction
-# ENDPOINT
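+gala-gopher provides system anomaly detection. When starting a probe, the user can define the abnormal range through thresholds (upper and lower limits). The probe uses these thresholds to decide whether a metric is abnormal and, if so, reports an abnormal event.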
-| metrics_name | description | param | level |
-| ------------------- | ------------------------------- | ------------------ | ----- |
-| listendrop | TCP listen drops(%lu). | P1: drops count | WARN |
-| accept_overflow | TCP accept queue overflow(%lu). | P1: overflow count | WARN |
-| syn_overflow | TCP syn queue overflow(%lu). | P1: overflow count | WARN |
-| passive_open_failed | TCP passive open failed(%lu). | P1: failed count | WARN |
-| active_open_failed | TCP active open failed(%lu). | P1: failed count | WARN |
-| bind_rcv_drops | UDP(S) queue drops(%lu). | P1: drops count | WARN |
-| udp_rcv_drops | UDP(C) queue drops(%lu). | P1: drops count | WARN |
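+## How to enable abnormal events
+- For the probes that support abnormal events, see [Supported abnormal events](#supported-abnormal-events).
+- Enable abnormal event reporting with the probe startup parameter `-l WARN`.
+- Set thresholds. For example, `-U 80` sets the upper limit of resource utilization to 80%, and `-L 5` sets the lower limit to 5%.
+> Note: the abnormal event switch and the thresholds are passed as probe startup parameters. For the probe startup parameters, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E5%90%AF%E5%8A%A8%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D).
-# THREAD(entity_name:task)
+## Supported abnormal events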
-| metrics_name | description | param | level |
-| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| off_cpu_ns | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | WARN |
-| iowait_us | Process(COMM:%s TID:%d) iowait %llu us. | P1: process name P2: process id P3: io-wait times | WARN |
-| hang_count | Process(COMM:%s TID:%d) io hang %u. | P1: process name P2: process id P3: error count | WARN |
-| bio_err_count | Process(COMM:%s TID:%d) bio error %u. | P1: process name P2: process id P3: error count | WARN |
+This chapter lists the supported abnormal events per observed entity (`entity_name`).
-# Process
+### TCP_LINK
-| metrics_name | description | param | level |
-| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| syscall_failed | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5 failed count | WARN |
-| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u). | P1: process name P2: process id P3 failed count | WARN |
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------- | ------------------------------------------ | ----------------- | -------- | -------- |
+| tcp_oom | TCP out of memory(%u). | P1: error count | NA | WARN |
+| backlog_drops | TCP backlog queue drops(%u). | P1: drops count | [-D <>] | WARN |
+| filter_drops | TCP filter drops(%u). | P1: drops count | [-D <>] | WARN |
+| syn_srtt | TCP connection establish timed out(%u us). | P1: syn rtt times | [-T <>] | WARN |
-# BLOCK
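+> Note: an input parameter of NA means that no external threshold parameter is required. Internally, the probe decides whether the metric is abnormal by checking whether its value is 0.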
-| metrics_name | description | param | level |
-| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| count_iscsi_err | Iscsi errors(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | WARN |
-| count_iscsi_tmout | Iscsi timeout(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | WARN |
-| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush jitter latency, unit is us P2: block name P3: disk name | WARN |
-| latency_flush_max | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush latency, unit is us P2: block name P3: disk name | WARN |
-| latency_req_jitter | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request jitter latency, unit is us P2: block name P3: disk name | WARN |
-| latency_req_max | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request latency, unit is us P2: block name P3: disk name | WARN |
+### ENDPOINT
-# DISK
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------------- | ------------------------------- | ------------------ | -------- | -------- |
+| listendrop | TCP listen drops(%lu). | P1: drops count | NA | WARN |
+| accept_overflow | TCP accept queue overflow(%lu). | P1: overflow count | NA | WARN |
+| syn_overflow | TCP syn queue overflow(%lu). | P1: overflow count | NA | WARN |
+| passive_open_failed | TCP passive open failed(%lu). | P1: failed count | NA | WARN |
+| active_open_failed | TCP active open failed(%lu). | P1: failed count | NA | WARN |
+| bind_rcv_drops | UDP(S) queue drops(%lu). | P1: drops count | NA | WARN |
+| udp_rcv_drops | UDP(C) queue drops(%lu). | P1: drops count | NA | WARN |
-| metrics_name | description | param | level |
-| --------------- | ------------------------------- | -------------- | ----- |
-| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | WARN |
-| block_userd_per | Too many Blocks used(%d%%). | P1: Percentage | WARN |
-| iostat_util | Disk device saturated(%.2f%%). | P1: Percentage | WARN |
+### THREAD
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| off_cpu_ns | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | NA | WARN |
+| iowait_us | Process(COMM:%s TID:%d) iowait %llu us. | P1: process name P2: process id P3: io-wait times | [-T <>] | WARN |
+| hang_count | Process(COMM:%s TID:%d) io hang %u. | P1: process name P2: process id P3: error count | NA | WARN |
+| bio_err_count | Process(COMM:%s TID:%d) bio error %u. | P1: process name P2: process id P3: error count | NA | WARN |
+### PROC
-# NET
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| syscall_failed | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5: failed count | NA | WARN |
+| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u). | P1: process name P2: process id P3: failed count | NA | WARN |
-| metrics_name | description | param | level |
-| ------------------- | -------------------------------- | --------------- | ----- |
-| net_device_tx_drops | net device tx queue drops(%llu). | P1: drops count | WARN |
-| net_device_rx_drops | net device rx queue drops(%llu). | P1: drops count | WARN |
\ No newline at end of file
+### BLOCK
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | :------- |
+| count_iscsi_err | Iscsi errors(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | NA | WARN |
+| count_iscsi_tmout | Iscsi timeout(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | NA | WARN |
+| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush jitter latency, unit is us P2: block name P3: disk name | [-J <>] | WARN |
+| latency_flush_max | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush latency, unit is us P2: block name P3: disk name | [-T <>] | WARN |
+| latency_req_jitter | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request jitter latency, unit is us P2: block name P3: disk name | [-J <>] | WARN |
+| latency_req_max | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request latency, unit is us P2: block name P3: disk name | [-T <>] | WARN |
+
+### DISK
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ----------- | ------------------------------ | -------------- | -------- | -------- |
+| iostat_util | Disk device saturated(%.2f%%). | P1: Percentage | [-U <>] | WARN |
+
+### DF
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| --------------- | ------------------------------- | -------------- | -------- | -------- |
+| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | [-U <>] | WARN |
+| block_userd_per | Too many Blocks used(%d%%). | P1: Percentage | [-U <>] | WARN |
+
+### NIC
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| -------------------- | --------------------------------- | ---------------- | -------- | -------- |
+| net_device_tx_drops | net device tx queue drops(%llu). | P1: drops count | [-D <>] | WARN |
+| net_device_rx_drops | net device rx queue drops(%llu). | P1: drops count | [-D <>] | WARN |
+| net_device_tx_errors | net device tx queue errors(%llu). | P1: errors count | [-D <>] | WARN |
+| net_device_rx_errs | net device rx queue errors(%llu). | P1: errors count | [-D <>] | WARN |
+
+### CPU
+
+| Event name | Event message | Output parameters | Input parameters | Level |
+| ---------- | --------------------------------- | -------------- | -------- | -------- |
+| used_per | Too high cpu utilization(%.2f%%). | P1: Percentage | [-U <>] | WARN |
\ No newline at end of file