From 22b6377e12113b07093e44735cdb3a60c259bfc6 Mon Sep 17 00:00:00 2001
From: dowzyx
Date: Mon, 19 Sep 2022 21:31:13 +0800
Subject: [PATCH] docs(gopher): modify README and gopher_tech_abnormal docs

---
 README.md               |  22 +++----
 gopher_tech_abnormal.md | 127 ++++++++++++++++++++++++----------------
 2 files changed, 89 insertions(+), 60 deletions(-)

diff --git a/README.md b/README.md
index f8db51e..a24ea37 100644
--- a/README.md
+++ b/README.md
@@ -68,9 +68,9 @@ For the gala-gopher software architecture, see [here](https://gitee.com/openeuler/gala-gopher/tr
 **Terminology**

 - **Probe**: a program inside gala-gopher that performs a specific data collection task. There are two kinds, native and extend: a native probe runs its collection task in a separate thread, while an extend probe runs it in a child process. gala-gopher can start some or all probes, depending on its configuration.
-- **Observed entity (entity_name)**: defines an observed object in the system; all data collected by probes is attributed to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link observed entity includes the process ID, the IP 5-tuple, and the protocol family, and its metrics include runtime indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](https://gitee.com/openeuler/gala-docs#%E8%A7%82%E6%B5%8B%E5%AE%9E%E4%BD%93), gala-gopher can be extended with new observed entities.
+- **Observed entity (entity_name)**: defines an observed object in the system; all data collected by probes is attributed to a specific observed entity. Each observed entity consists of a key, labels (optional), and metrics. For example, the key of the tcp_link observed entity includes the process ID, the IP 5-tuple, and the protocol family, and its metrics include runtime indicators such as tx, rx, and rtt. Besides the natively supported [observed entities](#观测实体), gala-gopher can be extended with new observed entities.
 - **Data table (table_name)**: an observed entity is composed of one or more data tables. A data table is usually produced by one collection task, so a single observed entity may be produced jointly by several collection tasks.
-- **meta file**: a file that defines an observed entity (including its data tables). meta files must be unique within the system, and their definitions must not conflict. For the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6).
+- **meta file**: a file that defines an observed entity (including its data tables). meta files must be unique within the system, and their definitions must not conflict. For the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#meta%E6%96%87%E4%BB%B6%E5%AE%9A%E4%B9%89%E8%A7%84%E8%8C%83).

 ### Supported technologies

@@ -86,21 +86,21 @@ For the gala-gopher software architecture, see [here](https://gitee.com/openeuler/gala-gopher/tr

 - **Integrating metrics**

-  **prometheus exporter mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set metrics to be reported over web and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#webserver%E9%85%8D%E7%BD%AE). gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.
+  **prometheus exporter mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set metric to be reported over web and modify the web_server section of the configuration file. gala-gopher then works as a prometheus exporter and passively responds to GET requests for metrics data.

-  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set metrics to be reported through kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports metrics periodically. Users need to transfer the metrics data into prometheus.
+  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set metrics to be reported through kafka and configure kafka_topic. gala-gopher then works as a kafka client and reports metrics periodically. Users need to transfer the metrics data into prometheus.

 - **Integrating events**

-  **logs mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set event to be reported through logs and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE). gala-gopher then works in logs mode and writes events as logs to the configured directory. Users can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.
+  **logs mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set event to be reported through logs and configure the log path in the logs section. gala-gopher then works in logs mode and writes events as logs to the configured directory. Users can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.

-  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set event to be reported through kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports events periodically.
+  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set event to be reported through kafka and configure kafka_topic. gala-gopher then works as a kafka client and reports events periodically.

 - **Integrating meta files**

-  **logs mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set meta to be reported through logs and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE). gala-gopher then works in logs mode and writes all meta files integrated in gala-gopher as logs to the configured directory. Users need to forward the meta information to a kafka channel.
+  **logs mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set meta to be reported through logs and configure the log path in the logs section. gala-gopher then works in logs mode and writes all meta files integrated in gala-gopher as logs to the configured directory. Users need to forward the meta information to a kafka channel.

-  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set event to be reported through kafka and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE). gala-gopher then works as a kafka client and reports meta information periodically.
+  **kafka client mode**: following the gala-gopher configuration file [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E9%85%8D%E7%BD%AE%E4%BB%8B%E7%BB%8D), set event to be reported through kafka and configure kafka_topic. gala-gopher then works as a kafka client and reports meta information periodically.

 ### Extending the data collection scope

@@ -108,13 +108,13 @@ For the gala-gopher software architecture, see [here](https://gitee.com/openeuler/gala-gopher/tr

 - **Define the observed entity**

-Define a new observed entity (or update an existing one) to carry the newly collected metrics data. Use a meta file (for the specification, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6)) to define the key, labels (optional), and metrics of the observed entity, then place the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#23-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9B%AE%E5%BD%95).
+Define a new observed entity (or update an existing one) to carry the newly collected metrics data. Use a meta file (see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#2-%E5%AE%9A%E4%B9%89meta%E6%96%87%E4%BB%B6)) to define the key, labels (optional), and metrics of the observed entity, then place the meta file in the [probe directory](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%BC%80%E5%8F%91%E8%A7%86%E5%9B%BE).

 - **Integrate the data probe**

-Users can wrap data collection software in any programming language (shell, python, java, and so on) and, in the script, output the collected data through a linux pipe in the [format](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#123-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87) defined by the meta file.
+Users can wrap data collection software in any programming language (shell, python, java, and so on) and, in the script, output the collected data through a linux pipe in the format defined by the meta file; see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#3-%E8%BE%93%E5%87%BA%E6%8E%A2%E9%92%88%E6%8C%87%E6%A0%87-1).

-Reference: the [cAdvisor](https://gitee.com/openeuler/gala-gopher/tree/master/src/probes/extends/python.probe/cadvisor.probe) third-party probe integration example.
+See the [cAdvisor third-party probe integration example](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#%E5%A6%82%E4%BD%95%E6%96%B0%E5%A2%9Eextends%E6%8E%A2%E9%92%88).

 ## gala-spider

diff --git a/gopher_tech_abnormal.md b/gopher_tech_abnormal.md
index 1e8abe7..6578764 100644
--- a/gopher_tech_abnormal.md
+++ b/gopher_tech_abnormal.md
@@ -1,66 +1,95 @@
-# TCP(entity_name:tcp_link)
+# gala-gopher system abnormal events

-| metrics_name | description | param | level |
-| ------------- | ------------------------------------------ | ----------------- | ----- |
-| tcp_oom | TCP out of memory(%u). | P1: error count | WARN |
-| backlog_drops | TCP backlog queue drops(%u). | P1: drops count | WARN |
-| filter_drops | TCP filter drops(%u). | P1: drops count | WARN |
-| syn_srtt | TCP connection establish timed out(%u us). | P1: syn rtt times | WARN |
+## Introduction

-# ENDPOINT
+gala-gopher provides system anomaly detection. When starting a probe, users can define the abnormal range with thresholds (upper and lower limits). The probe uses the thresholds to decide whether a metric is abnormal and, if it is, reports an abnormal event.

-| metrics_name | description | param | level |
-| ------------------- | ------------------------------- | ------------------ | ----- |
-| listendrop | TCP listen drops(%lu). | P1: drops count | WARN |
-| accept_overflow | TCP accept queue overflow(%lu). | P1: overflow count | WARN |
-| syn_overflow | TCP syn queue overflow(%lu). | P1: overflow count | WARN |
-| passive_open_failed | TCP passive open failed(%lu). | P1: failed count | WARN |
-| active_open_failed | TCP active open failed(%lu). | P1: failed count | WARN |
-| bind_rcv_drops | UDP(S) queue drops(%lu). | P1: drops count | WARN |
-| udp_rcv_drops | UDP(C) queue drops(%lu). | P1: drops count | WARN |
+## How to enable abnormal events

+- For the probes that support abnormal events, see [Supported abnormal events](#支持的异常事件).
+- Enable abnormal event reporting with the probe startup parameter `-l WARN`.
+- Set thresholds. For example, `-U 80` sets the upper limit of resource utilization to 80%, and `-L 5` sets the lower limit to 5%.

+> Note: the abnormal event switch and the thresholds are passed as probe startup parameters. For the startup parameters, see [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#%E5%90%AF%E5%8A%A8%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D).

-# THREAD(entity_name:task)
+## Supported abnormal events

-| metrics_name | description | param | level |
-| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| off_cpu_ns | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | WARN |
-| iowait_us | Process(COMM:%s TID:%d) iowait %llu us. | P1: process name P2: process id P3: io-wait times | WARN |
-| hang_count | Process(COMM:%s TID:%d) io hang %u. | P1: process name P2: process id P3: error count | WARN |
-| bio_err_count | Process(COMM:%s TID:%d) bio error %u. | P1: process name P2: process id P3: error count | WARN |
+This chapter lists the supported abnormal events for each observed entity (`entity_name`).

-# Process
+### TCP_LINK

-| metrics_name | description | param | level |
-| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| syscall_failed | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5 failed count | WARN |
-| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u). | P1: process name P2: process id P3 failed count | WARN |
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ------------- | ------------------------------------------ | ----------------- | -------- | -------- |
+| tcp_oom | TCP out of memory(%u). | P1: error count | NA | WARN |
+| backlog_drops | TCP backlog queue drops(%u). | P1: drops count | [-D <>] | WARN |
+| filter_drops | TCP filter drops(%u). | P1: drops count | [-D <>] | WARN |
+| syn_srtt | TCP connection establish timed out(%u us). | P1: syn rtt times | [-T <>] | WARN |

-# BLOCK
+> Note: an input parameter of NA means that no external threshold parameter is required; the implementation decides whether the metric is abnormal by checking whether its value is 0.

-| metrics_name | description | param | level |
-| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----- |
-| count_iscsi_err | Iscsi errors(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | WARN |
-| count_iscsi_tmout | Iscsi timeout(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | WARN |
-| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush jitter latency, unit is us P2: block name P3: disk name | WARN |
-| latency_flush_max | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush latency, unit is us P2: block name P3: disk name | WARN |
-| latency_req_jitter | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request jitter latency, unit is us P2: block name P3: disk name | WARN |
-| latency_req_max | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request latency, unit is us P2: block name P3: disk name | WARN |
+### ENDPOINT

-# DISK
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ------------------- | ------------------------------- | ------------------ | -------- | -------- |
+| listendrop | TCP listen drops(%lu). | P1: drops count | NA | WARN |
+| accept_overflow | TCP accept queue overflow(%lu). | P1: overflow count | NA | WARN |
+| syn_overflow | TCP syn queue overflow(%lu). | P1: overflow count | NA | WARN |
+| passive_open_failed | TCP passive open failed(%lu). | P1: failed count | NA | WARN |
+| active_open_failed | TCP active open failed(%lu). | P1: failed count | NA | WARN |
+| bind_rcv_drops | UDP(S) queue drops(%lu). | P1: drops count | NA | WARN |
+| udp_rcv_drops | UDP(C) queue drops(%lu). | P1: drops count | NA | WARN |

-| metrics_name | description | param | level |
-| --------------- | ------------------------------- | -------------- | ----- |
-| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | WARN |
-| block_userd_per | Too many Blocks used(%d%%). | P1: Percentage | WARN |
-| iostat_util | Disk device saturated(%.2f%%). | P1: Percentage | WARN |
+### THREAD

+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| off_cpu_ns | Process(COMM:%s TID:%d) is preempted(COMM:%s PID:%d) and off-CPU %llu ns. | P1: process name P2: process id P3: process name P4: process id P5: off-cpu times | NA | WARN |
+| iowait_us | Process(COMM:%s TID:%d) iowait %llu us. | P1: process name P2: process id P3: io-wait times | [-T <>] | WARN |
+| hang_count | Process(COMM:%s TID:%d) io hang %u. | P1: process name P2: process id P3: error count | NA | WARN |
+| bio_err_count | Process(COMM:%s TID:%d) bio error %u. | P1: process name P2: process id P3: error count | NA | WARN |

+### PROC

-# NET
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
+| syscall_failed | Process(COMM:%s PID:%u) syscall failed(SysCall-ID:%d RET:%d COUNT:%u). | P1: process name P2: process id P3: syscall no P4: syscall ret-code P5 failed count | NA | WARN |
+| gethostname_failed | Process(COMM:%s PID:%u) gethostname failed(COUNT:%u). | P1: process name P2: process id P3 failed count | NA | WARN |

-| metrics_name | description | param | level |
-| ------------------- | -------------------------------- | --------------- | ----- |
-| net_device_tx_drops | net device tx queue drops(%llu). | P1: drops count | WARN |
-| net_device_rx_drops | net device rx queue drops(%llu). | P1: drops count | WARN |
\ No newline at end of file
+### BLOCK
+
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | :------- |
+| count_iscsi_err | Iscsi errors(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | NA | WARN |
+| count_iscsi_tmout | Iscsi timeout(%llu) occured on Block(%s, disk %s). | P1: block name P2: disk name | NA | WARN |
+| latency_flush_jitter | Jitter latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush jitter latency, unit is us P2: block name P3: disk name | [-J <>] | WARN |
+| latency_flush_max | Latency of flush operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:flush latency, unit is us P2: block name P3: disk name | [-T <>] | WARN |
+| latency_req_jitter | Jitter latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request jitter latency, unit is us P2: block name P3: disk name | [-J <>] | WARN |
+| latency_req_max | Latency of request operation(%llu) exceeded threshold, occured on Block(%s, disk %s). | P1:request latency, unit is us P2: block name P3: disk name | [-T <>] | WARN |
+
+### DISK
+
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ----------- | ------------------------------ | -------------- | -------- | -------- |
+| iostat_util | Disk device saturated(%.2f%%). | P1: Percentage | [-U <>] | WARN |
+
+### DF
+
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| --------------- | ------------------------------- | -------------- | -------- | -------- |
+| inode_userd_per | Too many Inodes consumed(%d%%). | P1: Percentage | [-U <>] | WARN |
+| block_userd_per | Too many Blocks used(%d%%). | P1: Percentage | [-U <>] | WARN |
+
+### NIC
+
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| -------------------- | --------------------------------- | ---------------- | -------- | -------- |
+| net_device_tx_drops | net device tx queue drops(%llu). | P1: drops count | [-D <>] | WARN |
+| net_device_rx_drops | net device rx queue drops(%llu). | P1: drops count | [-D <>] | WARN |
+| net_device_tx_errors | net device tx queue errors(%llu). | P1: errors count | [-D <>] | WARN |
+| net_device_rx_errs | net device tx queue errors(%llu). | P1: errors count | [-D <>] | WARN |
+
+### CPU
+
+| Event name | Event message | Output parameters | Input parameters | Severity level |
+| ---------- | --------------------------------- | -------------- | -------- | -------- |
+| used_per | Too high cpu utilization(%.2f%%). | P1: Percentage | [-U <>] | WARN |
\ No newline at end of file
--
Gitee
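
The threshold semantics that gopher_tech_abnormal.md describes can be sketched as follows. This is a minimal illustration in Python, not gala-gopher's implementation: the names `is_abnormal` and `cpu_event`, the strict `>` and `<` comparisons, and the reading of NA as "any nonzero value is abnormal" are all assumptions. Only the meaning of the upper and lower limits (`-U`, `-L`) and the CPU message template are taken from the document.

```python
from typing import Optional

# Message template copied from the CPU table in gopher_tech_abnormal.md.
CPU_EVENT_TEMPLATE = "Too high cpu utilization(%.2f%%)."


def is_abnormal(value: float,
                upper: Optional[float] = None,
                lower: Optional[float] = None) -> bool:
    """Hypothetical helper: decide whether a metric value is abnormal.

    upper / lower stand for the limits passed with -U / -L.
    With no limit at all (input parameter NA in the tables), the value is
    compared against 0; treating nonzero as abnormal is an assumption.
    """
    if upper is None and lower is None:
        return value != 0
    if upper is not None and value > upper:
        return True
    if lower is not None and value < lower:
        return True
    return False


def cpu_event(used_per: float, upper: float) -> Optional[str]:
    """Hypothetical helper: build the WARN message, or None if within the limit."""
    if is_abnormal(used_per, upper=upper):
        return CPU_EVENT_TEMPLATE % used_per
    return None


if __name__ == "__main__":
    # Corresponds to starting a probe with an upper limit of 80 (-U 80).
    print(cpu_event(91.5, upper=80))
    print(cpu_event(42.0, upper=80))
```

Keeping the NA case in the same helper mirrors the note under the TCP_LINK table: events without an input parameter need no threshold from the user.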