diff --git a/gopher_tech_abnormal.md b/gopher_tech_abnormal.md
index 3eb299241915e7e23ea1691c593345974b974639..c37d4c33b29d0e05d1aff0147aea6fcea2a9eac5 100644
--- a/gopher_tech_abnormal.md
+++ b/gopher_tech_abnormal.md
@@ -22,6 +22,51 @@ gala-gopher provides system anomaly detection capability and supports users, when starting each probe's
 | -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | -------- |
 | rtt_nsec(Redis/PG) | Process(TID:%d, CIP(%s:%u), SIP(%s:%u)) SLI(%s:%llu) exceed the threshold. | P1: process ID P2: client ip and port P3: server ip and port P4: command and SLI | [-T <>] | WARN |
 
+#### Output format
+
+```json
+{
+  "Timestamp": ,
+  "event_id": "____",  # tgid: application process ID; fd: the application's socket file descriptor
+  "Attributes": {
+    "entity_id": "___",
+    "event_id": "____",
+    "event_type": "sys"  # sys: the anomaly event is system-level
+  },
+  "Resource": {
+    "metrics": "gala_gopher_sli_"  # event_name: anomaly event name; see the first column of the table above
+  },
+  "SeverityText": "WARN",
+  "SeverityNumber": 13,
+  "Body": "Sat Aug 27 17:41:24 2022 WARN Entity() Process(TID:, CIP(:), SIP(:)) SLI(:) exceed the threshold."  # entity_id: _; rtt_nsec: response latency, in ns
+}
+```
+
+#### Output example
+
+- ##### rtt_nsec
+
+  The sli probe monitors and collects the response latency of a specific application (such as Redis) and reports an anomaly event when the latency exceeds the threshold. The threshold must be set manually with the `-T` parameter before gala-gopher starts.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_sli_3739183_23",
+    "Attributes": {
+      "entity_id": "e473b23xxx_sli_3739183_23",
+      "event_id": "1661593284000_e473b23xxx_sli_3739183_23",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_sli_rtt_nsec"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_23) Process(TID:3739183, CIP(9.82.194.97:48202), SIP(9.82.206.89:3742)) SLI(SET:48678) exceed the threshold."
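+    # The event information in this example contains the process ID of the Redis application, the IP and port of the requesting client and of redis-server, the request command, and the response latency of this request (in ns)
+  }
+  ```
+
 ### TCP_LINK
 
 | Anomaly event name | Event information | Output parameters | Input parameters | Anomaly level |
@@ -33,6 +78,117 @@ gala-gopher provides system anomaly detection capability and supports users, when starting each probe's
 
 > Note: An input parameter of NA means that no external threshold parameter is required; internally, whether an anomaly has occurred is decided by checking whether the metric value is 0.
 
+#### Output format
+
+```json
+{
+  "Timestamp": ,
+  "event_id": "_________",
+  # role: client/server; c_ip/s_ip: peer IP/local IP; c_port/s_port: peer port/local port; family: protocol family, e.g. IPv4
+  "Attributes": {
+    "entity_id": "________",
+    "event_id": "_________",
+    "event_type": "sys"
+  },
+  "Resource": {
+    "metrics": "gala_gopher_tcp_link_"  # event_name: anomaly event name; see the first column of the table above
+  },
+  "SeverityText": "WARN",
+  "SeverityNumber": 13,
+  "Body": "Sat Aug 27 17:41:24 2022 WARN Entity() ."  # descriptions: event information; see the second column of the table above
+}
+```
+
+#### Output example
+
+- ##### tcp_oom
+
+  The tcpprobe probe records the number of times TCP memory usage exceeds the system-configured maximum, that is, the number of times this TCP connection is detected to be out of memory. The threshold for this anomaly event does not need to be configured manually; it defaults to 0, so an anomaly event is reported whenever TCP out-of-memory is detected.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+    "Attributes": {
+      "entity_id": "e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_tcp_link_tcp_oom"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_0_9.82.194.97_9.82.206.89_78202_3742_2) TCP out of memory(1)."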
# The "1" in the trailing parentheses means OOM was detected once
+  }
+  ```
+
+- ##### backlog_drops
+
+  The tcpprobe probe records the number of packets dropped because the backlog queue is full and reports an anomaly event when that count exceeds the user-defined threshold. Set the threshold with the `-D` parameter before gala-gopher starts; if it is not set, this anomaly event is disabled.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+    "Attributes": {
+      "entity_id": "e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_tcp_link_backlog_drops"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_0_9.82.194.97_9.82.206.89_78202_3742_2) TCP backlog queue drops(3)."  # The threshold in this example is 2; the "3" in the trailing parentheses is the number of such drops detected
+  }
+  ```
+
+- ##### filter_drops
+
+  The tcpprobe probe records the number of packets dropped by the filter and reports an anomaly event when that count exceeds the user-defined threshold. Set the threshold with the `-D` parameter before gala-gopher starts; if it is not set, this anomaly event is disabled. This event shares the same parameter with the backlog_drops anomaly event.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+    "Attributes": {
+      "entity_id": "e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_tcp_link_filter_drops"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_0_9.82.194.97_9.82.206.89_78202_3742_2) TCP filter drops(4)."
# The "4" in the trailing parentheses is the number of such drops detected
+  }
+  ```
+
+- ##### syn_srtt
+
+  After the server receives a SYN request, tcpprobe records the time elapsed from sending the SYN/ACK to receiving the client's ACK, and reports an anomaly event when this connection-establishment time exceeds the user-defined threshold. Set the threshold with the `-T` parameter before gala-gopher starts.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+    "Attributes": {
+      "entity_id": "e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_id": "1661593284000_e473b23xxx_tcp_link_3739183_0_9.82.194.97_9.82.206.89_78202_3742_2",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_tcp_link_syn_srtt"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_0_9.82.194.97_9.82.206.89_78202_3742_2) TCP connection establish timed out(10 us)."
+  }
+  ```
+
 ### ENDPOINT
 
 | Anomaly event name | Event information | Output parameters | Input parameters | Anomaly level |
@@ -45,6 +201,184 @@ gala-gopher provides system anomaly detection capability and supports users, when starting each probe's
 | bind_rcv_drops | UDP(S) queue drops(%lu). | P1: drops count | NA | WARN |
 | udp_rcv_drops | UDP(C) queue drops(%lu). | P1: drops count | NA | WARN |
 
+#### Output format
+
+```json
+{
+  "Timestamp": ,
+  "event_id": "______",
+  # s_ip and s_port may be empty, see the examples; table_name: one of listen|connect|bind|udp
+  "Attributes": {
+    "entity_id": "_____",
+    "event_id": "______",
+    "event_type": "sys"
+  },
+  "Resource": {
+    "metrics": "gala_gopher_endpoint_"  # event_name: anomaly event name; see the first column of the table above
+  },
+  "SeverityText": "WARN",
+  "SeverityNumber": 13,
+  "Body": "Sat Aug 27 17:41:24 2022 WARN Entity() ."
# descriptions: event information; see the second column of the table above
+  # entity_id: ___
+}
+```
+
+#### Output example
+
+- ##### listendrop
+
+  An anomaly event is reported when the number of TCP accept drops is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_listendrop"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_*_3742_listen) TCP listen drops(1)."  # The "1" in the trailing parentheses is the number of dropped packets
+  }
+  ```
+
+- ##### accept_overflow
+
+  An anomaly event is reported when the number of TCP accept queue (full-connection queue) overflows is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_accept_overflow"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_*_3742_listen) TCP accept queue overflow(1)."
+  }  # The "1" in the trailing parentheses is the number of queue overflows
+  ```
+
+- ##### syn_overflow
+
+  An anomaly event is reported when the number of TCP SYN queue (half-open connection queue) overflows is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_syn_overflow"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_*_3742_listen) TCP syn queue overflow(1)."
+  }
+  ```
+
+- ##### passive_open_failed
+
+  An anomaly event is reported when the number of failed passive TCP connection opens is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_*_3742_listen",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_passive_open_failed"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_*_3742_listen) TCP passive open failed(1)."
+  }  # The "1" in the trailing parentheses is the number of failures
+  ```
+
+- ##### active_open_failed
+
+  An anomaly event is reported when the number of failed active TCP connection opens is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_connect",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_9.82.206.89_0_connect",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_connect",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_active_open_failed"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_9.82.206.89_0_connect) TCP active open failed(1)."
+  }
+  ```
+
+- ##### bind_rcv_drops
+
+  An anomaly event is reported when the number of UDP server-side receive failures is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_bind",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_9.82.206.89_0_bind",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_bind",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_bind_rcv_drops"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_9.82.206.89_0_bind) UDP(S) queue drops(1)."
+  }
+  ```
+
+- ##### udp_rcv_drops
+
+  An anomaly event is reported when the number of UDP client-side receive failures is greater than 0. The threshold for this anomaly event does not need to be configured manually; it defaults to 0.
+
+  ```json
+  {
+    "Timestamp": 1661593284000,
+    "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_udp",
+    "Attributes": {
+      "entity_id": "e473b23xxx_endpoint_3739183_9.82.206.89_0_udp",
+      "event_id": "1661593284000_e473b23xxx_endpoint_3739183_9.82.206.89_0_udp",
+      "event_type": "sys"
+    },
+    "Resource": {
+      "metrics": "gala_gopher_endpoint_udp_rcv_drops"
+    },
+    "SeverityText": "WARN",
+    "SeverityNumber": 13,
+    "Body": "Sat Aug 27 17:41:24 2022 WARN Entity(3739183_9.82.206.89_0_udp) UDP(C) queue drops(1)."
+  }
+  ```
+
 ### THREAD
 
 | Anomaly event name | Event information | Output parameters | Input parameters | Anomaly level |