diff --git a/README.md b/README.md index c2e09624b9af8e7e5a309d8a6b58550226c69ef0..7eef3c10160c1b8d59e2b3af284493edd2b3dbdb 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,7 @@ gala-ops具备四个组件(gala-gopher、gala-spider、gala-anteater、gala-in 集群部署模式下,gala-ops四个组件需要协同起来工作,主要依赖kafka、arangodb等软件。下图是系统集成关系图,通常将gala-gopher部署于生产环境中,其他组件(包括kafka、prometheus、arangodb等中间件)部署于管理面,用户需要确保管理面内几个组件与中间件能够互通。 -gala-gopher与管理面可能无法直接互通,为此gala-gopher也提供多种[被集成方式](https://gitee.com/openeuler/gala-docs/blob/master/README.md "被集成方式")。 +gala-gopher与管理面可能无法直接互通,为此gala-gopher也提供多种被集成方式。 ![](./system_integration.png) @@ -78,7 +78,7 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr ### 支持的技术 -参考这里 +参考[这里](https://gitee.com/openeuler/gala-docs/blob/master/gopher_tech.md) ### 安装及使用 @@ -220,6 +220,12 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr |----------------------------------|---------------后端服务侧端口(server_port) + |----------------------------------调度系统(sched) + + |----------------------------------网络系统(net) + + |----------------------------------文件系统(fs) + |----------------------------------网卡(nic) |----------------------------------|---------------网卡队列(qdisc) @@ -228,16 +234,10 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr |----------------------------------|---------------磁盘逻辑卷/分区(block) - |----------------------------------网络(net) - - |----------------------------------文件系统(fs) - |----------------------------------cpu |----------------------------------内存(mem) - |----------------------------------调度(sched) - 上述部分观测实体介绍: - **NGINX(包括LVS/HAPROXY)**:在分布式应用场景中,通常会引入LoadBalancer,以实现业务流的弹性伸缩。通过提供LoadBalancer中间件的观测信息,可以更好的展示分布式应用的实时业务流拓扑。 @@ -262,9 +262,23 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr - 进程层:呈现进程实例之间的TCP流实时拓扑 - 主机层:呈现主机之间的TCP/IP实时拓扑(由进程级拓扑实时计算得出) -![](./horizontal_topology.png) +![](./app_horizontal_topology.png) + +​ 应用层水平拓扑 + + + +![](./proc_horizontal_topology.png) + +​ 进程层水平拓扑 + + + +![](./host_horizontal_topology.png) -水平拓扑API:? +​ 主机层水平拓扑 + +**注意**:水平拓扑只呈现应用、进程、主机三种观测实体。 ### 垂直拓扑 @@ -272,21 +286,47 @@ gala-gopher软件架构参考[这里](https://gitee.com/openeuler/gala-gopher/tr 通过选择水平拓扑内的对象(比如上图中应用实例、进程实例或Node实例)可以从不同对象实例视角观察垂直拓扑。 -下面从左至右分别给出应用实例、进程实例、Node实例视角的垂直拓扑效果。![](./vertical_topology.png) +![](./app_vertical_topology.png) + +​ 选中应用实例,进入应用实例垂直视图 + +**注意**:进程层内的观测实体只呈现进程、容器观测实体。 + +![](./proc_vertical_topology.png) + +​ 选中进程实例,进入进程实例垂直视图 + +**注意**:进程内的观测实体应呈现出进程、容器及其所有关联子对象。 + +![](./host_vertical_topology.png) + +​ 选中主机实例,进入主机实例垂直视图 -垂直拓扑API:? +**注意**:进程层内应呈现主机内所有进程、容器实例,但不呈现其关子对象。 +另外,为了让**水平、垂直拓扑视图可以自由切换**,垂直拓扑视图内可以选中任意应用实例、进程实例、Node实例跳转至水平拓扑视图。跳转后,**水平视图应该将用户选中的实体作为视窗中心**,并展现相应层次内所有实体对象及其关系。 +### 拓扑变化 -另外,为了让水平、垂直拓扑视图可以自由切换,垂直拓扑视图内可以选择任意应用实例、进程实例、Node实例跳转至水平拓扑。相应的API如下:? 
+系统运行过程中,水平/垂直拓扑会随着实时运行情况而刷新,比如某个观测实体下线(比如进程),在2小时(时间可设定)范围内,拓扑内依然能够看到,但是应将其设置成虚线,以示区分。 +![](./proc_horizontal_change.png) +​ 进程离线后,水平拓扑发生变化 + +### 查看观测实体状态 + +在垂直拓扑视图内,用户可以选中任意观测实体,查看观测实体运行状态。 + +![](./entity_detail.png) ## 异常检测 gala-ops具备2种异常检测能力:系统异常(也叫系统隐患)、应用异常。前者覆盖网络、磁盘I/O、调度、内存、文件系统等各类系统异常场景;后者包括常见应用时延KPI异常(包括redis、HTTP(S)、PG等),并且用户可以根据自身场景扩展KPI范围(要求KPI数据符合时序数据规范)。 -异常检测结果会标识出具体的观测实体,以及异常原因。用户可以通过kafka topic获取系统实时异常信息。通过例子我们解读下异常结果信息: +异常检测结果会标识出具体的观测实体,以及异常原因。用户可以通过kafka topic获取系统实时异常信息。 + +- 样例1:主机对象内block观测实体异常: ``` { @@ -321,7 +361,58 @@ gala-ops具备2种异常检测能力:系统异常(也叫系统隐患)、 为了更好的展示异常事件所处集群系统的位置,用户可以通过异常表格内的观测实体ID跳转至垂直拓扑视图,举例如下: -![](./host_hrchitecture.png) +![](./host_anomaly_detection.png) + +**注意**: + +- 如果观测实体属于主机层,垂直视图则应跳转至主机实例垂直视图,并且该观测实体应与周边实体颜色区分。 +- 用户选中存在异常的观测实体后,应跳出具体异常的metrics视图,并定位至出现异常数据的时间区间。 +- 水平视图应随即进入该主机为中心的水平拓扑视图。 + + + +- 样例2:进程对象内tcp_link观测实体异常: + +``` +{ + "Timestamp": 1586960586000000000, // 异常事件时间戳 + "event_id": "1586xxx_xxxx" // 异常事件ID + "Attributes": { + "entity_id": "xx", // 发生异常的观测实体ID(集群内唯一) + "event_id": "1586xxx_xxxx", // 异常事件ID(同上) + "event_type": "sys", // 异常事件类型(sys: 系统异常,app:应用异常) + "data": [....], // optional + "duration": 30, // optional + "occurred count": 6,// optional + }, + "Resource": { + "metrics": "gala_gopher_tcp_link_backlog_drops", // 产生异常的metrics + }, + "SeverityText": "WARN", // 异常级别 + "SeverityNumber": 13, // 异常级别编号 + "Body": "20200415T072306-0700 WARN Entity(xx) TCP backlog queue drops(13)." // 异常事件描述 +} +``` + + + +用户通过kafka订阅到异常事件后,可以表格化管理,以时间段形式呈现管理,如下: + +| 时间 | 异常事件ID | 观测实体ID | Metrics | 描述 | +| ----------------- | ------------ | ---------- | ---------------------------------- | ------------------------------------------------------------ | +| 11:23:54 CST 2022 | 1586xxx_xxxx | xxx_xxxx | gala_gopher_tcp_link_backlog_drops | 20200415T072306-0700 WARN Entity(xx) TCP backlog queue drops(13). 
| + +**注意**:一定时间段范围内,同一个观测实体可能会报重复上报相同异常(但事件ID不同)。所以需要基于**观测实体ID + Metrics** 去重处理。 + +为了更好的展示异常事件所处集群系统的位置,用户可以通过异常表格内的观测实体ID跳转至垂直拓扑视图,举例如下: + +![](./proc_anomaly_detection.png) + +**注意**: + +- 如果观测实体属于进程层,垂直视图则应跳转至具体进程实例的垂直视图,并且该观测实体应与周边实体颜色区分。 +- 用户选中存在异常的观测实体后,应跳出具体异常的metrics视图,并定位至出现异常数据的时间区间。 +- 水平视图应随即进入该进程为中心的水平拓扑视图。 ## 根因定位 diff --git a/app_horizontal_topology.png b/app_horizontal_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..78f385317d5c1dcf6e4e27f1b212df1ccabc7030 Binary files /dev/null and b/app_horizontal_topology.png differ diff --git a/app_vertical_topology.png b/app_vertical_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..5cf7f5838d1e2e336ef32aa475ce76827f9b74d5 Binary files /dev/null and b/app_vertical_topology.png differ diff --git a/entity_detail.png b/entity_detail.png new file mode 100644 index 0000000000000000000000000000000000000000..6a4ab02ff449e353ffb0cd11c0fc38979d9e68b3 Binary files /dev/null and b/entity_detail.png differ diff --git a/gopher_tech.md b/gopher_tech.md index c75cd749d372484ae02c34d51f5b1edcd988d57c..46b3982acc48ae03ac77b08bb5531a9ee94683ab 100644 --- a/gopher_tech.md +++ b/gopher_tech.md @@ -1,297 +1,297 @@ # TCP(entity_name:tcp_link) -| metrics_name | table_name | metrics_type | unit | metrics description | -| ------------------- | ----------- | ------------ | ------------------ | ------------------------------------------------------------ | -| tgid | | key | | 进程ID | -| role | | key | | 客户端/服务端 | -| client_ip | | key | | 客户端:本地IP;服务端:对端IP | -| server_ip | | key | | 客户端:对端IP;服务端:本地IP | -| client_port | | key | | 客户端:本地端口;服务端:对端端口 | -| server_port | | key | | 客户端:对端端口;服务端:本地端口 | -| protocol | | key | | 协议族(IPv4、IPv6) | -| rx_bytes | tcp_tx_rx | Gauge | bytes | rx bytes | -| tx_bytes | tcp_tx_rx | Gauge | bytes | tx bytes | -| rto | tcp_rate | Gauge | | Retransmission timeOut(us) | -| ato | tcp_rate | Gauge | | Estimated value of delayed ACK(us) | -| srtt | tcp_rtt | Gauge | us | Smoothed Round Trip Time(us). | -| snd_ssthresh | tcp_rate | Gauge | | Slow start threshold for congestion control. | -| rcv_ssthresh | tcp_rate | Gauge | | Current receive window size. | -| snd_cwnd | tcp_windows | Gauge | | Congestion Control Window Size. | -| advmss | tcp_rate | Gauge | | Local MSS upper limit. | -| reordering | tcp_windows | Gauge | | Segments to be reordered. | -| rcv_rtt | tcp_rtt | Gauge | us | Receive end RTT (unidirectional measurement). | -| rcv_space | tcp_rate | Gauge | | Current receive buffer size. | -| notsent_bytes | tcp_windows | Gauge | bytes | Number of bytes not sent currently. | -| notack_bytes | tcp_windows | Gauge | bytes | Number of bytes not ack currently. | -| snd_wnd | tcp_windows | Gauge | | Size of TCP send window. | -| rcv_wnd | tcp_windows | Gauge | | Size of TCP receive window. | -| delivery_rate | tcp_rate | Gauge | | Current transmit rate (multiple different from the actual value). | -| busy_time | tcp_rate | Gauge | | Time (jiffies) busy sending data. | -| rwnd_limited | tcp_rate | Gauge | | Time (jiffies) limited by receive window. | -| sndbuf_limited | tcp_rate | Gauge | | Time (jiffies) limited by send buffer. | -| pacing_rate | tcp_rate | Gauge | bytes per second | TCP pacing rate, bytes per second | -| max_pacing_rate | tcp_rate | Gauge | bytes per second | MAX TCP pacing rate, bytes per second | -| sk_err_que_size | tcp_sockbuf | Gauge | | Size of error queue in sock. | -| sk_rcv_que_size | tcp_sockbuf | Gauge | | Size of receive queue in sock. 
| -| sk_wri_que_size | tcp_sockbuf | Gauge | | Size of write queue in sock. | -| syn_srtt | tcp_srtt | Gauge | us | RTT of syn packet(us). | -| sk_backlog_size | tcp_sockbuf | Gauge | | Size of backlog queue in sock. | -| sk_omem_size | tcp_sockbuf | Gauge | | Size of omem in sock. | -| sk_forward_size | tcp_sockbuf | Gauge | | Size of forward in sock. | -| sk_wmem_size | tcp_sockbuf | Gauge | | Size of wmem in sock. | -| segs_in | tcp_tx_rx | Counter | segs | total number of segments received | -| segs_out | tcp_tx_rx | Counter | segs | total number of segments sent | -| retran_packets | tcp_abn | Gauge | | total number of retrans | -| backlog_drops | tcp_abn | Gauge | | drops caused by backlog queue full | -| sk_drops | tcp_abn | Counter | | tcp drop counter | -| lost_out | tcp_abn | Gauge | | tcp lost counter | -| sacked_out | tcp_abn | Gauge | | tcp sacked out counter | -| filter_drops | tcp_abn | Gauge | | drops caused by socket filter | -| tmout_count | tcp_abn | Gauge | | counter of tcp link timeout | -| snd_buf_limit_count | tcp_abn | Gauge | | counter of limits when allocate wmem | -| rmem_scheduls | tcp_abn | Gauge | | rmem is not enough | -| tcp_oom | tcp_abn | Gauge | | tcp out of memory | -| send_rsts | tcp_abn | Gauge | | send_rsts | -| receive_rsts | tcp_abn | Gauge | | receive_rsts | -| sk_err | tcp_abn | Gauge | | sk_err | -| sk_err_soft | tcp_abn | Gauge | | sk_err_soft | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| ------------------- | ----------- | ------------ | ------------------ | ---- | ------------------------------------------------------------ | +| tgid | | key | | | 进程ID | +| role | | key | | | 客户端/服务端 | +| client_ip | | key | | | 客户端:本地IP;服务端:对端IP | +| server_ip | | key | | | 客户端:对端IP;服务端:本地IP | +| client_port | | key | | | 客户端:本地端口;服务端:对端端口 | +| server_port | | key | | | 客户端:对端端口;服务端:本地端口 | +| protocol | | key | | | 协议族(IPv4、IPv6) | +| rx_bytes | tcp_tx_rx | Gauge | bytes | Y | rx bytes | +| tx_bytes | tcp_tx_rx | Gauge | bytes | Y | tx bytes | +| rto | tcp_rate | Gauge | | | Retransmission timeOut(us) | +| ato | tcp_rate | Gauge | | | Estimated value of delayed ACK(us) | +| srtt | tcp_rtt | Gauge | us | Y | Smoothed Round Trip Time(us). | +| snd_ssthresh | tcp_rate | Gauge | | | Slow start threshold for congestion control. | +| rcv_ssthresh | tcp_rate | Gauge | | | Current receive window size. | +| snd_cwnd | tcp_windows | Gauge | | | Congestion Control Window Size. | +| advmss | tcp_rate | Gauge | | | Local MSS upper limit. | +| reordering | tcp_windows | Gauge | | | Segments to be reordered. | +| rcv_rtt | tcp_rtt | Gauge | us | | Receive end RTT (unidirectional measurement). | +| rcv_space | tcp_rate | Gauge | | | Current receive buffer size. | +| notsent_bytes | tcp_windows | Gauge | bytes | | Number of bytes not sent currently. | +| notack_bytes | tcp_windows | Gauge | bytes | | Number of bytes not ack currently. | +| snd_wnd | tcp_windows | Gauge | | | Size of TCP send window. | +| rcv_wnd | tcp_windows | Gauge | | | Size of TCP receive window. | +| delivery_rate | tcp_rate | Gauge | | | Current transmit rate (multiple different from the actual value). | +| busy_time | tcp_rate | Gauge | | | Time (jiffies) busy sending data. | +| rwnd_limited | tcp_rate | Gauge | | | Time (jiffies) limited by receive window. | +| sndbuf_limited | tcp_rate | Gauge | | | Time (jiffies) limited by send buffer. 
| +| pacing_rate | tcp_rate | Gauge | bytes per second | | TCP pacing rate, bytes per second | +| max_pacing_rate | tcp_rate | Gauge | bytes per second | | MAX TCP pacing rate, bytes per second | +| sk_err_que_size | tcp_sockbuf | Gauge | | | Size of error queue in sock. | +| sk_rcv_que_size | tcp_sockbuf | Gauge | | | Size of receive queue in sock. | +| sk_wri_que_size | tcp_sockbuf | Gauge | | | Size of write queue in sock. | +| syn_srtt | tcp_srtt | Gauge | us | Y | RTT of syn packet(us). | +| sk_backlog_size | tcp_sockbuf | Gauge | | | Size of backlog queue in sock. | +| sk_omem_size | tcp_sockbuf | Gauge | | | Size of omem in sock. | +| sk_forward_size | tcp_sockbuf | Gauge | | | Size of forward in sock. | +| sk_wmem_size | tcp_sockbuf | Gauge | | | Size of wmem in sock. | +| segs_in | tcp_tx_rx | Counter | segs | | total number of segments received | +| segs_out | tcp_tx_rx | Counter | segs | | total number of segments sent | +| retran_packets | tcp_abn | Gauge | | Y | total number of retrans | +| backlog_drops | tcp_abn | Gauge | | Y | drops caused by backlog queue full | +| sk_drops | tcp_abn | Counter | | Y | tcp drop counter | +| lost_out | tcp_abn | Gauge | | | tcp lost counter | +| sacked_out | tcp_abn | Gauge | | | tcp sacked out counter | +| filter_drops | tcp_abn | Gauge | | | drops caused by socket filter | +| tmout_count | tcp_abn | Gauge | | | counter of tcp link timeout | +| snd_buf_limit_count | tcp_abn | Gauge | | | counter of limits when allocate wmem | +| rmem_scheduls | tcp_abn | Gauge | | | rmem is not enough | +| tcp_oom | tcp_abn | Gauge | | | tcp out of memory | +| send_rsts | tcp_abn | Gauge | | | send_rsts | +| receive_rsts | tcp_abn | Gauge | | | receive_rsts | +| sk_err | tcp_abn | Gauge | | | sk_err | +| sk_err_soft | tcp_abn | Gauge | | | sk_err_soft | # ENDPOINT -| metrics_name | table_name | metrics_type | unit | metrics description | -| ------------------- | ---------- | ------------ | ----- | ------------------------------------------------ | -| tgid | | key | | 进程ID | -| s_addr | | key | | udp/tcp 本地地址 | -| s_port | | key | | listen port(只有listen对象存在该label) | -| ep_type | | key | | listen/connect/udp/bind | -| listendrop | listen | Gauge | | TCP accept丢弃次数(只有listen对象存在) | -| accept_overflow | listen | Gauge | | TCP accept队列溢出次数 | -| syn_overflow | listen | Gauge | | TCP syn队列溢出次数 | -| passive_open | listen | Gauge | | tcp被动发起的建链次数(只有listen对象存在) | -| passive_open_failed | listen | Gauge | | tcp被动发起的建链失败次数(只有listen对象存在) | -| retran_synacks | listen | Gauge | | tcp synack重传报文数 | -| active_open | connect | Gauge | | tcp主动发起的建链次数(只有connect对象存在) | -| active_open_failed | connect | Gauge | | tcp主动发起的建链失败次数(只有connect对象存在) | -| bind_rcv_drops | bind | Gauge | | UDP接收失败次数(udp/bind对象存在) | -| bind_sends | bind | Gauge | bytes | UDP发送长度(udp/bind对象存在) | -| bind_rcvs | bind | Gauge | bytes | UDP接收长度(udp/bind对象存在) | -| bind_err | bind | Gauge | | UDP接收失败错误码(udp/bind对象存在) | -| udp_rcv_drops | udp | Gauge | | UDP接收失败次数(udp/bind对象存在) | -| udp_sends | udp | Gauge | bytes | UDP发送长度(udp/bind对象存在) | -| udp_rcvs | udp | Gauge | bytes | UDP接收长度(udp/bind对象存在) | -| udp_err | udp | Gauge | | UDP接收失败错误码(udp/bind对象存在) | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| ------------------- | ---------- | ------------ | ----- | ---- | ------------------------------------------------ | +| tgid | | key | | | 进程ID | +| s_addr | | key | | | udp/tcp 本地地址 | +| s_port | | key | | | listen port(只有listen对象存在该label) | +| ep_type | | key | | | 
listen/connect/udp/bind | +| listendrop | listen | Gauge | | Y | TCP accept丢弃次数(只有listen对象存在) | +| accept_overflow | listen | Gauge | | Y | TCP accept队列溢出次数 | +| syn_overflow | listen | Gauge | | Y | TCP syn队列溢出次数 | +| passive_open | listen | Gauge | | Y | tcp被动发起的建链次数(只有listen对象存在) | +| passive_open_failed | listen | Gauge | | Y | tcp被动发起的建链失败次数(只有listen对象存在) | +| retran_synacks | listen | Gauge | | | tcp synack重传报文数 | +| active_open | connect | Gauge | | | tcp主动发起的建链次数(只有connect对象存在) | +| active_open_failed | connect | Gauge | | | tcp主动发起的建链失败次数(只有connect对象存在) | +| bind_rcv_drops | bind | Gauge | | Y | UDP接收失败次数(udp/bind对象存在) | +| bind_sends | bind | Gauge | bytes | Y | UDP发送长度(udp/bind对象存在) | +| bind_rcvs | bind | Gauge | bytes | Y | UDP接收长度(udp/bind对象存在) | +| bind_err | bind | Gauge | | | UDP接收失败错误码(udp/bind对象存在) | +| udp_rcv_drops | udp | Gauge | | Y | UDP接收失败次数(udp/bind对象存在) | +| udp_sends | udp | Gauge | bytes | Y | UDP发送长度(udp/bind对象存在) | +| udp_rcvs | udp | Gauge | bytes | Y | UDP接收长度(udp/bind对象存在) | +| udp_err | udp | Gauge | | | UDP接收失败错误码(udp/bind对象存在) | # QDISC -| metrics_name | table_name | metrics_type | unit | metrics description | -| ------------ | ---------- | ------------ | ---- | -------------------------- | -| dev_name | qdisc | key | | 网卡设备名 | -| handle | qdisc | key | | 设备句柄 | -| ifindex | qdisc | key | | Interface index of qidsc | -| kind | qdisc | label | | Kind of qidsc | -| netns | qdisc | label | | net namespace | -| qlen | qdisc | Gauge | | 队列长度 | -| backlog | qdisc | Gauge | | backlog队列长度 | -| drops | qdisc | Counter | | 丢包数量 | -| requeues | qdisc | Counter | | Requeues count egress | -| overlimits | qdisc | Counter | | 溢出数量 | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| ------------ | ---------- | ------------ | ---- | ---- | -------------------------- | +| dev_name | qdisc | key | | | 网卡设备名 | +| handle | qdisc | key | | | 设备句柄 | +| ifindex | qdisc | key | | | Interface index of qidsc | +| kind | qdisc | label | | | Kind of qidsc | +| netns | qdisc | label | | | net namespace | +| qlen | qdisc | Gauge | | | 队列长度 | +| backlog | qdisc | Gauge | | Y | backlog队列长度 | +| drops | qdisc | Counter | | Y | 丢包数量 | +| requeues | qdisc | Counter | | Y | Requeues count egress | +| overlimits | qdisc | Counter | | Y | 溢出数量 | # THREAD(entity_name:task) -| metrics_name | table_name | metrics_type | unit | metrics description | -| --------------- | ---------- | ------------ | ----- | ------------------------------------------------------------ | -| pid | thread | key | | 线程PID | -| tgid | thread | label | | 所属进程ID | -| comm | thread | label | | 线程所属进程名称 | -| off_cpu_ns | thread | Gauge | ns | task调度offcpu的最大时间,统计方式: 1. KPROBE finish_task_switch 获取入参prev task(pid)以及当前时间,当前CPU信息(bpf_get_smp_processor_id()),记录MAP(pid/cpu作为key); 2. finish_task_switch 中bpf_get_current_pid_tgid获取当前pid,以及当前CPU信息(bpf_get_smp_processor_id()),匹配步骤1中的数据以及计算时间差,得出一次offcpu时间。 注意: 1. 过滤idle(pid=0) 2. 
只记录offcpu最大值 | -| migration_count | thread | Gauge | | task CPU之间迁移次数 | -| iowait_us | thread | Gauge | us | task IO操作等待时间(单位us) | -| bio_bytes_write | thread | Gauge | bytes | task发起bio写操作字节数 | -| bio_bytes_read | thread | Gauge | bytes | task发起bio读操作字节数 | -| bio_err_count | thread | Gauge | | task发起的bio结果失败的次数 | -| hang_count | thread | Gauge | | task发生io hang次数 | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| --------------- | ---------- | ------------ | ----- | ---- | ------------------------------------------------------------ | +| pid | thread | key | | | 线程PID | +| tgid | thread | label | | | 所属进程ID | +| comm | thread | label | | | 线程所属进程名称 | +| off_cpu_ns | thread | Gauge | ns | Y | task调度offcpu的最大时间,统计方式: 1. KPROBE finish_task_switch 获取入参prev task(pid)以及当前时间,当前CPU信息(bpf_get_smp_processor_id()),记录MAP(pid/cpu作为key); 2. finish_task_switch 中bpf_get_current_pid_tgid获取当前pid,以及当前CPU信息(bpf_get_smp_processor_id()),匹配步骤1中的数据以及计算时间差,得出一次offcpu时间。 注意: 1. 过滤idle(pid=0) 2. 只记录offcpu最大值 | +| migration_count | thread | Gauge | | | task CPU之间迁移次数 | +| iowait_us | thread | Gauge | us | Y | task IO操作等待时间(单位us) | +| bio_bytes_write | thread | Gauge | bytes | Y | task发起bio写操作字节数 | +| bio_bytes_read | thread | Gauge | bytes | Y | task发起bio读操作字节数 | +| bio_err_count | thread | Gauge | | | task发起的bio结果失败的次数 | +| hang_count | thread | Gauge | | | task发生io hang次数 | # Process -| metrics_name | table_name | metrics_type | unit | metrics description | -| -------------------------- | ------------------ | ------------ | ---- | ------------------------------------------------------------ | -| tgid | | key | | 进程ID | -| ppid | system_proc | label | | 父进程ID | -| comm | system_proc | label | | 执行程序名称 | -| cmdline | system_proc | label | | 执行程序命令(包括配置) | -| container id | system_proc | label | | 进程归属的容器实例ID(简写) | -| proc_shared_dirty_size | system_proc | Gauge | | 进程共享属性的dirty page size | -| proc_shared_clean_size | system_proc | Gauge | | 进程共享属性的clean page size | -| proc_private_dirty_size | system_proc | Gauge | | 进程私有属性的dirty page size | -| proc_private_clean_size | system_proc | Gauge | | 进程私有属性的clean page size | -| proc_referenced_size | system_proc | Gauge | | 进程当前已引用的page size | -| proc_lazyfree_size | system_proc | Gauge | | 进程延迟释放内存的size | -| proc_swap_data_size | system_proc | Gauge | | 进程swap区间数据size | -| proc_swap_data_pss_size | system_proc | Gauge | | 进程物理内存swap区间数据size | -| fd_count | system_proc | Gauge | | 进程文件句柄 | -| fd_free_per | system_proc | Gauge | | 进程剩余FD资源占比% | -| proc_utime_jiffies | system_proc | Gauge | | 进程用户运行时间 | -| proc_stime_jiffies | system_proc | Gauge | | 进程系统态运行时间 | -| proc_minor pagefault_count | system_proc | Gauge | | 进程轻微pagefault次数(无需从磁盘拷贝) | -| proc_major pagefault_count | system_proc | Gauge | | 进程严重pagefault次数(需从磁盘拷贝) | -| proc_vm_size | system_proc | Gauge | | 进程当前虚拟地址空间大小 | -| proc_pm_size | system_proc | Gauge | | 进程当前物理地址空间大小 | -| proc_rchar_bytes | system_proc | Gauge | | 进程系统调用至FS的读字节数 | -| proc_wchar_bytes | system_proc | Gauge | | 进程系统调用至FS的写字节数 | -| proc_syscr_count | system_proc | Gauge | | 进程read()/pread()执行次数 | -| proc_syscw_count | system_proc | Gauge | | 进程write()/pwrite()执行次数 | -| proc_read_bytes | system_proc | Gauge | | 进程实际从磁盘读取的字节数 | -| proc_write_bytes | system_proc | Gauge | | 进程实际从磁盘写入的字节数 (page cache情况下,该字段进表示设置dirty page的size) | -| proc_cancelled_write_bytes | system_proc | Gauge | | 参考proc_write_bytes,因为存在page cache 如果write操作结束后,又发生文件被删除事件,会导致diry page并未写入磁盘,所以存在取消写的字节数统计 | -| ns_ext4_read | proc_ext4 | Gauge | 
ns | ext4文件系统读操作时间,单位ns | -| ns_ext4_write | proc_ext4 | Gauge | ns | ext4文件系统写操作时间,单位ns | -| ns_ext4_flush | proc_ext4 | Gauge | ns | ext4文件系统flush操作时间,单位ns | -| ns_ext4_open | proc_ext4 | Gauge | ns | ext4文件系统open操作时间,单位ns | -| ns_overlay_read | proc_overlay | Gauge | ns | overlayfs文件系统读操作时间,单位ns | -| ns_overlay_write | proc_overlay | Gauge | ns | overlayfs文件系统写操作时间,单位ns | -| ns_overlay_flush | proc_overlay | Gauge | ns | overlayfs文件系统flush操作时间,单位ns | -| ns_overlay_open | proc_overlay | Gauge | ns | overlayfs文件系统open操作时间,单位ns | -| ns_tmpfs_read | proc_tmpfs | Gauge | ns | tmpfs文件系统读操作时间,单位ns | -| ns_tmpfs_write | proc_tmpfs | Gauge | ns | tmpfs文件系统写操作时间,单位ns | -| ns_tmpfs_flush | proc_tmpfs | Gauge | ns | tmpfs文件系统flush操作时间,单位ns | -| reclaim_ns | proc_page | Gauge | ns | 进程触发的page回收时间(执行SWAP操作),单位ns | -| access_pagecache | proc_page | Gauge | | 进程触发的页面访问次数 | -| mark_buffer_dirty | proc_page | Gauge | | 进程触发的 page buffer置脏次数 | -| load_page_cache | proc_page | Gauge | | 进程触发的 page 加入page cache次数 | -| mark_page_dirty | proc_page | Gauge | | 进程触发的 page 置脏次数 | -| ns_gethostname | proc_dns | Gauge | ns | 进程获取DNS域名对应的地址,单位ns | -| gethostname_failed | proc_dns | Gauge | | 进程获取DNS域名失败次数 | -| ns_mount | proc_syscall_io | Gauge | ns | 进程系统调用mount时长,单位ns | -| ns_umount | proc_syscall_io | Gauge | ns | 进程系统调用umount时长,单位ns | -| ns_read | proc_syscall_io | Gauge | ns | 进程系统调用read时长,单位ns | -| ns_write | proc_syscall_io | Gauge | ns | 进程系统调用write时长,单位ns | -| ns_sendmsg | proc_syscall_net | Gauge | ns | 进程系统调用sendmsg时长,单位ns | -| ns_recvmsg | proc_syscall_net | Gauge | ns | 进程系统调用recvmsg时长,单位ns | -| ns_sched_yield | proc_syscall_sched | Gauge | ns | 进程系统调用sched_yield时长,单位ns | -| ns_futex | proc_syscall_sched | Gauge | ns | 进程系统调用futex时长,单位ns | -| ns_epoll_wait | proc_syscall_sched | Gauge | ns | 进程系统调用epoll_wait时长,单位ns | -| ns_epoll_pwait | proc_syscall_sched | Gauge | ns | 进程系统调用epoll_pwait时长,单位ns | -| ns_fork | proc_syscall_fork | Gauge | ns | 进程系统调用fork时长,单位ns | -| ns_vfork | proc_syscall_fork | Gauge | ns | 进程系统调用vfork时长,单位ns | -| ns_clone | proc_syscall_fork | Gauge | ns | 进程系统调用clone时长,单位ns | -| syscall_failed | proc_syscall | Gauge | | 进程系统调用失败次数 | -| | | | | | -| | | | | | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| -------------------------- | ------------------ | ------------ | ---- | ---- | ------------------------------------------------------------ | +| tgid | | key | | | 进程ID | +| ppid | system_proc | label | | | 父进程ID | +| comm | system_proc | label | | | 执行程序名称 | +| cmdline | system_proc | label | | | 执行程序命令(包括配置) | +| container id | system_proc | label | | | 进程归属的容器实例ID(简写) | +| proc_shared_dirty_size | system_proc | Gauge | | | 进程共享属性的dirty page size | +| proc_shared_clean_size | system_proc | Gauge | | | 进程共享属性的clean page size | +| proc_private_dirty_size | system_proc | Gauge | | | 进程私有属性的dirty page size | +| proc_private_clean_size | system_proc | Gauge | | | 进程私有属性的clean page size | +| proc_referenced_size | system_proc | Gauge | | | 进程当前已引用的page size | +| proc_lazyfree_size | system_proc | Gauge | | | 进程延迟释放内存的size | +| proc_swap_data_size | system_proc | Gauge | | | 进程swap区间数据size | +| proc_swap_data_pss_size | system_proc | Gauge | | | 进程物理内存swap区间数据size | +| fd_count | system_proc | Gauge | | Y | 进程文件句柄 | +| fd_free_per | system_proc | Gauge | | | 进程剩余FD资源占比% | +| proc_utime_jiffies | system_proc | Gauge | | Y | 进程用户运行时间 | +| proc_stime_jiffies | system_proc | Gauge | | Y | 进程系统态运行时间 | +| proc_minor pagefault_count | system_proc | Gauge | | 
| 进程轻微pagefault次数(无需从磁盘拷贝) | +| proc_major pagefault_count | system_proc | Gauge | | | 进程严重pagefault次数(需从磁盘拷贝) | +| proc_vm_size | system_proc | Gauge | | Y | 进程当前虚拟地址空间大小 | +| proc_pm_size | system_proc | Gauge | | Y | 进程当前物理地址空间大小 | +| proc_rchar_bytes | system_proc | Gauge | | | 进程系统调用至FS的读字节数 | +| proc_wchar_bytes | system_proc | Gauge | | | 进程系统调用至FS的写字节数 | +| proc_syscr_count | system_proc | Gauge | | | 进程read()/pread()执行次数 | +| proc_syscw_count | system_proc | Gauge | | | 进程write()/pwrite()执行次数 | +| proc_read_bytes | system_proc | Gauge | | | 进程实际从磁盘读取的字节数 | +| proc_write_bytes | system_proc | Gauge | | | 进程实际从磁盘写入的字节数 (page cache情况下,该字段进表示设置dirty page的size) | +| proc_cancelled_write_bytes | system_proc | Gauge | | | 参考proc_write_bytes,因为存在page cache 如果write操作结束后,又发生文件被删除事件,会导致diry page并未写入磁盘,所以存在取消写的字节数统计 | +| ns_ext4_read | proc_ext4 | Gauge | ns | | ext4文件系统读操作时间,单位ns | +| ns_ext4_write | proc_ext4 | Gauge | ns | | ext4文件系统写操作时间,单位ns | +| ns_ext4_flush | proc_ext4 | Gauge | ns | | ext4文件系统flush操作时间,单位ns | +| ns_ext4_open | proc_ext4 | Gauge | ns | | ext4文件系统open操作时间,单位ns | +| ns_overlay_read | proc_overlay | Gauge | ns | | overlayfs文件系统读操作时间,单位ns | +| ns_overlay_write | proc_overlay | Gauge | ns | | overlayfs文件系统写操作时间,单位ns | +| ns_overlay_flush | proc_overlay | Gauge | ns | | overlayfs文件系统flush操作时间,单位ns | +| ns_overlay_open | proc_overlay | Gauge | ns | | overlayfs文件系统open操作时间,单位ns | +| ns_tmpfs_read | proc_tmpfs | Gauge | ns | | tmpfs文件系统读操作时间,单位ns | +| ns_tmpfs_write | proc_tmpfs | Gauge | ns | | tmpfs文件系统写操作时间,单位ns | +| ns_tmpfs_flush | proc_tmpfs | Gauge | ns | | tmpfs文件系统flush操作时间,单位ns | +| reclaim_ns | proc_page | Gauge | ns | | 进程触发的page回收时间(执行SWAP操作),单位ns | +| access_pagecache | proc_page | Gauge | | | 进程触发的页面访问次数 | +| mark_buffer_dirty | proc_page | Gauge | | | 进程触发的 page buffer置脏次数 | +| load_page_cache | proc_page | Gauge | | | 进程触发的 page 加入page cache次数 | +| mark_page_dirty | proc_page | Gauge | | | 进程触发的 page 置脏次数 | +| ns_gethostname | proc_dns | Gauge | ns | | 进程获取DNS域名对应的地址,单位ns | +| gethostname_failed | proc_dns | Gauge | | | 进程获取DNS域名失败次数 | +| ns_mount | proc_syscall_io | Gauge | ns | | 进程系统调用mount时长,单位ns | +| ns_umount | proc_syscall_io | Gauge | ns | | 进程系统调用umount时长,单位ns | +| ns_read | proc_syscall_io | Gauge | ns | | 进程系统调用read时长,单位ns | +| ns_write | proc_syscall_io | Gauge | ns | | 进程系统调用write时长,单位ns | +| ns_sendmsg | proc_syscall_net | Gauge | ns | | 进程系统调用sendmsg时长,单位ns | +| ns_recvmsg | proc_syscall_net | Gauge | ns | | 进程系统调用recvmsg时长,单位ns | +| ns_sched_yield | proc_syscall_sched | Gauge | ns | | 进程系统调用sched_yield时长,单位ns | +| ns_futex | proc_syscall_sched | Gauge | ns | | 进程系统调用futex时长,单位ns | +| ns_epoll_wait | proc_syscall_sched | Gauge | ns | | 进程系统调用epoll_wait时长,单位ns | +| ns_epoll_pwait | proc_syscall_sched | Gauge | ns | | 进程系统调用epoll_pwait时长,单位ns | +| ns_fork | proc_syscall_fork | Gauge | ns | | 进程系统调用fork时长,单位ns | +| ns_vfork | proc_syscall_fork | Gauge | ns | | 进程系统调用vfork时长,单位ns | +| ns_clone | proc_syscall_fork | Gauge | ns | | 进程系统调用clone时长,单位ns | +| syscall_failed | proc_syscall | Gauge | | Y | 进程系统调用失败次数 | +| | | | | | | +| | | | | | | # BLOCK -| metrics_name | table_name | metrics_type | unit | metrics description | -| ----------------------- | ---------- | ------------ | ---- | ------------------------------- | -| major | block | key | | 块对象编号 | -| first_minor | block | key | | 块对象编号 | -| blk_type | block | label | | 块对象类型(比如disk, part) | -| blk_name | block | label | | 块对象名称 | -| disk_name | block | label | | 所属磁盘名称 | -| latency_req_max | 
block | Gauge | us | block层request时延最大值 | -| latency_req_last | block | Gauge | us | block层request时延最近值 | -| latency_req_sum | block | Gauge | us | block层request时延总计值 | -| latency_req_jitter | block | Gauge | us | block层request时延抖动 | -| count_latency_req | block | Gauge | | block层request操作次数 | -| latency_flush_max | block | Gauge | us | block层flush时延最大值 | -| latency_flush_last | block | Gauge | us | block层flush时延最近值 | -| latency_flush_sum | block | Gauge | us | block层flush时延总计值 | -| latency_flush_jitter | block | Gauge | us | block层flush时延抖动 | -| count_latency_flush | block | Gauge | | block层flush操作次数 | -| latency_driver_max | block | Gauge | us | 驱动层时延最大值 | -| latency_driver_last | block | Gauge | us | 驱动层时延最近值 | -| latency_driver_sum | block | Gauge | us | 驱动层时延最总计值 | -| latency_driver_jitter | block | Gauge | us | 驱动层时延抖动 | -| count_latency_driver | block | Gauge | | 驱动层操作次数 | -| latency_device_max | block | Gauge | us | 设备层时延最大值 | -| latency_device_last | block | Gauge | us | 设备层时延最近值 | -| latency_device_sum | block | Gauge | us | 设备层时延最总计值 | -| latency_device_jitter | block | Gauge | us | 设备层时延抖动 | -| count_latency_device | block | Gauge | | 设备层操作次数 | -| count_iscsi_tmout | block | Gauge | | iscsi层操作超时次数 | -| count_iscsi_err | block | Gauge | | iscsi层操作失败次数 | -| conn_err_bad_opcode | block | Gauge | | iscsi tp层错误操作码次数 | -| conn_err_xmit_failed | block | Gauge | | iscsi tp层发送失败次数 | -| conn_err_tmout | block | Gauge | | iscsi tp层超时次数 | -| conn_err_connect_failed | block | Gauge | | iscsi tp层建链失败次数 | -| count_sas_abort | block | Gauge | | iscsi sas层异常次数 | -| access_pagecache | block | Gauge | | Block页面访问次数 | -| mark_buffer_dirty | block | Gauge | | Block page buffer置脏次数 | -| load_page_cache | block | Gauge | | Block page 加入page cache次数 | -| mark_page_dirty | block | Gauge | | Block page 置脏次数 | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| ----------------------- | ---------- | ------------ | ---- | ---- | ------------------------------- | +| major | block | key | | | 块对象编号 | +| first_minor | block | key | | | 块对象编号 | +| blk_type | block | label | | | 块对象类型(比如disk, part) | +| blk_name | block | label | | | 块对象名称 | +| disk_name | block | label | | | 所属磁盘名称 | +| latency_req_max | block | Gauge | us | Y | block层request时延最大值 | +| latency_req_last | block | Gauge | us | | block层request时延最近值 | +| latency_req_sum | block | Gauge | us | | block层request时延总计值 | +| latency_req_jitter | block | Gauge | us | | block层request时延抖动 | +| count_latency_req | block | Gauge | | | block层request操作次数 | +| latency_flush_max | block | Gauge | us | Y | block层flush时延最大值 | +| latency_flush_last | block | Gauge | us | | block层flush时延最近值 | +| latency_flush_sum | block | Gauge | us | | block层flush时延总计值 | +| latency_flush_jitter | block | Gauge | us | | block层flush时延抖动 | +| count_latency_flush | block | Gauge | | | block层flush操作次数 | +| latency_driver_max | block | Gauge | us | | 驱动层时延最大值 | +| latency_driver_last | block | Gauge | us | | 驱动层时延最近值 | +| latency_driver_sum | block | Gauge | us | | 驱动层时延最总计值 | +| latency_driver_jitter | block | Gauge | us | | 驱动层时延抖动 | +| count_latency_driver | block | Gauge | | | 驱动层操作次数 | +| latency_device_max | block | Gauge | us | | 设备层时延最大值 | +| latency_device_last | block | Gauge | us | | 设备层时延最近值 | +| latency_device_sum | block | Gauge | us | | 设备层时延最总计值 | +| latency_device_jitter | block | Gauge | us | | 设备层时延抖动 | +| count_latency_device | block | Gauge | | | 设备层操作次数 | +| count_iscsi_tmout | block | Gauge | | | iscsi层操作超时次数 | +| count_iscsi_err | 
block | Gauge | | Y | iscsi层操作失败次数 | +| conn_err_bad_opcode | block | Gauge | | | iscsi tp层错误操作码次数 | +| conn_err_xmit_failed | block | Gauge | | | iscsi tp层发送失败次数 | +| conn_err_tmout | block | Gauge | | | iscsi tp层超时次数 | +| conn_err_connect_failed | block | Gauge | | | iscsi tp层建链失败次数 | +| count_sas_abort | block | Gauge | | | iscsi sas层异常次数 | +| access_pagecache | block | Gauge | | | Block页面访问次数 | +| mark_buffer_dirty | block | Gauge | | | Block page buffer置脏次数 | +| load_page_cache | block | Gauge | | | Block page 加入page cache次数 | +| mark_page_dirty | block | Gauge | | | Block page 置脏次数 | # Container -| metrics_name | table_name | metrics_type | unit | metrics description | -| -------------------------------------- | ----------------- | ------------ | ------- | ------------------------------------------------------------ | -| container_id | container | key | | 容器ID(简写) | -| name | container | label | | 容器名称 | -| cpucg_inode | container | label | | cpu,cpuacct cgroup ID(容器实例内cgroup目录对应的inode id) | -| memcg_inode | container | label | | memory cgroup ID(容器实例内cgroup目录对应的inode id) | -| pidcg_inode | container | label | | pids cgroup ID(容器实例内cgroup目录对应的inode id) | -| mnt_ns_id | container | label | | mount namespace | -| net_ns_id | container | label | | net namespace | -| proc_id | container | label | | 容器主进程ID | -| blkio_device_usage_total | container_blkio | Gauge | bytes | Blkio device bytes usage, unit bytes | -| cpu_load_average_10s | container_cpu | Gauge | | Value of container cpu load average over the last 10 seconds | -| cpu_system_seconds_total | container_cpu | Gauge | seconds | Cumulative system cpu time consumed, unit second | -| cpu_usage_seconds_total | container_cpu | Gauge | seconds | Cumulative cpu time consumed, unit second | -| cpu_user_seconds_total | container_cpu | Gauge | seconds | Cumulative user cpu time consumed, unit second | -| fs_inodes_free | container_fs | Gauge | | Number of available Inodes | -| fs_inodes_total | container_fs | Gauge | | Total number of Inodes | -| fs_io_current | container_fs | Gauge | | Number of I/Os currently in progress | -| fs_io_time_seconds_total | container_fs | Gauge | seconds | Cumulative count of seconds spent doing I/Os, unit second | -| fs_io_time_weighted_seconds_total | container_fs | Gauge | seconds | Cumulative weighted I/O time, unit second | -| fs_limit_bytes | container_fs | Gauge | bytes | Number of bytes that can be consumed by the container on this filesystem, unit bytes | -| fs_read_seconds_total | container_fs | Gauge | bytes | Cumulative count of bytes read, unit bytes | -| fs_reads_bytes_total | container_fs | Gauge | bytes | Cumulative count of bytes read | -| fs_reads_merged_total | container_fs | Gauge | | Cumulative count of reads merged | -| fs_reads_total | container_fs | Gauge | | Cumulative count of reads completed | -| fs_sector_reads_total | container_fs | Gauge | | Cumulative count of sector reads completed | -| fs_sector_writes_total | container_fs | Gauge | | Cumulative count of sector writes completed | -| fs_usage_bytes | container_fs | Gauge | bytes | Number of bytes that are consumed by the container on this filesystem | -| fs_write_seconds_total | container_fs | Gauge | seconds | Cumulative count of seconds spent writing | -| fs_writes_bytes_total | container_fs | Gauge | bytes | Cumulative count of bytes written | -| fs_writes_merged_total | container_fs | Gauge | | Cumulative count of writes merged | -| fs_writes_total | container_fs | Gauge | | Cumulative count of writes completed | -| 
memory_cache | container_memory | Gauge | bytes | Total page cache memory | -| memory_failcnt | container_memory | Gauge | | Number of memory usage hits limits | -| memory_failures_total | container_memory | Gauge | | Cumulative count of memory allocation failures | -| memory_mapped_file | container_memory | Gauge | bytes | Size of memory mapped files | -| memory_max_usage_bytes | container_memory | Gauge | bytes | Maximum memory usage recorded | -| memory_rss | container_memory | Gauge | bytes | Size of RSS | -| memory_swap | container_memory | Gauge | bytes | Container swap usage | -| memory_usage_bytes | container_memory | Gauge | bytes | Current memory usage, including all memory regardless of when it was accessed | -| memory_working_set_bytes | container_memory | Gauge | bytes | Current working set | -| network_receive_bytes_total | container_network | Gauge | bytes | Cumulative count of bytes received | -| network_receive_errors_total | container_network | Gauge | | Cumulative count of errors encountered while receiving | -| network_receive_packets_dropped_total | container_network | Gauge | | Cumulative count of packets dropped while receiving | -| network_receive_packets_total | container_network | Gauge | | Cumulative count of packets received | -| network_transmit_bytes_total | container_network | Gauge | bytes | Cumulative count of bytes transmitted | -| network_transmit_errors_total | container_network | Gauge | | Cumulative count of errors encountered while transmitting | -| network_transmit_packets_dropped_total | container_network | Gauge | | Cumulative count of packets dropped while transmitting | -| network_transmit_packets_total | container_network | Gauge | | Cumulative count of packets transmitted | -| oom_events_total | container_oom | Gauge | | Count of out of memory events observed for the container | -| spec_cpu_period | container_spec | Gauge | | CPU period of the container | -| spec_cpu_shares | container_spec | Gauge | | CPU share of the container | -| spec_memory_limit_bytes | container_spec | Gauge | bytes | Memory limit for the container | -| spec_memory_reservation_limit_bytes | container_spec | Gauge | bytes | Memory reservation limit for the container | -| spec_memory_swap_limit_bytes | container_spec | Gauge | bytes | Memory swap limit for the container | -| start_time_seconds | container_start | Gauge | seconds | Start time of the container since unix epoch | -| tasks_state | container_tasks | Gauge | | Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting) | -| | | | | | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| -------------------------------------- | ----------------- | ------------ | ------- | ---- | ------------------------------------------------------------ | +| container_id | container | key | | | 容器ID(简写) | +| name | container | label | | | 容器名称 | +| cpucg_inode | container | label | | | cpu,cpuacct cgroup ID(容器实例内cgroup目录对应的inode id) | +| memcg_inode | container | label | | | memory cgroup ID(容器实例内cgroup目录对应的inode id) | +| pidcg_inode | container | label | | | pids cgroup ID(容器实例内cgroup目录对应的inode id) | +| mnt_ns_id | container | label | | | mount namespace | +| net_ns_id | container | label | | | net namespace | +| proc_id | container | label | | | 容器主进程ID | +| blkio_device_usage_total | container_blkio | Gauge | bytes | | Blkio device bytes usage, unit bytes | +| cpu_load_average_10s | container_cpu | Gauge | | | Value of container cpu load average over the last 10 
seconds | +| cpu_system_seconds_total | container_cpu | Gauge | seconds | Y | Cumulative system cpu time consumed, unit second | +| cpu_usage_seconds_total | container_cpu | Gauge | seconds | Y | Cumulative cpu time consumed, unit second | +| cpu_user_seconds_total | container_cpu | Gauge | seconds | | Cumulative user cpu time consumed, unit second | +| fs_inodes_free | container_fs | Gauge | | | Number of available Inodes | +| fs_inodes_total | container_fs | Gauge | | | Total number of Inodes | +| fs_io_current | container_fs | Gauge | | | Number of I/Os currently in progress | +| fs_io_time_seconds_total | container_fs | Gauge | seconds | | Cumulative count of seconds spent doing I/Os, unit second | +| fs_io_time_weighted_seconds_total | container_fs | Gauge | seconds | | Cumulative weighted I/O time, unit second | +| fs_limit_bytes | container_fs | Gauge | bytes | | Number of bytes that can be consumed by the container on this filesystem, unit bytes | +| fs_read_seconds_total | container_fs | Gauge | bytes | | Cumulative count of bytes read, unit bytes | +| fs_reads_bytes_total | container_fs | Gauge | bytes | | Cumulative count of bytes read | +| fs_reads_merged_total | container_fs | Gauge | | | Cumulative count of reads merged | +| fs_reads_total | container_fs | Gauge | | | Cumulative count of reads completed | +| fs_sector_reads_total | container_fs | Gauge | | | Cumulative count of sector reads completed | +| fs_sector_writes_total | container_fs | Gauge | | | Cumulative count of sector writes completed | +| fs_usage_bytes | container_fs | Gauge | bytes | | Number of bytes that are consumed by the container on this filesystem | +| fs_write_seconds_total | container_fs | Gauge | seconds | | Cumulative count of seconds spent writing | +| fs_writes_bytes_total | container_fs | Gauge | bytes | | Cumulative count of bytes written | +| fs_writes_merged_total | container_fs | Gauge | | | Cumulative count of writes merged | +| fs_writes_total | container_fs | Gauge | | | Cumulative count of writes completed | +| memory_cache | container_memory | Gauge | bytes | Y | Total page cache memory | +| memory_failcnt | container_memory | Gauge | | | Number of memory usage hits limits | +| memory_failures_total | container_memory | Gauge | | | Cumulative count of memory allocation failures | +| memory_mapped_file | container_memory | Gauge | bytes | | Size of memory mapped files | +| memory_max_usage_bytes | container_memory | Gauge | bytes | | Maximum memory usage recorded | +| memory_rss | container_memory | Gauge | bytes | Y | Size of RSS | +| memory_swap | container_memory | Gauge | bytes | Y | Container swap usage | +| memory_usage_bytes | container_memory | Gauge | bytes | Y | Current memory usage, including all memory regardless of when it was accessed | +| memory_working_set_bytes | container_memory | Gauge | bytes | | Current working set | +| network_receive_bytes_total | container_network | Gauge | bytes | | Cumulative count of bytes received | +| network_receive_errors_total | container_network | Gauge | | | Cumulative count of errors encountered while receiving | +| network_receive_packets_dropped_total | container_network | Gauge | | | Cumulative count of packets dropped while receiving | +| network_receive_packets_total | container_network | Gauge | | Y | Cumulative count of packets received | +| network_transmit_bytes_total | container_network | Gauge | bytes | Y | Cumulative count of bytes transmitted | +| network_transmit_errors_total | container_network | Gauge | | | Cumulative 
count of errors encountered while transmitting | +| network_transmit_packets_dropped_total | container_network | Gauge | | | Cumulative count of packets dropped while transmitting | +| network_transmit_packets_total | container_network | Gauge | | | Cumulative count of packets transmitted | +| oom_events_total | container_oom | Gauge | | | Count of out of memory events observed for the container | +| spec_cpu_period | container_spec | Gauge | | | CPU period of the container | +| spec_cpu_shares | container_spec | Gauge | | | CPU share of the container | +| spec_memory_limit_bytes | container_spec | Gauge | bytes | | Memory limit for the container | +| spec_memory_reservation_limit_bytes | container_spec | Gauge | bytes | | Memory reservation limit for the container | +| spec_memory_swap_limit_bytes | container_spec | Gauge | bytes | | Memory swap limit for the container | +| start_time_seconds | container_start | Gauge | seconds | | Start time of the container since unix epoch | +| tasks_state | container_tasks | Gauge | | | Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting) | +| | | | | | | # DISK -| metrics_name | table_name | metrics_type | unit | metrics description | -| ------------ | ------------- | ------------ | --------------------- | --------------------------------------- | -| disk_name | system_iostat | key | | blk所在的物理磁盘名称 | -| rspeed | system_iostat | gauge | read bytes/second | 读速率(IOPS) | -| rspeed_kB | system_iostat | gauge | read kbytes/second | 吞吐量 | -| r_await | system_iostat | gauge | ms | 读响应时间 | -| rareq | system_iostat | gauge | | 饱和度(rareq-sz 和 wareq-sz+响应时间) | -| wspeed | system_iostat | gauge | write bytes/second | 写速率(IOPS) | -| wspeed_kB | system_iostat | gauge | write kbytes/second | 吞吐量 | -| w_await | system_iostat | gauge | ms | 写响应时间 | -| wareq | system_iostat | gauge | | 饱和度(rareq-sz 和 wareq-sz+响应时间) | -| util | system_iostat | gauge | % | 磁盘使用率 | +| metrics_name | table_name | metrics_type | unit | KPI | metrics description | +| ------------ | ------------- | ------------ | --------------------- | ---- | --------------------------------------- | +| disk_name | system_iostat | key | | | blk所在的物理磁盘名称 | +| rspeed | system_iostat | gauge | read bytes/second | Y | 读速率(IOPS) | +| rspeed_kB | system_iostat | gauge | read kbytes/second | Y | 吞吐量 | +| r_await | system_iostat | gauge | ms | Y | 读响应时间 | +| rareq | system_iostat | gauge | | Y | 饱和度(rareq-sz 和 wareq-sz+响应时间) | +| wspeed | system_iostat | gauge | write bytes/second | Y | 写速率(IOPS) | +| wspeed_kB | system_iostat | gauge | write kbytes/second | Y | 吞吐量 | +| w_await | system_iostat | gauge | ms | Y | 写响应时间 | +| wareq | system_iostat | gauge | | | 饱和度(rareq-sz 和 wareq-sz+响应时间) | +| util | system_iostat | gauge | % | Y | 磁盘使用率 | diff --git a/host_anomaly_detection.png b/host_anomaly_detection.png new file mode 100644 index 0000000000000000000000000000000000000000..71bed752d9c1569a879ca990679ded19117b23ef Binary files /dev/null and b/host_anomaly_detection.png differ diff --git a/host_horizontal_topology.png b/host_horizontal_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..6cca55c3d8b30372be17942d23ff8378960c4a7a Binary files /dev/null and b/host_horizontal_topology.png differ diff --git a/host_hrchitecture.png b/host_hrchitecture.png deleted file mode 100644 index b9ddfcc9c4f3f121b1a1c7b1c973d2e8f9512cf7..0000000000000000000000000000000000000000 Binary files a/host_hrchitecture.png and /dev/null differ diff --git 
a/host_vertical_topology.png b/host_vertical_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..e70a31bd5918e7f925886ea00a641f78940b3d60 Binary files /dev/null and b/host_vertical_topology.png differ diff --git a/openGauss.png b/openGauss.png index 12f144b481e9ebc25402b459f058e41419eaf2c5..dd977f175d6b216dd4037c838d4f75e0133ed823 100644 Binary files a/openGauss.png and b/openGauss.png differ diff --git a/proc_anomaly_detection.png b/proc_anomaly_detection.png new file mode 100644 index 0000000000000000000000000000000000000000..7fa7ed67c533099d3fed4bebea846a0b55d7ca4b Binary files /dev/null and b/proc_anomaly_detection.png differ diff --git a/proc_horizontal_change.png b/proc_horizontal_change.png new file mode 100644 index 0000000000000000000000000000000000000000..91f5d2ffce6ec8da3b7665e5f6914872084b4d0c Binary files /dev/null and b/proc_horizontal_change.png differ diff --git a/proc_horizontal_topology.png b/proc_horizontal_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..1535cf228f72de309ee4d7a501c67ae029dec6a6 Binary files /dev/null and b/proc_horizontal_topology.png differ diff --git a/proc_vertical_topology.png b/proc_vertical_topology.png new file mode 100644 index 0000000000000000000000000000000000000000..08a63a380bb3e0af3f418f6fb07f5d91d227c79f Binary files /dev/null and b/proc_vertical_topology.png differ diff --git a/vertical_topology.png b/vertical_topology.png deleted file mode 100644 index e59a244d8e18df3a8d141bb3315530a125bfebc8..0000000000000000000000000000000000000000 Binary files a/vertical_topology.png and /dev/null differ
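
The README changes above state that anomaly events are published on a kafka topic as JSON (the `//` comments in the samples are explanatory annotations, not part of the real payload) and that the same observation entity may re-report an identical anomaly under a different event ID, so consumers should deduplicate by **entity_id + metrics**. Below is a minimal consumer sketch under those rules; the topic name `gala_gopher_event`, the broker address, and the 5-minute dedup window are placeholders not defined in this document.

```python
# Minimal sketch of a consumer for the anomaly-event topic described above.
# Assumed placeholders: topic name, broker address, dedup window length.
import json
import time

from kafka import KafkaConsumer  # pip install kafka-python

DEDUP_WINDOW_SEC = 300  # suppress repeats of the same (entity_id, metrics) pair

consumer = KafkaConsumer(
    "gala_gopher_event",                   # hypothetical topic name
    bootstrap_servers=["localhost:9092"],  # hypothetical broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

last_seen = {}  # (entity_id, metrics) -> last time this pair was reported

for msg in consumer:
    event = msg.value
    key = (event["Attributes"]["entity_id"], event["Resource"]["metrics"])
    now = time.time()
    # Deduplicate by entity_id + metrics: the same entity may re-report the
    # same anomaly with a different event_id within a short period.
    if now - last_seen.get(key, 0) < DEDUP_WINDOW_SEC:
        continue
    last_seen[key] = now
    # Feed the event into the table view: time, event ID, entity ID, metric, description.
    print(event["SeverityText"], event["Attributes"]["event_id"],
          event["Resource"]["metrics"], event["Body"])
```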
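
The anomaly-handling notes above also require that selecting an anomalous entity opens a metrics view positioned on the time range where the abnormal data occurred. The sketch below shows one way to fetch that window from Prometheus via its standard `query_range` HTTP API; the Prometheus address is an assumed placeholder, while the metric name and the nanosecond `Timestamp` come from the sample event in the README changes.

```python
# Minimal sketch: pull the metric named in an anomaly event from Prometheus
# around the event time, to back the per-entity metrics view described above.
import requests

PROM_URL = "http://localhost:9090"  # hypothetical Prometheus address


def query_metric_window(metric, event_ts_ns, before_s=600, after_s=600, step="15s"):
    """Query a window around the anomaly timestamp via /api/v1/query_range."""
    ts = event_ts_ns / 1e9  # the event Timestamp field is in nanoseconds
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": metric,
            "start": ts - before_s,
            "end": ts + after_s,
            "step": step,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Metric name and timestamp taken from the tcp_link sample event above.
series = query_metric_window("gala_gopher_tcp_link_backlog_drops", 1586960586000000000)
for s in series:
    print(s["metric"], len(s["values"]), "samples")
```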