From 5063b31d18988f7f93f3e20082adaa8490820db5 Mon Sep 17 00:00:00 2001
From: luzhihao 00478106
Date: Thu, 8 Sep 2022 21:53:37 +0800
Subject: [PATCH] update docs

---
 README.md              | 126 +++++++++++++++--
 gopher_tech.md         | 297 +++++++++++++++++++++++++++++++++++++++++
 system_integration.png | Bin 0 -> 6409 bytes
 3 files changed, 413 insertions(+), 10 deletions(-)
 create mode 100644 gopher_tech.md
 create mode 100644 system_integration.png

diff --git a/README.md b/README.md
index e5072ac..c2e0962 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@

## Architecture

-gala-ops consists of 4 components (gala-gopher, gala-spider, gala-anteater, gala-inference) and supports both cluster deployment and single-node deployment.
+gala-ops consists of four components (gala-gopher, gala-spider, gala-anteater, gala-inference) and supports both cluster deployment and single-node deployment.

In cluster mode all components must be deployed; in single-node mode it is enough to deploy only gala-gopher.

![](./csp_arch.png)

@@ -49,41 +49,106 @@

## gala-ops System Integration

In cluster deployment mode the four gala-ops components work together and depend mainly on middleware such as kafka and arangodb. The figure below shows the integration relationships. gala-gopher is usually deployed in the production environment, while the other components (together with middleware such as kafka, prometheus and arangodb) are deployed on the management plane; users must ensure that the components and middleware within the management plane can reach each other.

gala-gopher itself may not be directly reachable from the management plane; for this case gala-gopher also provides several [integration methods](https://gitee.com/openeuler/gala-docs/blob/master/README.md "integration methods").

![](./system_integration.png)

## gala-gopher

### Role

-As the data collector of the gala-ops system, it provides application-granularity low-level data collection, covering system metrics for networking, disk I/O, scheduling, memory, security and so on, and is also responsible for collecting application KPI data.
+- **Data collector**: provides application-granularity low-level data collection, covering system metrics for networking, disk I/O, scheduling, memory, security and so on, and is also responsible for collecting application KPI data.
+- **System anomaly detection**: detects system anomalies across networking, disk I/O, scheduling and memory scenarios; users can set the upper and lower anomaly bounds via thresholds.
+- **Performance hotspot analysis**: provides on-cpu and off-cpu flame graphs.

### Principles and Terminology

The gala-gopher software architecture is described [here](https://gitee.com/openeuler/gala-gopher/tree/master#%E8%BF%90%E8%A1%8C%E6%9E%B6%E6%9E%84). It is a low-overhead probe framework based on eBPF; besides the data it collects itself, users are free to extend it with third-party probes.

**Terminology**

- **Probe**: a program inside gala-gopher that performs a specific data-collection task. There are two kinds, native and extend: the former runs its collection task as a thread, the latter as a child process. gala-gopher can start some or all probes through configuration changes.
- **Observation entity (entity_name)**: defines an observed object in the system; all data collected by probes is attributed to a specific observation entity. Each observation entity consists of keys, labels and metrics. For example, the keys of the tcp_link entity include the process ID, the IP five-tuple and the protocol family, while its metrics include runtime indicators such as tx, rx and rtt.
- **Data table (table_name)**: an observation entity is composed of one or more data tables. One data table is usually produced by one collection task, so a single observation entity may be populated by several collection tasks.
- **meta file**: defines an observation entity (including its data tables) in a file. meta files must be unique within the system and their definitions must not conflict. See the specification [here](https://gitee.com/openeuler/gala-gopher/blob/master/doc/how_to_add_probe.md#122-%E5%AE%9A%E4%B9%89%E6%8E%A2%E9%92%88%E7%9A%84meta%E6%96%87%E4%BB%B6).

### Supported Technologies

+See here

-### Installation
+### Installation and Usage

+See [here](https://gitee.com/openeuler/gala-gopher#%E5%BF%AB%E9%80%9F%E5%BC%80%E5%A7%8B)

### Integration Methods

- **metrics integration**

  **prometheus exporter**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set the metrics output mode to web reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#webserver%E9%85%8D%E7%BD%AE); gala-gopher then works as a prometheus exporter, passively answering GET requests for metrics data. A minimal scrape example follows this list.

  **kafka client**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#metric), set the metrics output mode to kafka reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports metrics periodically. Users must transfer the metrics data into prometheus themselves.

- **event integration**

  **logs**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set the event output mode to logs reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE); gala-gopher then works in logs mode, writing events as log files into the configured directory. Users can read the files in that directory to obtain the events reported by gala-gopher and forward them to a kafka channel.

  **kafka client**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#event), set the event output mode to kafka reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports events periodically.

- **meta file integration**

  **logs**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set the meta output mode to logs reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#logs%E9%85%8D%E7%BD%AE); gala-gopher then works in logs mode, writing all of its integrated meta files as logs into the configured directory. Users must forward the meta information to a kafka channel.

  **kafka client**: following the gala-gopher configuration [manual](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#meta), set the meta output mode to kafka reporting and configure the reporting [channel](https://gitee.com/openeuler/gala-gopher/blob/master/doc/conf_introduction.md#kafka%E9%85%8D%E7%BD%AE); gala-gopher then works as a kafka client and reports meta information periodically.
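As a quick check of the exporter mode, the endpoint can be fetched with any HTTP client and parsed as prometheus text format. Below is a minimal Python sketch; the address and port are placeholders to be taken from the webserver channel configuration, not defaults guaranteed by gala-gopher:

```python
# Minimal scrape of the gala-gopher prometheus exporter endpoint.
# ENDPOINT is an assumption -- take host/port from the webserver channel config.
import requests

ENDPOINT = "http://localhost:8888"  # hypothetical address

resp = requests.get(ENDPOINT, timeout=5)
resp.raise_for_status()

# Prometheus text format: "<metric>{<labels>} <value>"; '#' lines are comments.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        name_and_labels, _, value = line.rpartition(" ")
        print(name_and_labels, "=", value)
```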
### Extending Probes

## gala-spider

Role

Principles and Terminology

Supported Technologies

Installation and Usage

Extending Observation Entities and Relationships

## gala-anteater

Role

Principles and Terminology

Installation and Usage

## gala-inference

Role

Principles and Terminology

Installation and Usage

# Scenarios

## Architecture Awareness

-gala-ops turns system observation white-box: by defining system observation entity types and the relationships between types, it builds horizontal and vertical topology structures.
+gala-ops turns system observation white-box: by defining system observation entities and the relationships between them, it builds horizontal and vertical topology structures. The horizontal topology is computed from the TCP/IP communication links between processes and shows the live business-flow state of the cluster; the vertical topology is computed from system domain knowledge and shows the live runtime context of the software.

-### Supported Entity Types
+### Observation Entities

The horizontal and vertical topologies display all observation entities in the system; gala-ops supports the following entities:

Host (host)

@@ -207,24 +272,65 @@

Selecting an object in the horizontal topology (for example an application instance, a process instance or a Node instance in the figure above) shows the vertical topology from that object instance's perspective.

-The figures below show, from left to right, the vertical topology from the perspective of an application instance, a process instance and a Node instance.
-
-![](./vertical_topology.png)
+The figures below show, from left to right, the vertical topology from the perspective of an application instance, a process instance and a Node instance.![](./vertical_topology.png)

Vertical topology API: ?

In addition, to allow free switching between the horizontal and vertical topology views, any application instance, process instance or Node instance can be selected in the vertical topology view to jump to the horizontal topology. The corresponding API: ?

## Anomaly Detection

-gala-ops provides two kinds of anomaly detection: system anomalies (also called latent system risks) and application KPI anomalies. The former covers system anomaly scenarios in networking, disk I/O, scheduling, memory, file systems and so on; the latter covers common application latency KPI anomalies (including redis, HTTP(S), PG, etc.), and users can extend the KPI set for their own scenarios (the KPI data must follow the time-series data specification).
+gala-ops provides two kinds of anomaly detection: system anomalies (also called latent system risks) and application anomalies. The former covers system anomaly scenarios in networking, disk I/O, scheduling, memory, file systems and so on; the latter covers common application latency KPI anomalies (including redis, HTTP(S), PG, etc.), and users can extend the KPI set for their own scenarios (the KPI data must follow the time-series data specification).

Anomaly detection results identify the affected observation entity and the anomaly cause. Users can obtain real-time system anomaly information from a kafka topic. Let us walk through an example result:

```
{
  "Timestamp": 1586960586000000000,        // anomaly event timestamp
  "event_id": "1586xxx_xxxx",              // anomaly event ID
  "Attributes": {
    "entity_id": "xx",                     // ID of the affected observation entity (cluster-wide unique)
    "event_id": "1586xxx_xxxx",            // anomaly event ID (same as above)
    "event_type": "sys",                   // anomaly event type (sys: system anomaly, app: application anomaly)
    "data": [....],                        // optional
    "duration": 30,                        // optional
    "occurred count": 6                    // optional
  },
  "Resource": {
    "metrics": "gala_gopher_block_count_iscsi_err"  // metric that raised the anomaly
  },
  "SeverityText": "WARN",                  // anomaly severity
  "SeverityNumber": 13,                    // anomaly severity number
  "Body": "20200415T072306-0700 WARN Entity(xx) Iscsi errors(2) occurred on Block(sda1, disk sda)."  // anomaly event description
}
```

After subscribing to anomaly events via kafka, users can manage them in tabular form, presented by time period, for example:

| Time              | Anomaly event ID | Observation entity ID | Metrics                           | Description                                                  |
| ----------------- | ---------------- | --------------------- | --------------------------------- | ------------------------------------------------------------ |
| 11:23:54 CST 2022 | 1586xxx_xxxx     | xxx_xxxx              | gala_gopher_block_count_iscsi_err | 20200415T072306-0700 WARN Entity(xx) Iscsi errors(2) occurred on Block(sda1, disk sda). |

**Note**: within a given time window, the same observation entity may report the same anomaly repeatedly (with different event IDs), so events should be deduplicated on **observation entity ID + Metrics**, as in the sketch below.
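A minimal deduplication sketch under these assumptions: events arrive on a kafka topic (the broker address and topic name below are placeholders), the payload follows the JSON structure shown above, and the kafka-python package is used:

```python
# Deduplicate gala-ops anomaly events by (entity ID, metrics) within a
# time window. Broker address, topic name and window size are assumptions.
import json
from kafka import KafkaConsumer

WINDOW_NS = 60 * 1_000_000_000                   # 60 s dedup window
consumer = KafkaConsumer("gala_anteater_event",  # hypothetical topic
                         bootstrap_servers="localhost:9092")

last_seen = {}                                   # (entity_id, metrics) -> timestamp
for msg in consumer:
    event = json.loads(msg.value)
    key = (event["Attributes"]["entity_id"], event["Resource"]["metrics"])
    ts = event["Timestamp"]
    if key in last_seen and ts - last_seen[key] < WINDOW_NS:
        continue                                 # repeat of a recent anomaly: drop
    last_seen[key] = ts
    print(event["Body"])                         # keep first occurrence in the window
```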
To better show where in the cluster an anomaly event occurred, the observation entity ID in the anomaly table can be used to jump to the vertical topology view, for example:

![](./host_hrchitecture.png)

## Root Cause Localization

## Full-Stack Hotspot Analysis

-# Customer Cases
+# User Cases

# Partners
\ No newline at end of file
diff --git a/gopher_tech.md b/gopher_tech.md
new file mode 100644
index 0000000..c75cd74
--- /dev/null
+++ b/gopher_tech.md
@@ -0,0 +1,297 @@
# TCP (entity_name: tcp_link)

| metrics_name        | table_name  | metrics_type | unit             | metrics description |
| ------------------- | ----------- | ------------ | ---------------- | ------------------- |
| tgid                |             | key          |                  | Process ID |
| role                |             | key          |                  | Client or server |
| client_ip           |             | key          |                  | Client side: local IP; server side: peer IP |
| server_ip           |             | key          |                  | Client side: peer IP; server side: local IP |
| client_port         |             | key          |                  | Client side: local port; server side: peer port |
| server_port         |             | key          |                  | Client side: peer port; server side: local port |
| protocol            |             | key          |                  | Protocol family (IPv4, IPv6) |
| rx_bytes            | tcp_tx_rx   | Gauge        | bytes            | rx bytes |
| tx_bytes            | tcp_tx_rx   | Gauge        | bytes            | tx bytes |
| rto                 | tcp_rate    | Gauge        | us               | Retransmission timeout (us) |
| ato                 | tcp_rate    | Gauge        | us               | Estimated value of delayed ACK (us) |
| srtt                | tcp_rtt     | Gauge        | us               | Smoothed round-trip time (us) |
| snd_ssthresh        | tcp_rate    | Gauge        |                  | Slow start threshold for congestion control |
| rcv_ssthresh        | tcp_rate    | Gauge        |                  | Current receive window size |
| snd_cwnd            | tcp_windows | Gauge        |                  | Congestion control window size |
| advmss              | tcp_rate    | Gauge        |                  | Local MSS upper limit |
| reordering          | tcp_windows | Gauge        |                  | Segments to be reordered |
| rcv_rtt             | tcp_rtt     | Gauge        | us               | Receive-end RTT (unidirectional measurement) |
| rcv_space           | tcp_rate    | Gauge        |                  | Current receive buffer size |
| notsent_bytes       | tcp_windows | Gauge        | bytes            | Number of bytes not yet sent |
| notack_bytes        | tcp_windows | Gauge        | bytes            | Number of bytes not yet acked |
| snd_wnd             | tcp_windows | Gauge        |                  | Size of TCP send window |
| rcv_wnd             | tcp_windows | Gauge        |                  | Size of TCP receive window |
| delivery_rate       | tcp_rate    | Gauge        |                  | Current transmit rate (may differ from the actual value by a multiple) |
| busy_time           | tcp_rate    | Gauge        |                  | Time (jiffies) busy sending data |
| rwnd_limited        | tcp_rate    | Gauge        |                  | Time (jiffies) limited by receive window |
| sndbuf_limited      | tcp_rate    | Gauge        |                  | Time (jiffies) limited by send buffer |
| pacing_rate         | tcp_rate    | Gauge        | bytes per second | TCP pacing rate |
| max_pacing_rate     | tcp_rate    | Gauge        | bytes per second | Max TCP pacing rate |
| sk_err_que_size     | tcp_sockbuf | Gauge        |                  | Size of error queue in sock |
| sk_rcv_que_size     | tcp_sockbuf | Gauge        |                  | Size of receive queue in sock |
| sk_wri_que_size     | tcp_sockbuf | Gauge        |                  | Size of write queue in sock |
| syn_srtt            | tcp_srtt    | Gauge        | us               | RTT of SYN packet (us) |
| sk_backlog_size     | tcp_sockbuf | Gauge        |                  | Size of backlog queue in sock |
| sk_omem_size        | tcp_sockbuf | Gauge        |                  | Size of omem in sock |
| sk_forward_size     | tcp_sockbuf | Gauge        |                  | Size of forward allocations in sock |
| sk_wmem_size        | tcp_sockbuf | Gauge        |                  | Size of wmem in sock |
| segs_in             | tcp_tx_rx   | Counter      | segs             | Total number of segments received |
| segs_out            | tcp_tx_rx   | Counter      | segs             | Total number of segments sent |
| retran_packets      | tcp_abn     | Gauge        |                  | Total number of retransmissions |
| backlog_drops       | tcp_abn     | Gauge        |                  | Drops caused by a full backlog queue |
| sk_drops            | tcp_abn     | Counter      |                  | TCP drop counter |
| lost_out            | tcp_abn     | Gauge        |                  | TCP lost counter |
| sacked_out          | tcp_abn     | Gauge        |                  | TCP SACKed-out counter |
| filter_drops        | tcp_abn     | Gauge        |                  | Drops caused by socket filter |
| tmout_count         | tcp_abn     | Gauge        |                  | Counter of TCP link timeouts |
| snd_buf_limit_count | tcp_abn     | Gauge        |                  | Counter of limits when allocating wmem |
| rmem_scheduls       | tcp_abn     | Gauge        |                  | rmem is not enough |
| tcp_oom             | tcp_abn     | Gauge        |                  | TCP out of memory |
| send_rsts           | tcp_abn     | Gauge        |                  | RSTs sent |
| receive_rsts        | tcp_abn     | Gauge        |                  | RSTs received |
| sk_err              | tcp_abn     | Gauge        |                  | sk_err |
| sk_err_soft         | tcp_abn     | Gauge        |                  | sk_err_soft |
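Once these metrics have landed in prometheus (via either metrics integration path described earlier), the tcp_link tables can be queried like any other time series. Below is a sketch using the prometheus HTTP API; the address and the `gala_gopher_tcp_link_*` metric naming are assumptions modeled on the `gala_gopher_block_count_iscsi_err` name shown in the anomaly example earlier:

```python
# Query TCP link receive throughput from prometheus via its HTTP API.
# The prometheus address and the exact metric name are assumptions.
import requests

PROM = "http://localhost:9090"                     # hypothetical address
query = "rate(gala_gopher_tcp_link_rx_bytes[5m])"  # assumed metric name

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=5)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]                      # carries the tcp_link keys
    _ts, value = series["value"]
    print(labels.get("server_ip"), labels.get("server_port"), value)
```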
# ENDPOINT

| metrics_name        | table_name | metrics_type | unit  | metrics description |
| ------------------- | ---------- | ------------ | ----- | ------------------- |
| tgid                |            | key          |       | Process ID |
| s_addr              |            | key          |       | udp/tcp local address |
| s_port              |            | key          |       | Listen port (present only for listen objects) |
| ep_type             |            | key          |       | listen/connect/udp/bind |
| listendrop          | listen     | Gauge        |       | TCP accept drop count (listen objects only) |
| accept_overflow     | listen     | Gauge        |       | TCP accept queue overflow count |
| syn_overflow        | listen     | Gauge        |       | TCP SYN queue overflow count |
| passive_open        | listen     | Gauge        |       | Passive TCP connection attempts (listen objects only) |
| passive_open_failed | listen     | Gauge        |       | Failed passive TCP connection attempts (listen objects only) |
| retran_synacks      | listen     | Gauge        |       | Retransmitted TCP SYN-ACK segments |
| active_open         | connect    | Gauge        |       | Active TCP connection attempts (connect objects only) |
| active_open_failed  | connect    | Gauge        |       | Failed active TCP connection attempts (connect objects only) |
| bind_rcv_drops      | bind       | Gauge        |       | UDP receive failure count (udp/bind objects) |
| bind_sends          | bind       | Gauge        | bytes | UDP bytes sent (udp/bind objects) |
| bind_rcvs           | bind       | Gauge        | bytes | UDP bytes received (udp/bind objects) |
| bind_err            | bind       | Gauge        |       | UDP receive error code (udp/bind objects) |
| udp_rcv_drops       | udp        | Gauge        |       | UDP receive failure count (udp/bind objects) |
| udp_sends           | udp        | Gauge        | bytes | UDP bytes sent (udp/bind objects) |
| udp_rcvs            | udp        | Gauge        | bytes | UDP bytes received (udp/bind objects) |
| udp_err             | udp        | Gauge        |       | UDP receive error code (udp/bind objects) |

# QDISC

| metrics_name | table_name | metrics_type | unit | metrics description |
| ------------ | ---------- | ------------ | ---- | ------------------- |
| dev_name     | qdisc      | key          |      | NIC device name |
| handle       | qdisc      | key          |      | Device handle |
| ifindex      | qdisc      | key          |      | Interface index of qdisc |
| kind         | qdisc      | label        |      | Kind of qdisc |
| netns        | qdisc      | label        |      | net namespace |
| qlen         | qdisc      | Gauge        |      | Queue length |
| backlog      | qdisc      | Gauge        |      | Backlog queue length |
| drops        | qdisc      | Counter      |      | Dropped packets |
| requeues     | qdisc      | Counter      |      | Requeue count (egress) |
| overlimits   | qdisc      | Counter      |      | Overlimit count |

# THREAD (entity_name: task)

| metrics_name    | table_name | metrics_type | unit  | metrics description |
| --------------- | ---------- | ------------ | ----- | ------------------- |
| pid             | thread     | key          |       | Thread PID |
| tgid            | thread     | label        |       | Owning process ID |
| comm            | thread     | label        |       | Name of the owning process |
| off_cpu_ns      | thread     | Gauge        | ns    | Maximum off-cpu scheduling time of the task. Measurement: 1. a kprobe on finish_task_switch reads the prev task (pid), the current time and the current CPU (bpf_get_smp_processor_id()) and records them in a map keyed by pid/cpu; 2. on the next finish_task_switch hit, bpf_get_current_pid_tgid() yields the current pid and bpf_get_smp_processor_id() the current CPU; matching them against the record from step 1 and taking the time difference gives one off-cpu interval (see the sketch after this table). Notes: 1. idle (pid=0) is filtered out; 2. only the maximum off-cpu value is recorded. |
| migration_count | thread     | Gauge        |       | Number of task migrations between CPUs |
| iowait_us       | thread     | Gauge        | us    | Task I/O wait time (us) |
| bio_bytes_write | thread     | Gauge        | bytes | Bytes written by bio operations issued by the task |
| bio_bytes_read  | thread     | Gauge        | bytes | Bytes read by bio operations issued by the task |
| bio_err_count   | thread     | Gauge        |       | Number of failed bio operations issued by the task |
| hang_count      | thread     | Gauge        |       | Number of I/O hangs of the task |
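The off_cpu_ns scheme can be illustrated with a short bcc sketch. This is not gala-gopher's probe code: it is a minimal reimplementation of the two-step measurement described in the table, simplified to key the start map by pid only rather than pid/cpu, and the finish_task_switch symbol may need adjusting on kernels that rename it (e.g. finish_task_switch.isra.0):

```python
#!/usr/bin/env python3
# Minimal bcc sketch of the off-cpu measurement described above (illustrative
# only). Step 1: stamp the task being switched out. Step 2: on a later switch,
# compute the delta for the task being switched back in; keep only the maximum.
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

BPF_HASH(start, u32, u64);       // pid -> ns timestamp when scheduled out
BPF_HASH(max_offcpu, u32, u64);  // pid -> max observed off-cpu time (ns)

int kprobe__finish_task_switch(struct pt_regs *ctx, struct task_struct *prev)
{
    u64 ts = bpf_ktime_get_ns();
    u32 prev_pid = prev->pid;
    if (prev_pid != 0)                      // filter idle (pid = 0)
        start.update(&prev_pid, &ts);

    u32 pid = bpf_get_current_pid_tgid();   // task being switched in
    u64 *tsp = start.lookup(&pid);
    if (tsp) {
        u64 delta = ts - *tsp;
        u64 *maxp = max_offcpu.lookup(&pid);
        if (maxp == 0 || delta > *maxp)     // record only the maximum
            max_offcpu.update(&pid, &delta);
        start.delete(&pid);
    }
    return 0;
}
"""

b = BPF(text=prog)
sleep(10)  # sample for 10 seconds, then dump per-task maxima
for pid, ns in sorted(b["max_offcpu"].items(), key=lambda kv: -kv[1].value):
    print("pid=%-8d max_offcpu_ns=%d" % (pid.value, ns.value))
```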
# Process

| metrics_name               | table_name         | metrics_type | unit | metrics description |
| -------------------------- | ------------------ | ------------ | ---- | ------------------- |
| tgid                       |                    | key          |      | Process ID |
| ppid                       | system_proc        | label        |      | Parent process ID |
| comm                       | system_proc        | label        |      | Executable name |
| cmdline                    | system_proc        | label        |      | Command line (including arguments) |
| container id               | system_proc        | label        |      | ID (short form) of the container the process belongs to |
| proc_shared_dirty_size     | system_proc        | Gauge        |      | Shared dirty page size of the process |
| proc_shared_clean_size     | system_proc        | Gauge        |      | Shared clean page size of the process |
| proc_private_dirty_size    | system_proc        | Gauge        |      | Private dirty page size of the process |
| proc_private_clean_size    | system_proc        | Gauge        |      | Private clean page size of the process |
| proc_referenced_size       | system_proc        | Gauge        |      | Currently referenced page size of the process |
| proc_lazyfree_size         | system_proc        | Gauge        |      | Size of process memory pending lazy free |
| proc_swap_data_size        | system_proc        | Gauge        |      | Swap data size of the process |
| proc_swap_data_pss_size    | system_proc        | Gauge        |      | Proportional (PSS) swap data size of the process |
| fd_count                   | system_proc        | Gauge        |      | Number of file descriptors held by the process |
| fd_free_per                | system_proc        | Gauge        |      | Percentage of remaining FD resources (%) |
| proc_utime_jiffies         | system_proc        | Gauge        |      | User-mode run time of the process (jiffies) |
| proc_stime_jiffies         | system_proc        | Gauge        |      | Kernel-mode run time of the process (jiffies) |
| proc_minor pagefault_count | system_proc        | Gauge        |      | Minor page faults of the process (no copy from disk needed) |
| proc_major pagefault_count | system_proc        | Gauge        |      | Major page faults of the process (copy from disk needed) |
| proc_vm_size               | system_proc        | Gauge        |      | Current virtual address space size of the process |
| proc_pm_size               | system_proc        | Gauge        |      | Current physical memory size of the process |
| proc_rchar_bytes           | system_proc        | Gauge        |      | Bytes read by the process via syscalls to the FS |
| proc_wchar_bytes           | system_proc        | Gauge        |      | Bytes written by the process via syscalls to the FS |
| proc_syscr_count           | system_proc        | Gauge        |      | Number of read()/pread() calls by the process |
| proc_syscw_count           | system_proc        | Gauge        |      | Number of write()/pwrite() calls by the process |
| proc_read_bytes            | system_proc        | Gauge        |      | Bytes actually read from disk by the process |
| proc_write_bytes           | system_proc        | Gauge        |      | Bytes actually written to disk by the process (with page cache this field only reflects the size of pages marked dirty) |
| proc_cancelled_write_bytes | system_proc        | Gauge        |      | See proc_write_bytes: with page cache, if the file is deleted after write() completes, the dirty pages never reach disk, hence this count of cancelled written bytes |
| ns_ext4_read               | proc_ext4          | Gauge        | ns   | ext4 read operation time (ns) |
| ns_ext4_write              | proc_ext4          | Gauge        | ns   | ext4 write operation time (ns) |
| ns_ext4_flush              | proc_ext4          | Gauge        | ns   | ext4 flush operation time (ns) |
| ns_ext4_open               | proc_ext4          | Gauge        | ns   | ext4 open operation time (ns) |
| ns_overlay_read            | proc_overlay       | Gauge        | ns   | overlayfs read operation time (ns) |
| ns_overlay_write           | proc_overlay       | Gauge        | ns   | overlayfs write operation time (ns) |
| ns_overlay_flush           | proc_overlay       | Gauge        | ns   | overlayfs flush operation time (ns) |
| ns_overlay_open            | proc_overlay       | Gauge        | ns   | overlayfs open operation time (ns) |
| ns_tmpfs_read              | proc_tmpfs         | Gauge        | ns   | tmpfs read operation time (ns) |
| ns_tmpfs_write             | proc_tmpfs         | Gauge        | ns   | tmpfs write operation time (ns) |
| ns_tmpfs_flush             | proc_tmpfs         | Gauge        | ns   | tmpfs flush operation time (ns) |
| reclaim_ns                 | proc_page          | Gauge        | ns   | Page reclaim time triggered by the process (swap operation), in ns |
| access_pagecache           | proc_page          | Gauge        |      | Number of page accesses triggered by the process |
| mark_buffer_dirty          | proc_page          | Gauge        |      | Number of page buffer dirtyings triggered by the process |
| load_page_cache            | proc_page          | Gauge        |      | Number of pages added to the page cache by the process |
| mark_page_dirty            | proc_page          | Gauge        |      | Number of pages marked dirty by the process |
| ns_gethostname             | proc_dns           | Gauge        | ns   | Time for the process to resolve a DNS name to an address (ns) |
| gethostname_failed         | proc_dns           | Gauge        |      | Number of failed DNS resolutions by the process |
| ns_mount                   | proc_syscall_io    | Gauge        | ns   | Duration of the mount syscall (ns) |
| ns_umount                  | proc_syscall_io    | Gauge        | ns   | Duration of the umount syscall (ns) |
| ns_read                    | proc_syscall_io    | Gauge        | ns   | Duration of the read syscall (ns) |
| ns_write                   | proc_syscall_io    | Gauge        | ns   | Duration of the write syscall (ns) |
| ns_sendmsg                 | proc_syscall_net   | Gauge        | ns   | Duration of the sendmsg syscall (ns) |
| ns_recvmsg                 | proc_syscall_net   | Gauge        | ns   | Duration of the recvmsg syscall (ns) |
| ns_sched_yield             | proc_syscall_sched | Gauge        | ns   | Duration of the sched_yield syscall (ns) |
| ns_futex                   | proc_syscall_sched | Gauge        | ns   | Duration of the futex syscall (ns) |
| ns_epoll_wait              | proc_syscall_sched | Gauge        | ns   | Duration of the epoll_wait syscall (ns) |
| ns_epoll_pwait             | proc_syscall_sched | Gauge        | ns   | Duration of the epoll_pwait syscall (ns) |
| ns_fork                    | proc_syscall_fork  | Gauge        | ns   | Duration of the fork syscall (ns) |
| ns_vfork                   | proc_syscall_fork  | Gauge        | ns   | Duration of the vfork syscall (ns) |
| ns_clone                   | proc_syscall_fork  | Gauge        | ns   | Duration of the clone syscall (ns) |
| syscall_failed             | proc_syscall       | Gauge        |      | Number of failed syscalls by the process |
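Several of the system_proc fields mirror standard /proc interfaces (for example /proc/\<pid\>/io and /proc/\<pid\>/fd), which makes them easy to sanity-check by hand. A minimal sketch using only those standard kernel interfaces:

```python
# Read a few system_proc-style metrics straight from /proc.
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())

# fd_count: one symlink per open descriptor under /proc/<pid>/fd
fd_count = len(os.listdir(f"/proc/{pid}/fd"))

# rchar/wchar/syscr/syscw/read_bytes/write_bytes/cancelled_write_bytes
# all come from /proc/<pid>/io ("key: value" per line)
io_stats = {}
with open(f"/proc/{pid}/io") as f:
    for line in f:
        key, _, value = line.partition(":")
        io_stats[key] = int(value)

print(f"fd_count={fd_count}")
print(f"proc_rchar_bytes={io_stats['rchar']}")
print(f"proc_wchar_bytes={io_stats['wchar']}")
print(f"proc_read_bytes={io_stats['read_bytes']}")
print(f"proc_write_bytes={io_stats['write_bytes']}")
print(f"proc_cancelled_write_bytes={io_stats['cancelled_write_bytes']}")
```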
# BLOCK

| metrics_name            | table_name | metrics_type | unit | metrics description |
| ----------------------- | ---------- | ------------ | ---- | ------------------- |
| major                   | block      | key          |      | Major number of the block object |
| first_minor             | block      | key          |      | Minor number of the block object |
| blk_type                | block      | label        |      | Block object type (e.g. disk, part) |
| blk_name                | block      | label        |      | Block object name |
| disk_name               | block      | label        |      | Owning disk name |
| latency_req_max         | block      | Gauge        | us   | Maximum request latency at the block layer |
| latency_req_last        | block      | Gauge        | us   | Latest request latency at the block layer |
| latency_req_sum         | block      | Gauge        | us   | Total request latency at the block layer |
| latency_req_jitter      | block      | Gauge        | us   | Request latency jitter at the block layer |
| count_latency_req       | block      | Gauge        |      | Number of request operations at the block layer |
| latency_flush_max       | block      | Gauge        | us   | Maximum flush latency at the block layer |
| latency_flush_last      | block      | Gauge        | us   | Latest flush latency at the block layer |
| latency_flush_sum       | block      | Gauge        | us   | Total flush latency at the block layer |
| latency_flush_jitter    | block      | Gauge        | us   | Flush latency jitter at the block layer |
| count_latency_flush     | block      | Gauge        |      | Number of flush operations at the block layer |
| latency_driver_max      | block      | Gauge        | us   | Maximum latency at the driver layer |
| latency_driver_last     | block      | Gauge        | us   | Latest latency at the driver layer |
| latency_driver_sum      | block      | Gauge        | us   | Total latency at the driver layer |
| latency_driver_jitter   | block      | Gauge        | us   | Latency jitter at the driver layer |
| count_latency_driver    | block      | Gauge        |      | Number of operations at the driver layer |
| latency_device_max      | block      | Gauge        | us   | Maximum latency at the device layer |
| latency_device_last     | block      | Gauge        | us   | Latest latency at the device layer |
| latency_device_sum      | block      | Gauge        | us   | Total latency at the device layer |
| latency_device_jitter   | block      | Gauge        | us   | Latency jitter at the device layer |
| count_latency_device    | block      | Gauge        |      | Number of operations at the device layer |
| count_iscsi_tmout       | block      | Gauge        |      | Number of iSCSI-layer operation timeouts |
| count_iscsi_err         | block      | Gauge        |      | Number of iSCSI-layer operation failures |
| conn_err_bad_opcode     | block      | Gauge        |      | Number of iSCSI tp-layer bad opcodes |
| conn_err_xmit_failed    | block      | Gauge        |      | Number of iSCSI tp-layer transmit failures |
| conn_err_tmout          | block      | Gauge        |      | Number of iSCSI tp-layer timeouts |
| conn_err_connect_failed | block      | Gauge        |      | Number of iSCSI tp-layer connection failures |
| count_sas_abort         | block      | Gauge        |      | Number of iSCSI SAS-layer aborts |
| access_pagecache        | block      | Gauge        |      | Number of block page accesses |
| mark_buffer_dirty       | block      | Gauge        |      | Number of block page buffer dirtyings |
| load_page_cache         | block      | Gauge        |      | Number of block pages added to the page cache |
| mark_page_dirty         | block      | Gauge        |      | Number of block pages marked dirty |
# Container

| metrics_name                           | table_name        | metrics_type | unit    | metrics description |
| -------------------------------------- | ----------------- | ------------ | ------- | ------------------- |
| container_id                           | container         | key          |         | Container ID (short form) |
| name                                   | container         | label        |         | Container name |
| cpucg_inode                            | container         | label        |         | cpu,cpuacct cgroup ID (inode of the container's cgroup directory) |
| memcg_inode                            | container         | label        |         | memory cgroup ID (inode of the container's cgroup directory) |
| pidcg_inode                            | container         | label        |         | pids cgroup ID (inode of the container's cgroup directory) |
| mnt_ns_id                              | container         | label        |         | mount namespace |
| net_ns_id                              | container         | label        |         | net namespace |
| proc_id                                | container         | label        |         | Main process ID of the container |
| blkio_device_usage_total               | container_blkio   | Gauge        | bytes   | Blkio device bytes usage |
| cpu_load_average_10s                   | container_cpu     | Gauge        |         | Value of container cpu load average over the last 10 seconds |
| cpu_system_seconds_total               | container_cpu     | Gauge        | seconds | Cumulative system cpu time consumed |
| cpu_usage_seconds_total                | container_cpu     | Gauge        | seconds | Cumulative cpu time consumed |
| cpu_user_seconds_total                 | container_cpu     | Gauge        | seconds | Cumulative user cpu time consumed |
| fs_inodes_free                         | container_fs      | Gauge        |         | Number of available Inodes |
| fs_inodes_total                        | container_fs      | Gauge        |         | Total number of Inodes |
| fs_io_current                          | container_fs      | Gauge        |         | Number of I/Os currently in progress |
| fs_io_time_seconds_total               | container_fs      | Gauge        | seconds | Cumulative count of seconds spent doing I/Os |
| fs_io_time_weighted_seconds_total      | container_fs      | Gauge        | seconds | Cumulative weighted I/O time |
| fs_limit_bytes                         | container_fs      | Gauge        | bytes   | Number of bytes that can be consumed by the container on this filesystem |
| fs_read_seconds_total                  | container_fs      | Gauge        | seconds | Cumulative count of seconds spent reading |
| fs_reads_bytes_total                   | container_fs      | Gauge        | bytes   | Cumulative count of bytes read |
| fs_reads_merged_total                  | container_fs      | Gauge        |         | Cumulative count of reads merged |
| fs_reads_total                         | container_fs      | Gauge        |         | Cumulative count of reads completed |
| fs_sector_reads_total                  | container_fs      | Gauge        |         | Cumulative count of sector reads completed |
| fs_sector_writes_total                 | container_fs      | Gauge        |         | Cumulative count of sector writes completed |
| fs_usage_bytes                         | container_fs      | Gauge        | bytes   | Number of bytes that are consumed by the container on this filesystem |
| fs_write_seconds_total                 | container_fs      | Gauge        | seconds | Cumulative count of seconds spent writing |
| fs_writes_bytes_total                  | container_fs      | Gauge        | bytes   | Cumulative count of bytes written |
| fs_writes_merged_total                 | container_fs      | Gauge        |         | Cumulative count of writes merged |
| fs_writes_total                        | container_fs      | Gauge        |         | Cumulative count of writes completed |
| memory_cache                           | container_memory  | Gauge        | bytes   | Total page cache memory |
| memory_failcnt                         | container_memory  | Gauge        |         | Number of memory usage hits limits |
| memory_failures_total                  | container_memory  | Gauge        |         | Cumulative count of memory allocation failures |
| memory_mapped_file                     | container_memory  | Gauge        | bytes   | Size of memory mapped files |
| memory_max_usage_bytes                 | container_memory  | Gauge        | bytes   | Maximum memory usage recorded |
| memory_rss                             | container_memory  | Gauge        | bytes   | Size of RSS |
| memory_swap                            | container_memory  | Gauge        | bytes   | Container swap usage |
| memory_usage_bytes                     | container_memory  | Gauge        | bytes   | Current memory usage, including all memory regardless of when it was accessed |
| memory_working_set_bytes               | container_memory  | Gauge        | bytes   | Current working set |
| network_receive_bytes_total            | container_network | Gauge        | bytes   | Cumulative count of bytes received |
| network_receive_errors_total           | container_network | Gauge        |         | Cumulative count of errors encountered while receiving |
| network_receive_packets_dropped_total  | container_network | Gauge        |         | Cumulative count of packets dropped while receiving |
| network_receive_packets_total          | container_network | Gauge        |         | Cumulative count of packets received |
| network_transmit_bytes_total           | container_network | Gauge        | bytes   | Cumulative count of bytes transmitted |
| network_transmit_errors_total          | container_network | Gauge        |         | Cumulative count of errors encountered while transmitting |
| network_transmit_packets_dropped_total | container_network | Gauge        |         | Cumulative count of packets dropped while transmitting |
| network_transmit_packets_total         | container_network | Gauge        |         | Cumulative count of packets transmitted |
| oom_events_total                       | container_oom     | Gauge        |         | Count of out of memory events observed for the container |
| spec_cpu_period                        | container_spec    | Gauge        |         | CPU period of the container |
| spec_cpu_shares                        | container_spec    | Gauge        |         | CPU share of the container |
| spec_memory_limit_bytes                | container_spec    | Gauge        | bytes   | Memory limit for the container |
| spec_memory_reservation_limit_bytes    | container_spec    | Gauge        | bytes   | Memory reservation limit for the container |
| spec_memory_swap_limit_bytes           | container_spec    | Gauge        | bytes   | Memory swap limit for the container |
| start_time_seconds                     | container_start   | Gauge        | seconds | Start time of the container since unix epoch |
| tasks_state                            | container_tasks   | Gauge        |         | Number of tasks in given state (sleeping, running, stopped, uninterruptible, or iowaiting) |
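The container_memory fields come from the container's memory cgroup (the same directory whose inode appears in the memcg_inode label). A minimal cgroup v1 sketch for reading a few of them directly; the /sys/fs/cgroup/memory/docker/\<id\> layout is an assumption that varies with the container runtime and distribution:

```python
# Read container memory metrics from the memory cgroup (cgroup v1 layout).
# The /sys/fs/cgroup/memory/docker/<id> path is an assumption.
import sys

container_id = sys.argv[1]  # full container ID
cg = f"/sys/fs/cgroup/memory/docker/{container_id}"

def read_int(name):
    with open(f"{cg}/{name}") as f:
        return int(f.read())

print("memory_usage_bytes     =", read_int("memory.usage_in_bytes"))
print("memory_max_usage_bytes =", read_int("memory.max_usage_in_bytes"))
print("memory_failcnt         =", read_int("memory.failcnt"))

# memory_cache / memory_rss are individual counters inside memory.stat
with open(f"{cg}/memory.stat") as f:
    stat = dict(line.split() for line in f)
print("memory_cache =", stat["cache"])
print("memory_rss   =", stat["rss"])
```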
# DISK

| metrics_name | table_name    | metrics_type | unit       | metrics description |
| ------------ | ------------- | ------------ | ---------- | ------------------- |
| disk_name    | system_iostat | key          |            | Physical disk name of the block device |
| rspeed       | system_iostat | gauge        | ops/second | Read rate (IOPS) |
| rspeed_kB    | system_iostat | gauge        | KB/second  | Read throughput |
| r_await      | system_iostat | gauge        | ms         | Read response time |
| rareq        | system_iostat | gauge        |            | Saturation (rareq-sz, together with response time) |
| wspeed       | system_iostat | gauge        | ops/second | Write rate (IOPS) |
| wspeed_kB    | system_iostat | gauge        | KB/second  | Write throughput |
| w_await      | system_iostat | gauge        | ms         | Write response time |
| wareq        | system_iostat | gauge        |            | Saturation (wareq-sz, together with response time) |
| util         | system_iostat | gauge        | %          | Disk utilization |

diff --git a/system_integration.png b/system_integration.png
new file mode 100644
index 0000000000000000000000000000000000000000..8cf52f0c01a6bbf2f30389f260114173c911073a
GIT binary patch
literal 6409
[base85-encoded binary payload for system_integration.png omitted]