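# prometheus

**Repository Path**: bai-xyz/prometheus

## Basic Information

- **Project Name**: prometheus
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 3
- **Created**: 2025-01-07
- **Last Updated**: 2025-01-07

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

[TOC]

---

# Ready-to-use alerting rules

https://awesome-prometheus-alerts.grep.to/

https://blog.csdn.net/weixin_43798031/article/details/127488164

# prometheus

## Components

1. prometheus server
   1. Push gateway: a gateway that receives pushed metric data (it runs as a separate process that the server scrapes)
   2. storage: the built-in TSDB database
   3. rules and alerts: recording and alerting rules
   4. service discovery: dynamically discovers the targets to monitor
2. alertmanager
3. exporter
4. grafana

## Client collection modes: pull/push

1. pull: the client installs one of the existing exporters on the system. An exporter is an HTTP server that answers HTTP requests with key/value data, and Prometheus actively scrapes it.
2. push: the client (or server) installs the official pushgateway. Scripts written by the operations team organize the monitoring data as key/value pairs in metrics format and send them to the pushgateway, which Prometheus then scrapes.

## Exporters related to prometheus

Components in the K8S ecosystem all expose a `/metrics` endpoint for self-monitoring. The ones we currently use:

* cadvisor: built into the kubelet.
* kubelet: 10255 is the unauthenticated port, 10250 is the authenticated port.
* apiserver: port 6443; we care about request counts, latency, and so on.
* scheduler: port 10251 (the legacy insecure port; recent Kubernetes releases serve metrics on the secure port 10259).
* controller-manager: port 10252 (the legacy insecure port; recent Kubernetes releases serve metrics on the secure port 10257).
* etcd: for example etcd write and read latency, storage capacity.
* docker: requires enabling the experimental feature and configuring metrics-addr; exposes metrics such as container creation time.
* kube-proxy: listens on 127.0.0.1 by default, port 10249. For external scraping it can be changed to listen on 0.0.0.0; it exposes metrics such as the time spent writing iptables rules.
* kube-state-metrics: an official K8S project that collects metadata about resources such as pods and deployments.
* node-exporter: an official Prometheus project that collects machine metrics such as CPU, memory, and disk.
* blackbox_exporter: an official Prometheus project for network probing: DNS, ping, and HTTP monitoring.
* process-exporter: collects process metrics.
* nvidia exporter: we run GPU jobs, so GPU data needs to be monitored.
* node-problem-detector: also known as npd. Strictly speaking it is not an exporter, but it also monitors machine state and reports node problems by applying taints.
* Application-layer exporters: mysql, nginx, mq, and so on, depending on business needs.

`Four golden signals`: latency, traffic, errors, saturation

# Deployment

## Server deployment

### Binary deployment

Download: https://prometheus.io/download/

Startup flags: [reference](https://www.cnblogs.com/zhoujinyi/p/11934062.html)

```
# Check the configuration file:
./promtool check config ./prometheus.yml

# Start:
./prometheus --config.file=./prometheus.yml --web.listen-address="0.0.0.0:9090" --web.enable-lifecycle \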
  --log.level=warn --web.enable-admin-api --storage.tsdb.wal-compression \
  --storage.tsdb.path=./data --storage.tsdb.retention.time=15d \
  --web.read-timeout=5m --web.max-connections=512

# Flag reference:
--storage.tsdb.path=/prometheus        # base path for metric (data) storage
--web.enable-lifecycle                 # allow reload and shutdown through HTTP requests
--web.listen-address="0.0.0.0:9090"
--web.read-timeout=5m                  # timeout for idle connections, so they do not hold resources
--web.max-connections=512              # maximum number of connections
--web.external-url=                    # URL under which Prometheus is reachable externally, e.g. behind a reverse proxy
--web.cors.origin=".*"
--log.level=warn
--web.enable-admin-api                 # admin API, used for deleting and cleaning up data
--storage.tsdb.retention.time=15d      # how long to keep data, 15 days by default
--storage.tsdb.wal-compression         # compress the TSDB WAL
--rules.alert.for-grace-period=10m     # minimum duration between an alert and its restored "for" state
```

### docker-compose deployment

```
version: '3.8'
services:
  prom:
    container_name: prometheus
    image: prom/prometheus:v2.47.1
    restart: on-failure:5
    hostname: prometheus
    ports:
      - 9090:9090
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      # used for login authentication
      - "--web.config.file=/etc/prometheus/config.yml"
      - "--web.enable-lifecycle"
      - "--log.level=warn"
      - "--web.enable-admin-api"
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=7d"
    environment:
      TZ: Asia/Shanghai
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      # used for login authentication: htpasswd -nBC 12 '' | tr -d ':\n'
      #cat > ./config.yml< $basePath/$softVersion/node_exporter.log 2>&1 &
```

## pushgateway data collection scripts

pushgateway works in a passive push model and runs standalone on any node.

> Pros and cons
>
> > The main reasons for using it:
> >
> > * Prometheus uses a pull model. When Prometheus is not in the same subnet, or a firewall is in the way, it may be unable to scrape each target directly.
> > * When monitoring business data, data from different sources has to be aggregated so that Prometheus can collect it in one place.
>
> > Drawbacks:
> >
> > * Data from many nodes is aggregated in the pushgateway, so if the pushgateway goes down the impact is wider than losing individual targets.
> > * The `up` status that Prometheus scrapes applies only to the pushgateway and says nothing about each individual node.
> > * Pushgateway persists all monitoring data pushed to it. **Even after a monitored source has gone offline, prometheus keeps scraping the stale data, so unwanted data must be cleaned up from the pushgateway manually.**
> >   * curl -X DELETE
http://127.0.0.1:9091/metrics/job/Ping_check/instance/y (deletes a single instance)
> >   * curl -X DELETE http://172.16.11.198:9091/metrics/job/Ping_check (deletes a whole group)
> >   * curl -X PUT http://172.16.11.198:9091/api/v1/admin/wipe (deletes everything)

* Download and install

```
# Download from prometheus.io, then start:
./pushgateway --web.enable-admin-api

# Prometheus configuration
- job_name: 'pushgateway'
  static_configs:
    - targets:
        - 127.0.0.1:9091
      labels:
        instance: pushgateway
```

* Sample collection script (number of connections in a wait state)

```shell
#!/bin/bash
# Short host name, used later as the instance label
instance_name=$(hostname -f | cut -d '.' -f1)
# The host name must not be localhost, otherwise the labels cannot be told apart
if [ "$instance_name" = "localhost" ]; then
  echo "Must FQDN hostname"
  exit 1
fi

label="count_netstat_wait_connections"                                  # the metric name (key)
count_netstat_wait_connections=$(netstat -an | grep -i wait | wc -l)    # the metric value

# Print the sample, then push it
echo "$label $count_netstat_wait_connections"
echo "$label $count_netstat_wait_connections" | curl --data-binary @- "http://172.16.11.198:9091/metrics/job/pushgateway1/instance/$instance_name"

# Custom labels and pushing several metrics at once
cat <{//} # definition of the pushed data
```

* Getting ping data with shell

  1. Key commands

```shell
# Latency (average RTT)
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep rtt | awk -F '/' '{print $5}'
# -s  payload size (Linux ping defaults to 56 data bytes, 64 bytes with the ICMP header)
# -W  timeout while waiting for a reply
# -c  number of packets to send

# Packet loss
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep transmitted | awk '{print $6}'
```

  2.
Collect the metrics and upload them to pushgateway

```shell
#!/bin/bash
# Trap errors and signals and clean up gracefully (note: this wipes ALL data on the pushgateway)
trap 'rm -f tmp_push_prometheus.txt && curl -X PUT http://$pushgateway_addr/api/v1/admin/wipe && exit' ERR EXIT SIGINT

pushgateway_addr="172.16.11.198:9091"

# Monitoring went offline: delete the metrics of one instance
del_value() {
  curl -X DELETE "http://$pushgateway_addr/metrics/job/Ping_check/instance/$1"
}

# Push the metrics file to pushgateway
push_prometheus() {
  curl -X POST --data-binary @tmp_push_prometheus.txt "http://$pushgateway_addr/metrics/job/Ping_check/instance/$1"
}

# Collect the metrics and upload them to pushgateway
func() {
  # The index x pairs each entry of ips with the entry of ipLabels at the same position
  for x in "${!ips[@]}"; do
    ip=${ips[$x]}
    echo "$ip" "${ipLabels[$x]}"
    ping_lost_package=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 "$ip" | grep transmitted | awk '{print $6}' | awk -F '%' '{print $1}')
    ping_rtt_avg=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 "$ip" | grep rtt | awk -F '/' '{print $5}')
    # If either value is empty, clean up the pushgateway data and skip this target.
    # (The original test `[ ! -n $var ]` is never true for an empty unquoted variable.)
    if [ -z "$ping_lost_package" ] || [ -z "$ping_rtt_avg" ]; then
      del_value "$1"
      continue
    fi
    echo "ping_lost_package{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_lost_package" >>tmp_push_prometheus.txt
    echo "ping_rtt_avg{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_rtt_avg" >>tmp_push_prometheus.txt
  done

  if [ -f tmp_push_prometheus.txt ]; then
    # Delete the old data first so stale series cannot cause anomalies, then push.
    # (Deleting after the push would remove the samples that were just uploaded.)
    del_value "$1"
    push_prometheus "$1"
    # Remove the temporary file
    rm -f tmp_push_prometheus.txt
  else
    echo "temporary file was not generated"
  fi
}

while true; do
  ips=(172.16.11.198 127.0.0.1 baidu.com)
  ipLabels=(test localhost baidu)
  func x

  ips=(102.932.115.447)
  ipLabels=(law)
  func y
  sleep 15
done
```

* Getting ping metrics with smokeping

  1. Set up smokeping

```
docker run --name smokeping -d --rm -p 8888:80 \
  -e PUID=1000 -e PGID=1000 \
  -v /data/smokeping/data:/data/ \
  -v /data/smokeping/config:/config \
  -e TZ=Asia/Shanghai \
  linuxserver/smokeping
```

  2.
Configure smokeping

```
cat config/Database
step = 60
pings = 60

cat config/Targets
*** Targets ***

probe = FPing

menu = Top
title = Network Latency Grapher
remark = Welcome to the SmokePing website of WORKS Company. \
         Here you will learn all about the latency of our network.

+ targets
menu = Targets

++ baiduURL
menu = baidu URL
title = baidu URL server
host = www.baidu.com

++ test
menu = test
title = test
host = 172.16.11.198

# Restart the smokeping container and check its logs to make sure it is running normally
```

  3. Get the metrics

```python
# From the command line (the output format is not fully understood yet):
#   rrdtool fetch ./data/targets/test.rrd AVERAGE
#
# From a Python 3 script (the part that pushes the metrics is not written yet):
# collection_to_prometheus.py
# coding: utf-8
import os

import rrdtool

paras = {
    'prometheus_gateway': 'http://192.168.56.101:9091',
    # rrdtool data directory; the files live under /etc/smokeping/data/targets/*.rrd
    'data_dir': '/etc/smokeping/data/',
}


# Read the metrics through rrdtool
def getMonitorData(rrd_file):
    rrd_info = rrdtool.info(rrd_file)
    # Start one step (60 s) before the last update
    last_update = rrd_info['last_update'] - 60
    results = rrdtool.fetch(rrd_file, 'AVERAGE', '-s', str(last_update))
    lost_package_num = results[2][0][1]
    # Convert seconds to milliseconds; treat a missing value as 0
    average_rtt = 0 if not results[2][0][2] else results[2][0][2] * 1000
    return lost_package_num, round(average_rtt, 4)


if __name__ == '__main__':
    ISP_list = ['targets']
    for ISP in ISP_list:
        rrd_data_dir = os.path.join(paras['data_dir'], ISP)
        for filename in os.listdir(rrd_data_dir):
            (instance, postfix) = os.path.splitext(filename)
            if postfix == '.rrd':
                (lost_package_num, rtt) = getMonitorData(os.path.join(paras['data_dir'], ISP, filename))
                print(rrd_data_dir, instance, rtt, lost_package_num)
```

## Monitoring k8s from a Prometheus deployed on a physical host

Reference: https://www.jianshu.com/p/fc5624a30580

**The k8s cluster must have the metrics-server and kube-state-metrics services installed.**

1.
**Create the RBAC objects**

cat prom.rbac.yaml

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"   # ingresses live in this group on current Kubernetes releases
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
```

2. **Get the secret**

```shell
kubectl get sa prometheus -n kube-system -o yaml
kubectl describe secret prometheus-token-wj7fb -n kube-system
```

Save the token to a file named k8s.token.

3. **Configure Prometheus to collect cadvisor metrics**

```
- job_name: k8s-cadvisor
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:                        # kubernetes service discovery
    - api_server: https://172.16.11.198:6443    # apiserver address
      role: node                                # discover objects of type node
      bearer_token_file: k8s.token              # credentials for talking to the apiserver during discovery
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: k8s.token                  # credentials for the scrape itself
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: 172.16.11.198:6443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      action: replace
```

4.
**Configure Prometheus to collect pod metrics**

The pods to be scraped need annotations so that they can be discovered automatically.

```
- job_name: kubernetes-pods
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: https://172.16.11.198:6443
      role: pod
      bearer_token_file: k8s.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: k8s.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      separator: ;
      regex: "true"
      replacement: $1
      action: keep
    # Take the metrics path from prometheus.io/path
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: $1
      action: replace
    # Rewrite the scrape address to use the port from prometheus.io/port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      separator: ;
      regex: ([^:]+)(?::\d+)?;(\d+)
      target_label: __address__
      replacement: $1:$2
      action: replace
    - separator: ;
      regex: __meta_kubernetes_pod_label_(.+)
      replacement: $1
      action: labelmap
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      target_label: pod_name
      replacement: $1
      action: replace
```

YAML configuration of the pods being scraped

```
# For a pod to be collected automatically, metadata has to be added to the deployment manifest.
# In the deployment, under template -> metadata -> annotations, add:
prometheus.io/scrape: "true"
prometheus.io/port: "8081"                   # pod port
prometheus.io/path: "/actuator/prometheus"
# Prometheus looks for pods annotated with prometheus.io/scrape=true.
# When the annotation is present, Prometheus scrapes the metrics automatically.
```

5.
**Configure Prometheus to collect kubelet metrics**

```
- job_name: kubelet
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:                        # kubernetes service discovery
    - api_server: https://172.16.11.198:6443    # apiserver address
      role: node                                # discover objects of type node
      bearer_token_file: k8s.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: k8s.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - action: replace
      target_label: __address__
      replacement: 172.16.11.198:6443
    - action: replace
      source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics
```

## Prometheus federation

Reference: https://www.cnblogs.com/zyyang1993/p/16621158.html

# Data collection

## Service discovery

* File-based service discovery
* DNS-based service discovery
* API-based service discovery
* consul-based service discovery

### File-based service discovery

**file_sd_configs**

* Slightly better than static configuration; it does not depend on any platform or third-party service.
* The Prometheus server periodically loads target information from files, which can be JSON or YAML.

```
- job_name: 'prometheus'
  file_sd_configs:
    - files:
        - targets/prometheus*.yaml
      refresh_interval: 2m   # reload the targets defined in the files every 2 minutes (default: 5 minutes)

[root@localhost]# cat prometheus_target.yaml
- targets:
    - 172.1.1.1:9100
    - 172.1.1.2:9100
  labels:
    app: node-exporter
    job: node
```

### DNS-based service discovery

**dns_sd_configs**

DNS-based service discovery periodically queries a set of DNS domain names.

### API-based service discovery

**kubernetes_sd_configs**

### consul-based service discovery

**consul_sd_configs**

Prometheus uses consul to discover services automatically.
[Link](https://mp.weixin.qq.com/s?__biz=MzU4MjQ0MTU4Ng==&mid=2247490418&idx=1&sn=7c97b1cfdca17160743190178fa46111&chksm=fdb9146fcace9d79c054d8aa64d9c6825bb1620dbbde39ecf2b32a8bbdd0789937546e8722eb&scene=126&sessionid=1614246824&key=ee962fa21a68b6fee9bec60ae2cacfb0ce8fe19d372152024b982373821f8a25ffd5363ae65b758b3f068b728bba8c1a294573e03c83b17876faed8f8032b6174b34ea2fdb891169d82ebafca55bc6799db4676ee52e9ec13d69b46e976614a3c463bf78ba34dd730801251d658abdfb37b9f28afc27caaf2018e5bf7e0a8090&ascene=1&uin=MTYzNjg4OTIyMg%3D%3D&devicetype=Windows+10+x64&version=62090538&lang=zh_CN&exportkey=Ab4JHj22kp12Sp%2Bwnuxw8BQ%3D&pass_ticket=N0A8zEKP9fhjpm90MuhouL4kymC125dlY4h%2FUicqSkEgSjdBV6nB5K%2F1qZtjwD3B&wx_header=0)

#### consul binary deployment

Download: https://www.consul.io/downloads

Install: `unzip consul_1.9.2_linux_amd64.zip -d /usr/local/bin`

Start in development mode:

* `mkdir -pv /consul/data`
* `consul agent -dev -ui -data-dir=/consul/data -config-dir=/etc/consul/ -client=0.0.0.0`

Outside of testing, server mode should be used instead.

#### consul deployment on k8s

https://github.com/gupf0719/consul
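
Once a consul agent is running, consul-based discovery only finds services that have been registered with it. The following is a minimal sketch of such a registration and is not part of the original notes. It builds a payload for Consul's `/v1/agent/service/register` HTTP endpoint. The helper names `build_registration` and `register`, the service name, the target `172.1.1.1:9100`, the 15s check interval, and the agent URL `http://127.0.0.1:8500` are all assumptions chosen for illustration.

```python
import json
import urllib.request


def build_registration(name, address, port, tags=()):
    """Build a payload for Consul's PUT /v1/agent/service/register endpoint."""
    return {
        "ID": f"{name}-{address}-{port}",   # hypothetical ID scheme, unique per target
        "Name": name,
        "Address": address,
        "Port": port,
        "Tags": list(tags),
        # HTTP health check against the exporter's metrics endpoint (assumed path)
        "Check": {
            "HTTP": f"http://{address}:{port}/metrics",
            "Interval": "15s",
        },
    }


def register(consul_url, payload):
    """Send the registration to a consul agent and return the HTTP status code."""
    req = urllib.request.Request(
        f"{consul_url}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Example target: a node-exporter instance (address is illustrative)
payload = build_registration("node-exporter", "172.1.1.1", 9100, tags=["node"])
print(json.dumps(payload, indent=2))

# With a running agent (assumed to listen on 127.0.0.1:8500), uncomment:
# register("http://127.0.0.1:8500", payload)
```

On the Prometheus side, a job using `consul_sd_configs` with its `server` field pointed at the same agent should then pick up the registered service as a target.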