# prometheus

**Repository Path**: fearless11/prometheus

## Basic Information

- **Project Name**: prometheus
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: https://github.com/prometheus
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-01
- **Last Updated**: 2025-05-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Prometheues

## README

[toc]

# prometheus

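## Introduction

- https://prometheus.io/docs/introduction/overview/
- https://github.com/prometheus/prometheus
- Prometheus is an open-source monitoring and alerting solution. It was originally built at SoundCloud in 2012 and joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second hosted project after Kubernetes.
- Good fit: recording time-series data, with a reliable architecture that is easy to deploy. Examples: server monitoring, microservice monitoring.
- Poor fit: data that must be 100% accurate, for example request-count billing in financial systems.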

## Architecture

![arch-prome](./png/arch-prome.png)
![arch-prome-thanos](./png/arch-prome-thanos.png)
![arch-thanos](./png/arch-thanos.png)
![prome-block](./png/prome-block.png)

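## Terminology

- Data model: a time series is identified by a metric name plus key-value labels. Metric names must match `[a-zA-Z_:][a-zA-Z0-9_:]*`, for example `http_requests_total{method="post"} 500`.
- Metric types: counter only increases (e.g. request count); gauge can go up and down (e.g. temperature); histogram counts observations into buckets over a period, from which quantiles such as the 95th or 99th percentile response time can be estimated; summary reports quantiles computed on the client side over a time window (e.g. the 95th or 99th percentile response time).
- instance: a single scrape target
- job: a group of scrape targets
- sample = metric + labels + timestamp + float64 value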
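
The data model above can be sketched in a few lines of Python. This is only an illustration, not how Prometheus stores data internally: the `Sample` class and the helper names are hypothetical, and only the metric-name regex and the `http_requests_total` example come from the list above.

```python
import re
from dataclasses import dataclass

# Metric-name rule quoted in the terminology list.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


@dataclass(frozen=True)
class Sample:
    """Hypothetical container: metric + labels + timestamp + float64 value."""
    metric: str
    labels: tuple       # sorted (name, value) pairs
    timestamp_ms: int
    value: float


def is_valid_metric_name(name):
    """Return True when the whole name matches the naming rule."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def make_sample(metric, labels, timestamp_ms, value):
    """Build a Sample, rejecting metric names that break the naming rule."""
    if not is_valid_metric_name(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    return Sample(metric, tuple(sorted(labels.items())), timestamp_ms, float(value))


def series_id(sample):
    """Render the series identity, e.g. http_requests_total{method="post"}."""
    body = ",".join(f'{k}="{v}"' for k, v in sample.labels)
    return f"{sample.metric}{{{body}}}"
```

For example, `series_id(make_sample("http_requests_total", {"method": "post"}, 1700000000000, 500))` renders the same series identity as the example in the list.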

![prome-concept](./png/prome-concept.png)
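
To make the histogram type more concrete, here is a simplified sketch of how a quantile can be estimated from cumulative buckets by linear interpolation, which is the general idea behind PromQL's `histogram_quantile`. The function below is a stand-alone approximation written for this README, not the Prometheus implementation, and the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, mirroring the `le="+Inf"`
    bucket that a histogram exposes.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                      # position of the quantile among all observations
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                return prev_bound         # cannot interpolate into +Inf: use highest finite bound
            in_bucket = count - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, the 95th percentile falls in the `(0.5, 1.0]` bucket, so the estimate depends on how the bucket boundaries were chosen. That is the main trade-off against a summary, which computes its quantiles on the client.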

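## Technology Selection

`Does it cover the required features? Is the ecosystem mature? Is the community active? Is it flexible? Is it easy to maintain? Does it perform well?`

- Mature community support. Prometheus is open-source monitoring software with an active community, and it fits well with cloud-native environments.
- Mature ecosystem. Components work out of the box, and client SDKs are available for many languages.
- Easy to deploy and operate. Prometheus components are standalone binaries with no other dependencies.
- Pull-based collection. Monitoring data is pulled, which makes central management easier and helps control load on the central server.
- Powerful data model. Data can be defined along multiple dimensions, which makes aggregation and computation easy.
- Powerful query language, PromQL. It supports querying, computing, aggregating, and alerting on data.
- High performance. A single Prometheus instance can handle millions of metrics and process hundreds of thousands of samples per second.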

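## Workflow

- Data collection -> data processing -> alerting / visualization
- exporter/metrics -(consul)-> prometheus --> thanos --> alertmanager, grafana
- Data collection: exporters and pushgateway expose HTTP endpoints that Prometheus scrapes
- Data storage: local storage by default; remote write is supported
- Alerting: alerts can be grouped by dimension and time and temporarily silenced; PromQL can express period-over-period and year-over-year comparisons
- Visualization: Grafana supports Prometheus as a data source, and complex queries can be written in PromQL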
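
The data collection step can be illustrated with a minimal exporter sketch that uses only the Python standard library. It renders metrics in the Prometheus text exposition format and serves them at `/metrics`. The `METRICS` dictionary, the metric names, and port 8000 are assumptions made for this example; a real exporter would normally use an official client SDK instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric store: {(name, labels): value}.
METRICS = {
    ("http_requests_total", (("method", "post"),)): 500.0,
    ("temperature_celsius", ()): 21.5,
}


def render_metrics(metrics):
    """Render metrics as text exposition lines: name{label="value"} value."""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        if labels:
            body = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics so that Prometheus can scrape this process."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Block forever, serving metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be configured to scrape `http://<host>:8000/metrics`.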

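## Ecosystem

- exporters
  - node-exporter
  - blackbox-exporter
  - mysql-exporter
  - kafka-exporter
- prometheus: storage and computation. Kubernetes service discovery supports the roles node, service, pod, endpoints, and ingress
- thanos-sidecar: deployed alongside prometheus; uploads data to cos and serves real-time data to thanos-query
- thanos-store: reads data from cos and serves historical data to thanos-query
- thanos-query: reads from thanos-sidecar and thanos-store and gives grafana a single query endpoint
- thanos-compact: compacts the data in cos
- thanos-rules: centralized alerting rules
- alertmanager: alerting
- grafana: visualization
- push gateway: receives metrics pushed by short-lived jobs and exposes them for prometheus to scrape
- client SDK: lets applications expose metrics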
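
The split between thanos-sidecar (recent data) and thanos-store (historical data) can be pictured as a time-range lookup: thanos-query only needs the backends whose data overlaps the requested range. The sketch below is a conceptual illustration under that assumption. The class, the function, and the timestamps are hypothetical, and the real components talk gRPC rather than calling a Python function.

```python
from dataclasses import dataclass


@dataclass
class StoreBackend:
    """Hypothetical view of a backend and the time range it can serve."""
    name: str
    min_time: int   # oldest timestamp served, in ms
    max_time: int   # newest timestamp served, in ms


def select_backends(backends, start, end):
    """Return the names of backends whose range overlaps [start, end]."""
    return [b.name for b in backends if b.min_time <= end and b.max_time >= start]


# Made-up ranges: the sidecar holds recent data, the store holds uploaded blocks.
BACKENDS = [
    StoreBackend("thanos-sidecar", min_time=8000, max_time=10000),
    StoreBackend("thanos-store", min_time=0, max_time=8000),
]
```

A query for `[9000, 10000]` touches only the sidecar, a query for `[1000, 2000]` touches only the store, and a query that spans the boundary is answered by merging both.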

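## In Practice

- Feature verification

- Solution planning

  - High-availability architecture (cost, capacity)
  - Resources (cvm, k8s, haproxy, domain names, cos)
  - Two 32C64G cvm hosts can run 300 exporters; each exporter is allocated 215m, for a total of 64.5G
  - Deploy a consul cluster in a dedicated k8s namespace: 3 consul instances at 500m1G each
  - Deploy kong, pushgateway, grafana, and alertmanager in a dedicated k8s namespace
  - Deploy prometheus, thanos-sidecar, and the 七彩石 configuration center agent in a dedicated k8s namespace; size them by retention days, number of metrics, and storage capacity
  - Deploy thanos-store and thanos-query in a dedicated k8s namespace; size them by retention days and storage capacity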
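
For the "how many days, how many metrics, how much capacity" question, a rough estimate can follow the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes). The function below is a back-of-the-envelope sketch, and the input numbers in the usage note are hypothetical rather than measurements from this deployment.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough local-storage estimate for one Prometheus instance.

    bytes_per_sample defaults to 2.0, the upper end of the documented
    1-2 bytes per sample, so the result leans pessimistic.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


def to_gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes for easier reading."""
    return num_bytes / 1e9
```

For instance, `to_gigabytes(estimate_disk_bytes(1_000_000, 15, 15))` gives about 172.8 GB for one million active series scraped every 15 seconds and kept for 15 days.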

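- Scripting

  - Write k8s yaml files to deploy consul, kong, pushgateway, grafana, alertmanager, prometheus with its sidecar and the 七彩石 configuration center agent, thanos-store, and thanos-query
  - Write shell scripts that use docker to batch-install cadvisor, node-exporter, process-exporter, mysql-exporter, redis-exporter, kafka-exporter, pulsar-exporter, and haproxy-exporter and register them in consul; middleware registers its kong address in consul
  - Write shell scripts to deploy a federated prometheus on cvm hosts, which solves reporting across network segments
  - Write a golang tool that deploys the various exporters with docker and registers them with kong and consul; it wraps consul registration and exposes it as an interface
  - Configure grafana dashboards
  - Configure alertmanager alerting policies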
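
Registering an exporter in consul comes down to one HTTP call against the consul agent API (`PUT /v1/agent/service/register`). The sketch below only builds that request with the standard library. The consul address, service name, IP, and tag are placeholder values, and the ID scheme is an assumption made for this example.

```python
import json
import urllib.request


def build_register_request(consul_addr, name, address, port, tags=()):
    """Build (but do not send) a consul agent service-registration request."""
    payload = {
        "ID": f"{name}-{address}-{port}",   # assumed ID scheme, unique per endpoint
        "Name": name,
        "Address": address,
        "Port": port,
        "Tags": list(tags),
    }
    return urllib.request.Request(
        url=f"{consul_addr}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it is a matter of `urllib.request.urlopen(build_register_request("http://127.0.0.1:8500", "node-exporter", "10.0.0.5", 9100, ["prod"]))`, after which prometheus can discover the target through its consul service discovery.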

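- Deployment can be done through a pipeline: write Dockerfiles for consul, kong, pushgateway, and the other components, then build and deploy them
- Why does Prometheus consume memory? The more data it ingests, the more it holds in memory, and in-memory data is flushed to disk as a block every 2 hours. Poorly designed queries also load more data from disk into memory, for example grouping or `rate` over a large range
- What are the pain points of Prometheus? Node bandwidth, CPU, memory, and disk IO
- How can it be optimized? Drop unimportant metrics; lower the scrape frequency; set a shorter retention time
- How can load be spread across nodes? Deploy multiple Prometheus instances and split the writes by business line
- How is high availability ensured? Each Prometheus has a redundant replica that scrapes the same data
- How can all data be queried in a high-availability deployment? Queries are deduplicated and aggregated by thanos; alerts are deduplicated by alertmanager
- How is real-time data ensured? Replace the sidecar with Prometheus writing to a receiver, and have query read from the receiver
- How is cold data stored? A middleware component remote-writes the data to COS
- How is cold data queried? Implement the query API; possible optimizations are caching the TSDB, adding indexes, and reducing the cost of object storage requests
- How can queries over large time ranges be sped up? Compaction and downsampling
- What problems does thanos solve? Cheap long-term remote storage; unified deduplicated queries; compaction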
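
The deduplication answer above can be sketched as follows: two replicas scrape the same targets and differ only in a replica label, so the query layer drops that label and keeps one value per series and timestamp. This is a deliberately simplified model that assumes both replicas produce identical timestamps, which real replicas do not, and the actual thanos algorithm is more involved. The label name `replica` is an assumption.

```python
def deduplicate(samples, replica_label="replica"):
    """Merge samples from HA replicas into one stream per series.

    samples: iterable of (labels: dict, timestamp_ms: int, value: float).
    Returns a sorted list of (labels, timestamp_ms, value) without the
    replica label; when replicas overlap, the first value seen wins.
    """
    merged = {}
    for labels, ts, value in samples:
        identity = tuple(sorted(
            (k, v) for k, v in labels.items() if k != replica_label
        ))
        merged.setdefault((identity, ts), value)
    return [
        (dict(identity), ts, value)
        for (identity, ts), value in sorted(merged.items())
    ]
```

If replica `a` misses a scrape that replica `b` captured, the merged stream still contains that point, which is what makes the redundant replica useful at query time.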

## References

- [arch](https://note.youdao.com/s/MnYLTujl)
- [prometheus.io](https://prometheus.io/)
- [Prometheus-configure](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
- [prometheus-configure-example](https://github.com/prometheus/prometheus/blob/release-2.15/config/testdata/conf.good.yml)
- [Prometheus documentation in Chinese](https://prometheus.wang/quickstart/)
- [PromQL-query](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [google-regex-syntax](https://github.com/google/re2/wiki/Syntax)
- [Prometheus in Practice](https://www.bookstack.cn/read/prometheus_practice/ha-prometheus.md)
- [Prometheus-book](https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/quickstart)
- [Kube-prometheus](https://runbooks.prometheus-operator.dev/)
- [alertmanager-configure-example](https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml)
- [alertmanager-configure-routing-tree-format](https://prometheus.io/webtools/alerting/routing-tree-editor/)
- [grafana-plugins](https://grafana.com/grafana/plugins/?orderBy=weight&direction=asc)
- [grafana-dashboard](https://grafana.com/grafana/dashboards/?pg=community&plcmt=learn)
- [grafana-online-demo](https://play.grafana.org/d/000000012/grafana-play-home?orgId=1)
- [thanos.io](https://thanos.io/tip/thanos/getting-started.md)
- [iQIYI Hao: Prometheus-based monitoring of microservice applications in practice](https://www.toutiao.com/article/6853605670407799303/?tt_from=copy_link&utm_campaign=client_share&timestamp=1595985449&app=news_article&utm_source=copy_link&utm_medium=toutiao_ios&use_new_style=1&req_id=202007290917290100140400933D79F5B4&group_id=6853605670407799303&wid=1654669650849)
- [A practical guide to avoiding Prometheus pitfalls](https://cloud.tencent.com/developer/news/629972)
- [Planning Prometheus storage usage](https://www.jianshu.com/p/93412a925da2)
- [Building a large cloud-native distributed monitoring system (1): Optimizing Prometheus at scale](https://www.bilibili.com/video/BV17C4y1x7HE)
- [Building a large cloud-native distributed monitoring system (2): Thanos architecture in detail](https://www.bilibili.com/video/BV1Vk4y1R7S9)
- [Building a large cloud-native distributed monitoring system (3): Thanos deployment and practice](https://www.bilibili.com/video/BV16g4y187HD)