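# obs-thanos-plugins

**Repository Path**: HuaweiCloudDeveloper/obs-thanos-plugins

## Basic Information

- **Project Name**: obs-thanos-plugins
- **Description**: This project extends Thanos so that it supports OBS storage
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2022-10-10
- **Last Updated**: 2023-05-12

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## Background

Thanos is a tool, open-sourced by the Improbable team, for running Prometheus at large scale and improving its high availability.

Prometheus itself only supports single-node deployment and has no built-in cluster mode, so large-scale monitoring is usually built on Prometheus federation or by splitting metrics across services. The main pain point is storing historical monitoring data: local storage capacity is limited, so the data can only be written through the remote storage interface to a database that supports it.

The official Prometheus approach to high availability is to run several Prometheus instances that scrape the same targets, with a load balancer (LB) in front as the single entry point. The problem is that the data stored in the instances diverges. In particular, when one instance goes down and another takes over, the failed instance loses the monitoring data for its downtime, and once the LB forwards requests to it again the results are inconsistent.

Thanos addresses these problems. It aggregates and deduplicates data from multiple Prometheus instances, which supports horizontal scaling and improves high availability. It also supports storing historical monitoring data in **object storage**, which improves data reliability and reduces operational effort.

## Project Overview

Thanos currently does not support HuaweiCloud OBS storage, although it supports many other kinds of object storage. The goal of this project is to extend Thanos so that it supports OBS storage and to complete the related documentation, including but not limited to feature documentation, usage documentation, and contributing the code to the Thanos repository.

## Deployment

### Docker Deployment

#### Prerequisites

- go1.19
- prometheus v2.13 or later
- docker 20.10.0 or later
- git

#### Usage

**Step 1: Download the source code**

```bash
git clone https://gitee.com/HuaweiCloudDeveloper/obs-thanos-plugins.git
```

**Step 2: Build the Docker image**

```bash
cd obs-thanos-plugins
make docker-multi-stage
```

**Step 3: Deploy Prometheus**

Deploying Thanos on its own serves no purpose, so Prometheus must be deployed first. A single-node Prometheus is used here for demonstration.

1. Prepare the Prometheus configuration

```bash
cd /home
mkdir prometheus
cd prometheus
vi prometheus.yml
chmod 666 prometheus.yml
```

The permissions of prometheus.yml are set to 666 so that changes made on the host take effect in the file mounted into the container.

prometheus.yml:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ca
    replica: 1
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.1:9090']
```

In this demonstration Prometheus only scrapes itself; add further entries under scrape_configs for the services you have deployed.

external_labels must also be declared. They identify this instance globally; here it is labeled as replica 1 of cluster ca.

2. Start the Prometheus container

```bash
docker run -d --net=host --rm \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v $(pwd)/prometheus_data:/prometheus \
    -u root \
    --name prometheus \
    quay.io/prometheus/prometheus:v2.38.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.listen-address=:9090 \
    --web.enable-lifecycle \
    --storage.tsdb.min-block-duration=2h \
    --storage.tsdb.max-block-duration=2h \
    --storage.tsdb.retention.time=30d \
    --web.enable-admin-api
```

- web.enable-lifecycle enables hot reloading of the configuration
- storage.tsdb.retention.time keeps TSDB data for 30 days
- storage.tsdb.min-block-duration and storage.tsdb.max-block-duration must be set to the same value so that the Thanos sidecar can upload blocks; equal values disable local compaction, and here one block is produced every 2h
- web.enable-admin-api enables the admin API, which the sidecar uses to fetch Prometheus metadata

**Step 4: Deploy the Thanos sidecar**

The Thanos sidecar component must run on the same node as Prometheus. It reads the Prometheus data and uploads TSDB blocks to the object storage bucket.

1. Prepare the object storage configuration

```bash
vi obs_config.yml
```

OBS is used as the object storage here, with the following configuration:

```yaml
type: OBS
config:
  bucket: "Your Bucket Name"
  endpoint: "Your Endpoint"
  access_key: "Your AK"
  secret_key: "Your SK"
```

2. Start the sidecar container

```bash
docker run -d --net=host --rm \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v $(pwd)/obs_config.yml:/etc/prometheus/obs_config.yml \
    -v $(pwd)/prometheus_data:/prometheus \
    --name thanos-sidecar \
    -u root \
    thanos:latest \
    sidecar \
    --http-address 0.0.0.0:19090 \
    --grpc-address 0.0.0.0:19190 \
    --reloader.config-file /etc/prometheus/prometheus.yml \
    --reloader.watch-interval=1m \
    --objstore.config-file /etc/prometheus/obs_config.yml \
    --tsdb.path /prometheus \
    --prometheus.url http://localhost:9090
```

- objstore.config-file sets the object storage configuration file
- tsdb.path must point to the Prometheus data directory

3. Modify the Prometheus configuration

```bash
vi prometheus.yml
```

Add scrape jobs for the sidecar, query, and store components. The reloader detects the change and hot-reloads the configuration automatically.

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ca
    replica: 1
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.1:9090']
  - job_name: 'sidecar'
    static_configs:
      - targets: ['172.17.0.1:19090']
  - job_name: 'query'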
    static_configs:
      - targets: ['172.17.0.1:29090']
  - job_name: 'store'
    static_configs:
      - targets: ['172.17.0.1:39090']
```

**Step 5: Deploy Thanos query**

```bash
docker run -d --net=host --rm \
    --name querier \
    thanos:latest \
    query \
    --http-address 0.0.0.0:29090 \
    --query.replica-label replica \
    --store 172.17.0.1:19190 \
    --store 172.17.0.1:39190
```

Open port 29090 on the host in a browser to run queries in the Thanos query web UI.

The first `--store` address is the gRPC endpoint exposed by the sidecar.

The address ending in :39190 is the gRPC endpoint of the store component deployed in the next step.

**Step 6: Deploy Thanos store**

```bash
docker run -d --net=host --rm \
    -v $(pwd)/obs_config.yml:/etc/prometheus/obs_config.yml \
    -v $(pwd)/store_data:/prometheus/store_data \
    --name store \
    thanos:latest \
    store \
    --objstore.config-file /etc/prometheus/obs_config.yml \
    --http-address 0.0.0.0:39090 \
    --grpc-address 0.0.0.0:39190 \
    --data-dir /prometheus/store_data
```

The store component fetches the persisted data from OBS so that query can serve historical data.

### CCE Deployment

#### Prerequisites

- go1.19
- prometheus v2.13 or later
- docker 20.10.0 or later
- git

#### Usage

> Steps 1 to 3 build the image locally. If you use the image we published on [dockerhub](https://hub.docker.com/r/pengcss/thanos/tags), you can skip directly to Step 4.
>
> dockerhub image: pengcss/thanos:0.32.0

**Step 1: Download the source code**

```bash
git clone https://gitee.com/HuaweiCloudDeveloper/obs-thanos-plugins.git
```

**Step 2: Build the Docker image**

```bash
cd obs-thanos-plugins
make docker-multi-stage
```

**Step 3: Upload the image to the [SoftWare Repository for Container (SWR)](https://www.huaweicloud.com/product/swr.html)**

```sh
# Log in with the temporary login command
docker login -u ${username} -p ${password} ${region}.myhuaweicloud.com
docker build -t ${image_name}:${version} .
docker tag ${image_name}:${version} ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
docker push ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
```

For details, see [SWR](https://console.huaweicloud.com/swr/?region=cn-north-4#/swr/dashboard)

**Step 4: Purchase a CCE cluster**

| Item            | Specification                                             |
| --------------- | --------------------------------------------------------- |
| Billing mode    | Pay-per-use                                               |
| Network model   | VPC                                                       |
| Node parameters | (General computing \| s6.xlarge.2 \| 4vCPUs \| 8GiB) x2   |
| Node OS         | Euler2.5                                                  |

**Step 5: Prepare the Prometheus ConfigMap**

Here the ConfigMap is deployed to the CCE cluster as YAML. All manifests in the following steps use the `thanos` namespace, so that namespace must exist in the cluster before they are applied. Prepare the file prometheus-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: thanos
data:
  prometheus.yaml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
      external_labels:
        cluster: prometheus-ha
        prometheus_replica: $(POD_NAME)
    rule_files:
      - /etc/prometheus/rules/*rules.yaml
    scrape_configs:
      - job_name: cadvisor
        metrics_path: /metrics/cadvisor
        scrape_interval: 10s
        scrape_timeout: 10s
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
```

> Scraping cadvisor container metrics is only an example; adjust the scrape configuration to your needs.

Connect to the CCE cluster and run:

```sh
kubectl apply -f prometheus-config.yaml
```

**Step 6: Prepare the Thanos object storage configuration**

Prepare the file objstore.yaml:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objectstorage
  namespace: thanos
type: Opaque
stringData:
  objectstorage.yaml: |-
    type: OBS
    config:
      bucket: "Your Bucket Name"
      endpoint: "Your Endpoint"
      access_key: "Your AK"
      secret_key: "Your SK"
```

```sh
kubectl apply -f objstore.yaml
```

**Step 7: Bind RBAC permissions to Prometheus**

Prepare prometheus-auth.yaml:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: thanos
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: thanos
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: thanos
roleRef:
  kind: ClusterRole
  name: prometheus
  apiGroup: rbac.authorization.k8s.io
```

```sh
kubectl apply -f prometheus-auth.yaml
```

**Step 8: Deploy Prometheus and the sidecar**

Prepare prometheus.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-headless
  namespace: thanos
  labels:
    app.kubernetes.io/name: prometheus
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app.kubernetes.io/name: prometheus
  ports:
    - name: web
      protocol: TCP
      port: 9090
      targetPort: web
    - name: grpc
      port: 10901
      targetPort: grpc
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: thanos
  labels:
    app.kubernetes.io/name: prometheus
spec:
  serviceName: prometheus-headless
  podManagementPolicy: Parallel
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/name: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - prometheus
              topologyKey: kubernetes.io/hostname
      containers:
        - name: prometheus
          image: ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
          args:
            - --config.file=/etc/prometheus/config_out/prometheus.yaml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=10d
            - --web.route-prefix=/
            - --web.enable-lifecycle
            - --storage.tsdb.no-lockfile
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --log.level=debug
          ports:
            - containerPort: 9090
              name: web
              protocol: TCP
          livenessProbe:
            failureThreshold: 6
            httpGet:
              path: /-/healthy
              port: web
              scheme: HTTP
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 3
          readinessProbe:
            failureThreshold: 120
            httpGet:
              path: /-/ready
              port: web
              scheme: HTTP
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 3
          volumeMounts:
            - mountPath: /etc/prometheus/config_out
              name: prometheus-config-out
              readOnly: true
            - mountPath: /prometheus
              name: prometheus-storage
            - mountPath: /etc/prometheus/rules
              name: prometheus-rules
        - name: thanos
          image: ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
          args:
            - sidecar
            - --log.level=debug
            - --tsdb.path=/prometheus
            - --prometheus.url=http://127.0.0.1:9090
            - --objstore.config-file=/etc/thanos/objectstorage.yaml
            # Template file: the prometheus.yaml key of the prometheus-config ConfigMap
            - --reloader.config-file=/etc/prometheus/config/prometheus.yaml
            # Rendered file with $(POD_NAME) substituted, read by the prometheus container
            - --reloader.config-envsubst-file=/etc/prometheus/config_out/prometheus.yaml
            - --reloader.rule-dir=/etc/prometheus/rules/
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus/config
            - name: prometheus-config-out
              mountPath: /etc/prometheus/config_out
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: prometheus-storage
              mountPath: /prometheus
            - name: thanos-objectstorage
              subPath: objectstorage.yaml
              mountPath: /etc/thanos/objectstorage.yaml
      imagePullSecrets:
        - name: default-secret
      volumes:
        # Must match the volumeMount name above and the ConfigMap created in Step 5
        - name: prometheus-config
          configMap:
            defaultMode: 420
            name: prometheus-config
        - name: prometheus-config-out
          emptyDir: {}
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: thanos-objectstorage
          secret:
            secretName: thanos-objectstorage
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
        labels:
          app.kubernetes.io/name: prometheus
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: csi-disk
        volumeMode: Filesystem
```

Change the image address to the address of the image you just uploaded to SWR. The StatefulSet also mounts a ConfigMap named `prometheus-rules`; it must exist in the `thanos` namespace, otherwise the pods cannot start.

```sh
kubectl apply -f prometheus.yaml
```

**Step 9: Deploy the Thanos query component**

Prepare thanos-query.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app.kubernetes.io/name: thanos-query
spec:
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
    - name: http
      port: 9090
      targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app.kubernetes.io/name: thanos-query
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app.kubernetes.io/name
                      operator: In
                      values:
                        - thanos-query
                topologyKey: kubernetes.io/hostname
              weight: 100
      containers:
        - args:
            - query
            - --log.level=debug
            - --query.auto-downsampling
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:9090
            - --query.partial-response
            - --query.replica-label=prometheus_replica
            - --query.replica-label=rule_replica
            - --store=dnssrv+_grpc._tcp.prometheus-headless.thanos.svc.cluster.local
            - --store=dnssrv+_grpc._tcp.thanos-rule.thanos.svc.cluster.local
            - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local
          image: ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
          livenessProbe:
            failureThreshold: 4
            httpGet:
              path: /-/healthy
              port: 9090
              scheme: HTTP
            periodSeconds: 30
          name: thanos-query
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 9090
              name: http
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: 9090
              scheme: HTTP
            periodSeconds: 5
          terminationMessagePolicy: FallbackToLogsOnError
      terminationGracePeriodSeconds: 120
      imagePullSecrets:
        - name: default-secret
```

As above, change the image address to the address of the image you just uploaded to SWR.

```sh
kubectl apply -f thanos-query.yaml
```

**Step 10: Deploy the Thanos store component**

Prepare thanos-store.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-store
  namespace: thanos
  labels:
    app.kubernetes.io/name: thanos-store
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: 10901
    - name: http
      port: 10902
      targetPort: 10902
  selector:
    app.kubernetes.io/name: thanos-store
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: thanos
  labels:
    app.kubernetes.io/name: thanos-store
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-store
  serviceName: thanos-store
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store
    spec:
      containers:
        - args:
            - store
            - --log.level=debug
            - --data-dir=/var/thanos/store
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/thanos/objectstorage.yaml
          image: ${region}.myhuaweicloud.com/${organization}/${image_name}:${version}
          livenessProbe:
            failureThreshold: 8
            httpGet:
              path: /-/healthy
              port: 10902
              scheme: HTTP
            periodSeconds: 30
          name: thanos-store
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: 10902
              scheme: HTTP
            periodSeconds: 5
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
            - mountPath: /var/thanos/store
              name: data
              readOnly: false
            - name: thanos-objectstorage
              subPath: objectstorage.yaml
              mountPath: /etc/thanos/objectstorage.yaml
      terminationGracePeriodSeconds: 120
      volumes:
        - name: thanos-objectstorage
          secret:
            secretName: thanos-objectstorage
      imagePullSecrets:
        - name: default-secret
  volumeClaimTemplates:
    - metadata:
        labels:
          app.kubernetes.io/name: thanos-store
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: csi-disk
```

```sh
kubectl apply -f thanos-store.yaml
```

The basic Thanos components are now deployed. The compact and ruler components can be deployed as needed and are not demonstrated here.

**Step 11: Expose the Thanos query endpoint externally**

The endpoint is exposed through nginx-ingress.

1. On the "Add-on Management" page of the CCE console, install the nginx-ingress add-on. Set the number of instances to 1 and choose a shared ELB load balancer.

![nginx-ingress](docs\img\nginx-ingress)

2. Go to the "Service Discovery" page, select the "Ingresses" tab on the right, and click "Create Ingress" in the upper right corner.

Choose nginx as the ingress controller and configure the routing rule.

Select thanos-query as the target service and 9090 as the access port.

![image-20230324173325527](docs\img\nginx-ingress-2)

Once it is created, the new Ingress appears in the Ingress list.
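
The console result can also be cross-checked from the command line. The following is a minimal sketch and not part of the original project: it assumes `kubectl` is configured for the CCE cluster, `curl` is installed, and the Ingress routes `/` to thanos-query. The function names and the `INGRESS_ADDR` variable are hypothetical. It relies only on the `/-/ready` path already used by the probes above and on the Prometheus-compatible `/api/v1/query` endpoint served by Thanos query.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking the Thanos query Ingress; names are illustrative.

# Build the readiness URL for a given address (host or host:port).
ready_url() {
  printf 'http://%s/-/ready\n' "$1"
}

# Build an instant-query URL.
# $1 = address, $2 = a simple PromQL expression without characters that need URL encoding
query_url() {
  printf 'http://%s/api/v1/query?query=%s\n' "$1" "$2"
}

# List the Ingress, then check readiness and run a trivial query.
# Stops at the first failing command and returns its exit status.
check_thanos_query() {
  local addr="$1"
  kubectl get ingress -n thanos &&
    curl -fsS "$(ready_url "$addr")" &&
    curl -fsS "$(query_url "$addr" up)"
}
```

After sourcing the file, a call such as `check_thanos_query "$INGRESS_ADDR"` would be expected to print the Ingress list followed by the two HTTP responses, provided the address is reachable from where the script runs.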

Open port 80 of the Ingress address in a browser to run queries in the Thanos query web UI.

## References

- [OBS](https://support.huaweicloud.com/bestpractice-obs/obs_05_1507.html)
- [Thanos](https://github.com/thanos-io/thanos)
- [Prometheus](https://github.com/prometheus/prometheus)

More to be added...