# Dubhe-deploy_third

**Repository Path**: corner2007/Dubhe-deploy_third

## Basic Information

- **Project Name**: Dubhe-deploy_third
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 24
- **Created**: 2025-07-19
- **Last Updated**: 2025-07-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

v1: deep learning platform Kubernetes deployment

https://www.yuque.com/docs/share/256a6d45-fd3f-42ea-99d2-d41d7dd6819c?# *One-Click Deployment Script Usage*

# Dubhe Deep Learning Platform Deployment

## 1. Environment Preparation

1. Log in to the `192.168.1.101` server (it has external network access) and switch to the root user:

```bash
ssh root@192.168.1.101
#Root!sss
```

2. Check the nodes of the Kubernetes cluster that is already deployed:

```bash
[root@192.168.1.31 ~]# kubectl get nodes
NAME            STATUS                     ROLES    AGE    VERSION
192.168.1.101   Ready                      node     18d    v1.20.5   (GPU)
192.168.1.189   Ready                      node     60d    v1.20.5
192.168.1.31    Ready,SchedulingDisabled   master   102d   v1.20.5   (deployment node)
192.168.1.32    Ready,SchedulingDisabled   master   102d   v1.20.5
192.168.1.34    Ready,SchedulingDisabled   master   102d   v1.20.5
192.168.1.35    Ready                      node     102d   v1.20.5
192.168.1.36    Ready                      node     102d   v1.20.5
```

Deployment directory:

```bash
# 192.168.1.101
cd /data/deeplearning
```

Docker configuration:

```bash
[root@192.168.1.101 ~]# cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn", "https://hub-mirror.c.163.com", "https://192.168.1.64", "https://magic-harbor.magic.com"],
  "insecure-registries": ["magic-harbor.magic.com"],
  "data-root": "/data/docker_data",
  "hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"],
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "3",
    "max-size": "10m",
    "env": "os,customer",
    "labels": "somelabel"
  }
}
```

## Middleware Deployment

Harbor registry address:

```bash
https://magic-harbor.magic.com/harbor/projects
admin/Harbor12345
# https://harbor.genesismagic.coop/
```

### MinIO

This section describes the MinIO deployment. The frontend uses MinIO to provide object storage for users. It is recommended to deploy MinIO on the same machine as the NFS service to avoid network I/O bottlenecks caused by transferring and copying large batches of files.

#### Download the offline image

```bash
# pull the image:
docker pull minio/minio:RELEASE.2023-05-18T00-05-36Z

# save the image:
docker save > minio-2023-0518-release.tar minio/minio:RELEASE.2023-05-18T00-05-36Z

# upload the exported offline package to the /data/offline-images directory
mkdir -p /data/offline-images
cd /data/offline-images
mv /home/wangchao/softwares/minio-2023-0518-release.tar /data/offline-images/

# load the image:
docker load -i minio-2023-0518-release.tar

# list the images
[root@192.168.1.31 offline-images]# docker images | grep minio
minio/minio                            RELEASE.2023-05-18T00-05-36Z   1370947c8f2f   5 months ago    363MB
minio                                  2021.11.9-debian-10-r0         fbfaca6d1051   22 months ago   250MB
minio                                  2021.12.29-debian-10-r0        fbfaca6d1051   22 months ago   250MB
magic-harbor.magic.com/library/minio   2021.11.9-debian-10-r0         fbfaca6d1051   22 months ago   250MB
magic-harbor.magic.com/library/minio   2021.12.29-debian-10-r0        fbfaca6d1051   22 months ago   250MB

# tag the image
docker tag minio/minio:RELEASE.2023-05-18T00-05-36Z magic-harbor.magic.com/library/minio:RELEASE.2023-05-18T00-05-36Z

# push it to the registry
docker push magic-harbor.magic.com/library/minio:RELEASE.2023-05-18T00-05-36Z

# the push may fail; log in manually first
docker login magic-harbor.magic.com
# admin/Harbor12345 (enter the username and password)

# then push again
# docker push magic-harbor.magic.com/library/minio:RELEASE.2023-05-18T00-05-36Z
```

#### Deploy

Deploy MinIO with Docker.

```bash
mkdir -p /data/minio/config
mkdir -p /data/minio/data
```

One directory stores the configuration and the other stores uploaded files. Before starting the container, create the externally mounted MinIO configuration directory (/data/minio/config) and the directory for uploaded files (/data/minio/data).

```bash
docker run -p 9000:9000 -p 9090:9090 \
  --name=minio \
  -d --restart=always \
  -e "MINIO_ACCESS_KEY=admin" \
  -e "MINIO_SECRET_KEY=admin123" \
  -v /data/minio/data:/data \
  -v /data/minio/config:/root/.minio \
  magic-harbor.magic.com/library/minio:RELEASE.2023-05-18T00-05-36Z server \
  /data --console-address ":9090" -address ":9000"
```

Adjust the parameters to your needs:

- MINIO_ACCESS_KEY is the login name
- MINIO_SECRET_KEY is the password

Check the deployment:

```bash
# remove an image
# docker rmi image_id
# remove a container
# docker rm container_id
docker ps -a | grep minio
```

Check the deployment status:

```bash
[root@192.168.1.31 ~]# docker ps | grep minio
ddb0fe9e02b0   magic-harbor.magic.com/library/minio:RELEASE.2023-05-18T00-05-36Z   "/usr/bin/docker-ent…"   9 seconds ago   Up 8 seconds   0.0.0.0:9000->9000/tcp, 0.0.0.0:9090->9090/tcp   minio
```

Log in to the MinIO console: http://192.168.1.31:9000/
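As an optional sanity check that is not part of the original runbook, the MinIO client `mc` (assuming a recent release is already installed on the host) can be pointed at the new instance. The alias `local` and the bucket name `dubhe-test` below are arbitrary examples; the credentials are the MINIO_ACCESS_KEY/MINIO_SECRET_KEY values set above.

```bash
# register the deployment under a local alias (alias and bucket names are hypothetical)
mc alias set local http://192.168.1.31:9000 admin admin123
# create a test bucket and list buckets to confirm the service responds
mc mb local/dubhe-test
mc ls local
```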
### EFK Log Management System

In this project, cluster log management uses the combination of Elasticsearch, Fluent Bit, and Kibana, with the following division of work:

- Fluent Bit collects logs from Kubernetes and ships them to Elasticsearch.
- Elasticsearch stores the logs and provides a query interface.
- Kibana is used to browse and search the logs stored in Elasticsearch.

> Fluent Bit runs as a DaemonSet, so one instance is deployed on every node to collect that node's logs. Fluent Bit mounts the Docker log directory /var/lib/docker/containers and the /var/log directory into its Pod. On each node, new directories are created under /var/log/pods, which makes it possible to tell the log output of different containers apart; each of these directories contains a log file that is linked to the container log output under /var/lib/docker/containers.

#### Deployment Manifests

elasticsearch.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Elasticsearch"
spec:
  type: NodePort
  ports:
    - name: http
      nodePort: 32321
      port: 9200
      protocol: TCP
      targetPort: db
    - name: http-9300
      nodePort: 32322
      port: 9300
      protocol: TCP
      targetPort: transport
  selector:
    k8s-app: elasticsearch-logging
---
# RBAC authn and authz
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
    addonmanager.kubernetes.io/mode: Reconcile
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elasticsearch-logging
  labels:
    k8s-app: elasticsearch-logging
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - "services"
      - "namespaces"
      - "endpoints"
    verbs:
      - "get"
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kube-system
  name: elasticsearch-logging
  labels:
    k8s-app: elasticsearch-logging
    addonmanager.kubernetes.io/mode: Reconcile
subjects:
  - kind: ServiceAccount
    name: elasticsearch-logging
    namespace: kube-system
    apiGroup: ""
roleRef:
  kind: ClusterRole
  name: elasticsearch-logging
  apiGroup: ""
---
# Elasticsearch deployment itself
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
    version: v7.4.2
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  serviceName: elasticsearch-logging
  replicas: 3
  selector:
    matchLabels:
      k8s-app: elasticsearch-logging
      version: v7.4.2
  template:
    metadata:
      labels:
        k8s-app: elasticsearch-logging
        version: v7.4.2
    spec:
      serviceAccountName: elasticsearch-logging
      containers:
        - args:
            - -c
            - "sed -i 's#-Xms1g#-Xms4g#g;s#-Xmx1g#-Xmx4g#g' /usr/share/elasticsearch/config/jvm.options && echo -e 'http.cors.enabled: true\nhttp.cors.allow-origin: \"*\"' >> config/elasticsearch.yml && bin/run.sh"
          command:
            - /bin/bash
          #image: quay.io/fluentd_elasticsearch/elasticsearch:v7.4.2
          image: magic-harbor.magic.com/library/elasticsearch:v7.4.2
          name: elasticsearch-logging
          imagePullPolicy: IfNotPresent
          resources:
            # need more cpu upon initialization, therefore burstable class
            limits:
              cpu: 10000m
              memory: 8Gi
            requests:
              cpu: 1000m
              memory: 4Gi
          ports:
            - containerPort: 9200
              name: db
              protocol: TCP
            - containerPort: 9300
              name: transport
              protocol: TCP
          # livenessProbe:
          #   tcpSocket:
          #     port: transport
          #   initialDelaySeconds: 5
          #   timeoutSeconds: 10
          # readinessProbe:
          #   tcpSocket:
          #     port: transport
          #   initialDelaySeconds: 5
          #   timeoutSeconds: 10
          volumeMounts:
            - name: elasticsearch-logging
              mountPath: /data
          env:
            - name: "NAMESPACE"
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
        - name: elasticsearch-logging
          emptyDir: {}
      # Elasticsearch requires vm.max_map_count to be at least 262144.
      # If your OS already sets up this number to a higher value, feel free
      # to remove this init container.
      initContainers:
        - image: magic-harbor.magic.com/library/alpine:3.6
          command: ["/sbin/sysctl", "-w", "vm.max_map_count=262144"]
          name: elasticsearch-logging-init
          securityContext:
            privileged: true
```
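A minimal sketch for applying and verifying the Elasticsearch manifest (not part of the original text): the file name `elasticsearch.yaml`, the node IP `192.168.1.31`, and NodePort `32321` are taken from the document above; adjust them to your environment.

```bash
# apply the manifest and watch the pods come up
kubectl apply -f elasticsearch.yaml
kubectl -n kube-system get pods -l k8s-app=elasticsearch-logging -o wide
# query cluster health through the NodePort service defined above (port 32321)
curl -s http://192.168.1.31:32321/_cluster/health?pretty
```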
kibana.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Kibana"
spec:
  type: NodePort
  ports:
    - port: 5601
      protocol: TCP
      targetPort: ui
  selector:
    k8s-app: kibana-logging
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kibana-logging
  template:
    metadata:
      labels:
        k8s-app: kibana-logging
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      containers:
        - name: kibana-logging
          #image: docker.elastic.co/kibana/kibana-oss:7.2.0
          image: magic-harbor.magic.com/library/kibana-oss:7.2.0
          # image: magic-harbor.magic.com/library/kibana:7.6.2
          resources:
            # need more cpu upon initialization, therefore burstable class
            limits:
              cpu: 1000m
            requests:
              cpu: 100m
          env:
            - name: ELASTICSEARCH_HOSTS
              value: http://elasticsearch-logging:9200
            - name: SERVER_NAME
              value: kibana-logging
            # - name: SERVER_BASEPATH
            #   value: /api/v1/namespaces/kube-system/services/kibana-logging/proxy
            - name: SERVER_REWRITEBASEPATH
              value: "false"
          ports:
            - containerPort: 5601
              name: ui
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /api/status
              port: ui
            initialDelaySeconds: 5
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/status
              port: ui
            initialDelaySeconds: 5
            timeoutSeconds: 10
```
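The Kibana Service above does not pin a nodePort, so the assigned port has to be looked up after applying the manifest before opening the UI. A minimal check, assuming the manifest is saved as `kibana.yaml`, might look like this:

```bash
kubectl apply -f kibana.yaml
# note the NodePort that gets mapped to container port 5601
kubectl -n kube-system get svc kibana-logging
kubectl -n kube-system get pods -l k8s-app=kibana-logging
# then browse to http://<node-ip>:<assigned-nodeport>
```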
fluent-bit.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/annotation*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/tadl*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/train*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/serving*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/modelopt*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/batchserving*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/pointcloud*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/data-rn*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1

  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off

  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           kube.*
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Index           kubelogs
        Replace_Dots    On
        Retry_Limit     False
        Type            doc

  parsers.conf: |
    [PARSER]
        Name        apache
        Format      regex
        Regex       ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?