From aa2b49848e2a945367853bd57b36223dc2e4eab9 Mon Sep 17 00:00:00 2001
From: lihengwei
Date: Tue, 12 Mar 2024 16:27:56 +0800
Subject: [PATCH] feature: adjust README.en.md

Signed-off-by: lihengwei
---
 README.en.md | 1021 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 997 insertions(+), 24 deletions(-)

diff --git a/README.en.md b/README.en.md
index 3ff995f..cb7073a 100644
--- a/README.en.md
+++ b/README.en.md
@@ -1,36 +1,1009 @@
 # gala-docs
-#### Description
-Handbook and requirements documentation
+#### Introduction
-#### Software Architecture
-Software architecture description
+![](./png/logo.png)
-#### Installation
+gala is an AI-assisted operating system sub-health diagnosis tool with a C/S architecture. Built on non-intrusive observation technologies (eBPF plus a Java agent) and assisted by AI, it diagnoses sub-health faults (such as performance jitter, increased error rates, and system freezes) within minutes and simplifies the operation and maintenance of IT infrastructure.
-1. xxxx
-2. xxxx
-3. xxxx
+# Background
-#### Instructions
+In recent years, with the adoption of cloud native, serverless, and other technologies, operating and maintaining cloud infrastructure has become increasingly complex. Sub-health problems in particular (intermittent occurrence, short duration, many problem types, wide impact, etc.) pose a major challenge to cloud infrastructure fault diagnosis, and the difficulties of sub-health diagnosis (observability, management of massive data, generalization of AI algorithms, etc.) are especially prominent in Linux scenarios. In the openEuler open source operating system, existing operation and maintenance methods are not sufficient to detect and locate sub-health problems in time: they lack online, continuous monitoring, fine-grained observation from the application perspective, and automation and AI analysis based on full-stack observation data. The key difficulties in diagnosing sub-health faults are:
-1. xxxx
-2. xxxx
-3. xxxx
+- Full-stack, non-intrusive observability.
+- Continuous, fine-grained, low-overhead monitoring.
+- Anomaly detection that adapts to different application scenarios, together with visual fault derivation.
-#### Contribution
+# Project Description
-1. Fork the repository
-2. Create Feat_xxx branch
-3. Commit your code
-4. Create Pull Request
+The overall architecture of gala is shown in the figure below; it is a C/S architecture. On each production node, gala-gopher runs as a Linux daemon that collects full-scenario, full-stack data (Metrics, Events, Tracing, etc.) and passes it to the management node through OpenTelemetry-compatible open interfaces (prometheus exporter, kafka client, etc.). The management node runs the gala-spider and gala-anteater components, which are responsible for cluster topology computation and visual root cause derivation respectively. The gala architecture relies on some open source middleware (prometheus, kafka, Elastic, etc.), but it can also connect to the middleware that already exists in the customer's IT system. gala is designed to be integrable into industry customers' IT operation and maintenance systems. It provides two types of integration methods:
+
+- Software ecosystem integration: use only the gala-gopher observability capability (obtaining data through the OpenTelemetry-style interfaces), or use all capabilities and obtain observation data, anomaly detection results, and visual derivation results through middleware such as prometheus, Elastic, and kafka (see the example below).
+- Tool integration: integrate the capabilities provided by gala into the customer's IT operation and maintenance system in the form of Grafana dashboards.
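+For example, with the software ecosystem integration method, any prometheus-compatible collector can scrape the data that gala-gopher exposes. A minimal manual check is sketched below, assuming a production node at the placeholder address 192.168.0.2 and the default gala-gopher exporter port 8888 (adjust the address, port, and metrics path to your deployment):
+
+```
+# Pull metrics from the gala-gopher exporter once by hand to confirm that data is flowing.
+# 192.168.0.2 is a placeholder production-node IP; 8888 is the default exporter port.
+curl -s http://192.168.0.2:8888 | head -n 20
+```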
-#### Gitee Feature
-1. You can use Readme\_XXX.md to support different languages, such as Readme\_en.md, Readme\_zh.md
-2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
-3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
-4. The most valuable open source project [GVP](https://gitee.com/gvp)
-5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
-6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
+![](./png/gala-arch.png)
+
+gala provides customers with the following operation and maintenance capabilities:
+
+- Online application performance jitter diagnosis: online performance diagnosis for applications such as databases, covering network problems (packet loss, retransmission, delay, TCP zero window, etc.), I/O problems (slow disks, I/O performance degradation, etc.), scheduling problems (sys CPU surges, deadlocks, etc.), memory problems (OOM, leaks, etc.), and more.
+- System performance bottleneck diagnosis: diagnosis of TCP and I/O performance jitter in general scenarios.
+- System hazard inspection: second-level inspection of kernel protocol stack packet loss, virtualized network packet loss, TCP exceptions, I/O delay exceptions, system call exceptions, resource leaks, JVM exceptions, application RPC exceptions (error rate, delay, etc. for 8 common protocols), hardware faults (UCE, disk media errors, etc.), and more.
+- System full-stack I/O observability: full-stack I/O observation for distributed storage scenarios, including process-level and block-layer I/O observation in the GuestOS, storage front-end I/O observation in the virtualization layer, and I/O observation in the distributed storage back end.
+- Refined performance profiling: multi-dimensional (system, process, container, Pod, etc.), high-precision (10 ms sampling period) flame graphs and timeline charts covering CPU usage, memory usage, resource usage, system calls, and so on, collected online, continuously, and in real time.
+- K8S Pod full-stack observability and diagnosis: real-time topology of Pod business flows from the K8S perspective, Pod performance observation, DNS observation, SQL observation, and more.
+
+The key technologies involved in gala include the following:
+
+- Integrated non-intrusive observation technology: combines the strengths of different observation technologies such as eBPF and Java agent to observe multiple languages (mainstream languages such as C/C++, Java, and Go) across the full software stack (the kernel, system calls, the glibc base library, the JVM runtime, base middleware such as Nginx/Haproxy, and so on).
+- Process topology: Based on time-series data (L4/L7 layer traffic, etc.), real-time calculation generates time-series topology structure and dynamically displays business cluster topology changes. +- Visualized root cause location: The statistical inference model combines the full process topology to achieve visual and minute-level problem root cause diagnosis. + +# Application scenarios + + gala is mainly oriented to scenarios in Linux environments such as openEuler, including databases, distributed storage, virtualization, cloud native and other scenarios. Help customers in finance, telecommunications, Internet and other industries to achieve minute-level diagnosis of sub-health faults based on full-stack observability. + +# Project code warehouse + +https://gitee.com/openeuler/gala-gopher + +https://gitee.com/openeuler/gala-spider + +https://gitee.com/openeuler/gala-anteater + +# Quick installation + +## architecture + +gala has a C/S architecture and can be deployed in a cluster or on a stand-alone basis. The entire architecture consists of gala-gopher and gala-ops. In cluster mode, gala-gopher is installed in the production node, and gala-ops is installed in the management node; in stand-alone mode, both are installed in the production node. + +Among them, gala-ops software includes gala-spider, gala-anteater, and gala-inference components. + +![](./png/csp_arch.png) + +## gala-gopher + +### positioning + +- Data collector: Provides low-level data collection of application granularity, including collection of system indicators in network, disk I/O, scheduling, memory, security, etc., and is also responsible for collection of application KPI data. Data types include logging, tracing, and metrics. +- System anomaly detection: Provides system anomaly detection capabilities, covering scenario system anomalies in network, disk I/O, scheduling, memory, etc. Users can set the upper and lower limits of exceptions through thresholds. +- Performance hotspot analysis: Provides CPU, memory, and IO flame graphs. + +### Principles and terminology + +gala-gopher software architecture reference here, it is a low-load probe framework based on eBPF technology. In addition to its own data collection, users can freely extend third-party probes. + +**terminology** + +- Probe: A program in gala-gopher that performs specific data collection tasks, including native and extend probes. The former starts the data collection task separately in thread mode, and the latter starts the data collection task in sub-process mode. gala-gopher can start some or all probes through configuration modification. +- Observation entity (entity_name): used to define the observation object in the system. All data collected by the probe will be attributed to a specific observation entity. Each observation entity is composed of key, label (optional), and metrics. For example, the key of the tcp_link observation entity includes information such as process number, IP quintuple, protocol family, etc., and metrics includes running status indicators such as tx, rx, rtt, etc. In addition to the natively supported observation entities, gala-gopher can also extend observation entities. +- Data table (table_name): The observation entity is composed of one or more data tables. Usually one data table is completed by one collection task. It can be seen that a single observation entity can be completed by multiple collection tasks. +- Meta file: Define observation entities (including internal data tables) through files. 
Meta files in the system must be unique and their definitions must not conflict; see the specification here.
+
+### Supported technologies
+
+Collection scope: refer here. Covers RED (Request, Error, Delay) observation of the kernel and basic software, including network, I/O, memory, network card, scheduling, Redis, kafka, Nginx, and more.
+
+System exception scope: refer here. Covers automatic inspection and reporting of more than 60 hidden system risk points, including TCP, Socket, process/thread, I/O, scheduling, etc.
+
+### Installation and use
+
+Refer here.
+
+### Expanding the scope of data collection
+
+Users who want to expand the scope of data collection only need to perform two steps: define observation entities and integrate data probes.
+
+- **Define observation entities**
+
+Define a new observation entity (or update an existing one) to carry the newly collected metrics data. The user defines the key, label (optional), and metrics of the observation entity through the meta file (refer here). Once defined, the meta file is placed in the probe directory.
+
+- **Integrate data probes**
+
+Users can wrap the data collection software in various programming languages (shell, python, java, etc.) and have the script output the collected data through a Linux pipe in the format defined by the meta file; refer here.
+
+See the cAdvisor third-party probe integration case for reference.
+
+## gala-spider
+
+### positioning
+
+- Topology map construction: provides OS-level topology map construction. It periodically obtains the data of all observation object instances collected by gala-gopher, calculates the topological relationships between them, and saves the generated topology map into the arangodb graph database.
+
+### Principles and terminology
+
+Refer here.
+
+### Supported technologies
+
+**Supported topological relationship types**
+
+There are often physical or logical relationships between OS observation entities; for example, threads belong to processes, and processes often have connections between them. gala-spider therefore defines some common topological relationship types. For details, see the gala-spider design document: Relationship type definition. Once the relationship types are defined, the topological relationships between observed entities can be defined and a topology diagram can be built.
+
+**List of supported entity relationships**
+
+gala-spider defines some topological relationships between observation entities by default. These relationships are configurable and extensible. For details, see the gala-spider design document: Supported topological relationships.
+
+### Installation and use
+
+Refer here.
+
+### Extending observation entities and relationships
+
+Refer here.
+
+## gala-anteater
+
+### positioning
+
+- Anomaly detection: provides minute-level anomaly detection for the operating system, promptly detecting system-level anomalies that may affect client latency and helping operation and maintenance personnel track down and resolve problems quickly.
+- Anomaly reporting: when abnormal behavior is discovered, it is reported to kafka in real time; operation and maintenance personnel only need to subscribe to the kafka message queue to know whether the current system is at risk (a consumption sketch follows this list).
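+As a sketch of what that subscription can look like from the command line, assume the management node's kafka broker is reachable at the placeholder address 192.168.0.100:9092 and that gala-anteater publishes anomaly events to a topic named, for illustration, gala_anteater_hybrid_model (substitute the broker address and the topic name configured in your deployment):
+
+```
+# Consume gala-anteater anomaly events with the console consumer that ships with
+# Apache Kafka. The install path, broker address, and topic name are placeholders.
+/opt/kafka/bin/kafka-console-consumer.sh \
+  --bootstrap-server 192.168.0.100:9092 \
+  --topic gala_anteater_hybrid_model \
+  --from-beginning
+```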
+
+### Principles and terminology
+
+gala-anteater is an AI-based anomaly detection platform for the operating system. It covers time-series data preprocessing, anomaly point discovery, and anomaly reporting. Based on offline pre-training plus incremental learning and online model updating, it adapts well to fault diagnosis over multi-dimensional, multi-modal data.
+
+- Fundamentals
+
+  By combining offline and online phases, online learning is used so that the model is trained offline, updated online, and then applied to online anomaly detection.
+
+  **Offline**: First, an offline historical KPI data set is preprocessed and feature-selected to obtain a training set; the training set is then used to train and tune an unsupervised neural network model (such as a variational autoencoder). Finally, the best model is selected with a manually labeled test set.
+
+  **Online**: The model trained offline is deployed online, real online data sets are used for online training and parameter tuning, and the trained model then performs real-time anomaly detection in the online environment.
+
+  ![](./png/anteater_arch.png)
+
+### Installation and use
+
+Refer here.
+
+## gala-inference
+
+### positioning
+
+- Root cause location: provides root cause location for abnormal KPIs. It takes the anomaly detection results and the topology map as input and outputs the root cause location results to kafka.
+
+### Principles and terminology
+
+Refer here.
+
+### Supported technologies
+
+**Expert rules**
+
+To improve the accuracy and interpretability of root cause location results, we analyzed real causal relationships between observed entities in the operating system domain and summarized a set of general expert rules that guide the subsequent root cause location algorithm. Details of these rules can be found in the gala-inference design document: Expert Rules.
+
+### Installation and use
+
+Refer here.
+
+## gala system integration
+
+gala also relies on some open source software, including kafka, arangodb, and prometheus. The figure below shows the gala system integration relationships: kafka transmits logging/tracing data to ES/logstash/jaeger, prometheus stores Metrics data, arangodb stores real-time topology data, and grafana provides the front-end page display.
+
+![](./png/system_integration.png)
+
+## gala system installation
+
+gala provides an integrated deployment tool, Gala-Deploy-Tools, that lets users quickly deploy the gala-gopher and gala-ops (gala-spider/gala-inference/gala-anteater) components, the kafka/prometheus/arangodb/es/logstash/pyroscope middleware, and the grafana front-end display components. Both offline and online deployment modes are supported.
+
+- kafka transmits gala software data
+- prometheus stores gopher metrics data
+- arangodb stores the real-time topology data generated by gala-spider
+- elasticsearch/logstash store gala data and support the grafana front-end display
+- pyroscope stores gopher flame graph data
+- grafana displays the gala front-end pages
+
+### Constraints
+
+1. Currently, this tool only supports the following OS versions: openEuler 20.03 LTS SP1, openEuler 22.03 LTS, openEuler 22.03 LTS SP1, Kylin V10 SP1 (x86), Kylin V10 SP3 (x86).
+2. In online deployment mode, the tool installs rpm packages from the openEuler repo source or downloads source code from the external network while it runs. In an internal network environment, a proxy therefore needs to be configured in advance so that the external network can be reached; it is recommended to remove the proxy after the tool has finished (a minimal sketch follows this list).
+3. In offline deployment mode, the offline installation packages and their dependencies still need to be downloaded from the external network. The same applies: configure a proxy in advance in an internal network environment, and remove it after using the tool.
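+A minimal sketch of the proxy handling described in constraints 2 and 3, assuming an ordinary HTTP proxy at the placeholder address proxy.example.com:8080 (this is standard Linux environment-variable configuration, not something specific to the deployment tool):
+
+```
+# Point downloads at the proxy before running the deployment tool ...
+export http_proxy=http://proxy.example.com:8080
+export https_proxy=http://proxy.example.com:8080
+export no_proxy=localhost,127.0.0.1
+
+# ... run download_offline_res.sh / deploy.sh here ...
+
+# ... and drop the proxy again once the tool has finished.
+unset http_proxy https_proxy no_proxy
+```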
+
+### Environment preparation
+
+Prepare at least two machines (physical machines or virtual machines) that meet the OS version and architecture requirements (see constraint 1), and make sure the machines can reach each other over the network (in online deployment mode, the external network must also be reachable).
+
+- Machine A: the production node, that is, the target node to be monitored and operated on. Business processes (such as databases, redis, and Java applications) generally run on it, and it is where the observation component gala-gopher is deployed.
+
+  ***Note: If there are multiple production nodes, gala-gopher needs to be deployed on each of them.***
+
+- Machine B: the management node, used to deploy middleware such as kafka as well as gala's anomaly detection and root cause location components. The placement of these components is flexible: several management nodes can be prepared and the components deployed separately, as long as the nodes can reach each other over the network.
+
+  ***Note: It is recommended that the management node have at least 8 CPU cores and 8 GB of memory (8U8G).***
+
+### Offline deployment
+
+The gala components depend on various middleware, so it is recommended to install and deploy in the following order: middleware -> gala-gopher/gala-ops -> grafana.
+
+#### Management node: deploy middleware
+
+The middleware currently involved comprises six components: kafka, prometheus, arangodb, elasticsearch, logstash, and pyroscope. elasticsearch and logstash depend on each other and need to be deployed together.
+
+1. Download the offline installation packages
+
+Before offline deployment, the six middleware installation packages need to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_middleware subdirectory of the current directory.
+
+```
+sh download_offline_res.sh middleware [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the installation packages for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+Note: Since kafka depends on java, java-1.8.0-openjdk and its dependent packages are also downloaded together with the kafka installation package; the arangodb component is downloaded as a container image tar package, so docker needs to be installed on the downloading machine.
+
+2. One-click tool deployment
+
+Upload everything under gala_deploy_middleware, together with the deployment script and the auxiliary script, to the target management node, then execute the following command to install, configure, and start the kafka/prometheus/elasticsearch/logstash/arangodb/pyroscope services. The -K/-P/-E/-A/-p options deploy the corresponding components individually, and the -S option specifies the directory where the offline installation packages are located.
+
+```
+sh deploy.sh middleware -K -P -E -A -p -S
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| --------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
+| -K\|--kafka | Use this option to deploy the kafka server and configure the specified listening IP address (generally the IP of the current management node). When this option is not used, the kafka service is not deployed. | Required when you need to deploy the kafka service |
+| -P\|--prometheus | Use this option to deploy the prometheus server and configure the list of scrape source addresses (that is, the production nodes where gala-gopher is deployed), separated by commas. An address can be followed by ":port number" to specify the scrape port; when not specified, the default port 8888 is used. You can also prepend "hostname-" to an address to label it, for example: -P 192.168.0.1,192.168.0.2:18001,vm01-192.168.0.3:18002. When this option is not used, the prometheus service is not deployed. | Required when you need to deploy the prometheus server |
+| -A\|--arangodb | Use this option to deploy and start the arangodb database service. The service listens on all IPs by default, so there is no need to specify a listening IP. | Required when you need to deploy arangodb |
+| -p\|--pyroscope | Use this option to deploy and start the pyroscope service. The service listens on all IPs by default, so there is no need to specify a listening IP. | Required when you need to deploy the pyroscope server |
+| -E\|--elastic | Use this option to deploy the elasticsearch and logstash services, and specify the address of the elasticsearch server from which logstash reads messages (generally the IP of the current management node). When this option is not used, the elasticsearch service is not deployed. | Required when you need to deploy elasticsearch/logstash |
+| -S\|--srcdir | Use this option when deploying offline to specify the directory where the offline installation packages are located. | Required for offline deployment |
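+For example, a management node at the placeholder address 192.168.0.100 that should run all six middleware components, scraping two gala-gopher production nodes, could be deployed with a single invocation along the following lines (all IP addresses are illustrative, and the offline packages are assumed to be in ./gala_deploy_middleware):
+
+```
+# Illustrative offline invocation: deploy kafka, prometheus, elasticsearch/logstash,
+# arangodb, and pyroscope on this management node. 192.168.0.100 is the management
+# node IP; 192.168.0.1 and 192.168.0.2 are production nodes running gala-gopher.
+sh deploy.sh middleware \
+  -K 192.168.0.100 \
+  -P 192.168.0.1,192.168.0.2:18001 \
+  -E 192.168.0.100 \
+  -A -p \
+  -S ./gala_deploy_middleware
+```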
+#### Production node: deploy gala-gopher
+
+This tool downloads, installs, and deploys gala-gopher as a container image, and also supports daemonset deployment in k8s clusters.
+
+The offline resource download script and the auxiliary script download the required resources with one click. Upload the two scripts to the machine and execute the command to download the relevant offline resources.
+
+1. Download the gala-gopher container image for the corresponding version
+
+```
+sh download_offline_res.sh gopher [os_version] [os_arch]
+```
+
+os_version and os_arch can be configured at the same time (or both left at their default values):
+
+- os_version: the operating system version of the gala-gopher container image to download. When not configured, the current system version is used. Supported versions: openEuler-22.03-LTS-SP1, openEuler-22.03-LTS, openEuler-20.03-LTS-SP1, kylin-v10-sp1, kylin-v10-sp3.
+- os_arch: the architecture of the gala-gopher container image to download. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+The downloaded container image tar package and other resources are stored in the gala_deploy_gopher directory. The tar package file name has the format `gala-gopher-[os_arch]:[os_tag].tar`. The downloaded content looks like this:
+
+```
+gala-gopher-aarch64:22.03-lts-sp1.tar
+daemonset.yaml.tmpl
+```
+
+2. One-click tool deployment
+
+Before deployment, upload everything in the gala_deploy_gopher directory, together with the deployment script and the auxiliary script, to the target node (for the daemonset method, upload to the master node of the k8s cluster). Execute the following commands to install, configure, and start the gala-gopher service; the -S option specifies the directory where the offline installation package is located.
+
+- Container image deployment (for a single node)
+
+```
+sh deploy.sh gopher -K -p -S
+```
+
+- K8S daemonset deployment (for clusters)
+
+```
+sh deploy.sh gopher -K -p -S --k8s
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| --------------- | ------------------------------------------------------------ | ------------------------------- |
+| -K\|--kafka | Specify the target kafka server address to which gala-gopher reports collected data (generally the IP of the management node). When this option is not configured, the kafka server address defaults to localhost. | NO |
+| -p\|--pyroscope | Specify the pyroscope server address to which flame graphs are uploaded once gala-gopher's flame graph function is enabled (used by the front-end display; generally the IP of the management node). When this option is not configured, the pyroscope server address defaults to localhost. | NO |
+| -S\|--srcdir | Use this option when deploying offline to specify the directory where gala-gopher and its dependent packages are located. | Required for offline deployment |
+| --k8s | Deploy gala-gopher in the k8s cluster in daemonset mode | NO |
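+After the script finishes, a quick sanity check is to confirm that gala-gopher is actually running. A hedged sketch, assuming the docker runtime in the single-node case and kubectl access on the k8s master in the daemonset case (names may differ slightly depending on the image tag that was deployed):
+
+```
+# Single node: the gala-gopher container should be up.
+docker ps | grep gala-gopher
+
+# K8S daemonset deployment: a gala-gopher pod should be scheduled on every worker node.
+kubectl get daemonset --all-namespaces | grep gala-gopher
+kubectl get pods --all-namespaces -o wide | grep gala-gopher
+```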
+#### Management node: deploy gala-ops
+
+1. Download the gala-ops container images
+
+Before offline deployment, the gala-ops (gala-anteater/gala-spider/gala-inference) container image tar packages need to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_ops subdirectory of the current directory.
+
+```
+sh download_offline_res.sh ops [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the container images for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+2. One-click tool deployment
+
+Upload everything in the gala_deploy_ops directory, together with the deployment script and the auxiliary script, to the target management node, and execute the following command to install, configure, and start the gala-ops services. Use the -S option to specify the directory where the container image tar packages are located.
+
+```
+sh deploy.sh ops -K -P -A -S
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| ---------------- | ------------------------------------------------------------ | ------------------------------- |
+| -K\|--kafka | Specify the kafka server address from which gala-ops reads messages (generally the IP of the management node). When this option is not configured, the kafka server address defaults to localhost. | NO |
+| -P\|--prometheus | Specify the prometheus server address from which gala-ops reads data (generally the IP of the management node). When this option is not configured, the prometheus server address defaults to localhost. | NO |
+| -A\|--arangodb | Specify the arangodb server address where gala-ops stores relationship graph data (generally the IP of the management node). When this option is not configured, the arangodb server address defaults to localhost. | NO |
+| -S\|--srcdir | Use this option during offline deployment to specify the directory where the gala-ops container image tar packages are located. | Required for offline deployment |
+
+#### Management node: deploy grafana
+
+1. Download the grafana container image and the dependent python libraries
+
+Before offline deployment, the grafana container image needs to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_grafana subdirectory of the current directory:
+
+```
+sh download_offline_res.sh grafana [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the container image for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+2. One-click tool deployment
+
+Upload everything under gala_deploy_grafana, together with the deployment script and the auxiliary script, to the target management node and execute the following command to complete the deployment. Grafana runs as a container instance.
+
+```
+sh deploy.sh grafana -P -p -E -S --grafana_addr --grafana_addr_server_port
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| -------------------------- | ------------------------------------------------------------ | ------------------------------- |
+| -P\|--prometheus | Specify the prometheus data source address in grafana (generally the IP of the management node). When this option is not configured, the prometheus data source defaults to localhost. | NO |
+| -p\|--pyroscope | Specify the pyroscope data source address from which grafana reads flame graphs (generally the IP of the management node). When this option is not configured, the pyroscope data source defaults to localhost. | NO |
+| -E\|--elastic | Specify the elasticsearch data source address (generally the IP of the management node) from which grafana reads anomaly detection, topology map, and root cause location results. When this option is not used, the elasticsearch data source uses localhost | NO |
+| --grafana_addr | Specify the grafana front-end address to facilitate external access to the grafana front-end page.
When this option is not used, the default value is "http://localhost:3000" | NO | +| --grafana_addr_server_port | There is a grafana address server in the deployed container, which is used to obtain the grafana front-end address externally through the http interface. When this option is not used, the default value is 3010. For example, if the user executes `curl -X GET localhost:3010` , the command returns the grafana front-end address - "http://localost:3000" | NO | +| -S\|--srcdir | Use this option when deploying offline to specify the directory where the grafana installation package is located. | Required for offline deployment | + +### Online deployment + +#### Get deployment script + +Downloading separate deployment scripts and auxiliary scripts does not require downloading the entire tool. You can directly download them to the machine to be deployed through the following commands: + +``` +wget https://gitee.com/openeuler/gala-docs/raw/master/deploy/deploy.sh --no-check-certificate +wget https://gitee.com/openeuler/gala-docs/raw/master/deploy/comm.sh --no-check-certificate +``` + +#### Management node: deploy middleware + +Execute the following commands to install, configure, and start the kafka/prometheus/elasticsearch/logstash/arangodb/pyroscope service. The -K/-P/-E/-A/-p option supports separate deployment of corresponding components. -P is used for configuration. The prometheus server grabs a list of source addresses of messages (i.e., the production nodes where gala-gopher is deployed), and each address is separated by an English comma; due to dependencies, elasticsearch/logstash is uniformly controlled and bound to the installation through the -E option. + +``` +sh deploy.sh middleware -K -P -E -A -p +``` + +#### Production node: deploy gala-gopher + +Use the following commands to install, configure, and start the gala-gopher service: + +1. Container mirroring method (applicable to single node) + +``` +sh deploy.sh gopher -K -p +``` + +2. K8S daemonset deployment (applicable to clusters) + +``` +sh deploy.sh gopher -K -p --k8s +``` + +**Note: The daemonset method needs to be executed on the master node of the k8s cluster.** + +#### Management node: deploy gala-ops + +The gala-ops component supports rpm and container image deployment methods. You need to specify the kafka, prometheus, and arangodb server addresses when deploying. If not specified, the addresses of these middlewares use localhost by default. + +1. rpm mode (only supports openEuler 22.03 LTS SP1) + +``` +sh deploy.sh ops -K -P -A +``` + +2. Container image method: + +``` +sh deploy.sh ops -K -P -A --docker +``` + +#### Management node: deploy grafana + +Execute the following command to complete the front-end page deployment. Grafana will run as a container instance. + +``` +sh deploy.sh grafana -P -E +``` + +The gala-ops deployment demonstration video takes the openEuler 22.03 LTS version as an example to demonstrate the process of using the deployment tool to complete the deployment of gala-gopher on the generation node and the gala-ops component on the management node. + +After completing the above deployment actions, you can access "http://[deployment node IP]:3000" through the browser and log in to grafana to use A-Ops. The default login username and password are admin. The overall introduction video of A-Ops combines the grafana front-end display page to demonstrate the overall function of A-Ops. + +# Project roadmap + +A-Ops mainly selected 8 main scenarios and implemented related solutions in stages. 
gala-ops follows that scenario roadmap and defines its own feature delivery plan; the figure below shows the related scenario roadmap and the features implemented in each stage:
+
+![](./png/roadmap.png)
+
+# Feature introduction
+
+## Online application performance diagnosis
+
+### Feature background
+
+In a cloud environment, application performance is affected above all by environmental factors such as load and resources. Such factors cannot be reproduced in a laboratory, so online diagnosis capability is particularly important. Application performance diagnosis has two difficulties: 1) application performance degradation cannot be identified; 2) the root cause of the problem cannot be determined.
+
+- Unable to identify application performance degradation
+
+  For CSP vendors, this issue is as important as locating the root cause, because the services CSP vendors provide carry SLA commitments. Proactively identifying cloud service SLI degradation lets problems be discovered in advance and customer complaints be avoided, turning passive operation and maintenance into active operation and maintenance.
+
+  We use a DCS scenario that is common among CSP vendors to explain why it is difficult for them to detect cloud service SLI degradation.
+
+  > Distributed Cache Service (DCS) provides tenants with online distributed caching capabilities; common applications include Redis and Memcached. It is usually used to meet business requirements for high concurrency and fast data access, with common usage scenarios including e-commerce, live video, gaming applications, and social apps.
+
+  CSP vendors currently use two common methods to monitor DCS SLI performance: 1) dial testing that simulates tenant access; 2) performance management inside the DCS application software.
+
+  - Dial testing that simulates tenant access
+
+![](./png/DCS-1.png)
+
+The DCS SLI obtained by dial testing differs from the SLI that real tenants experience when accessing DCS; the differences include the network path, the access method, and the access frequency. This difference distorts the DCS performance picture obtained with this method.
+
+- Performance management inside the DCS application software
+
+![](./png/DCS-2.png)
+
+Measuring performance directly inside the DCS service application (such as Redis) looks like a good choice, but in practice it falls short. Tenants often complain that the DCS service misses its SLA while application-layer monitoring still shows nothing wrong. The reason is that application-layer statistics do not cover the system-level factors that affect application performance, such as TCP packet loss/congestion, network delay, block device I/O delay, and process scheduling delay.
+
+- Unable to determine the root cause of the problem
+
+  Still taking the DCS scenario as an example, all cloud services provided by CSP vendors are accessed by tenants over the network, so network factors are crucial to cloud service performance. Besides the network, the biggest influences on applications include I/O latency, scheduling latency, and memory allocation latency. Today these problems are mainly demarcated and located with OS diagnostic tools.
However, OS diagnostic tools have several problems:
+
+  - Tool fragmentation
+
+    OS diagnostic tools are fragmented and heterogeneous, and new tools keep appearing (BCC, blktrace, iostat, netstat, etc.). Knowing when, where, and how to use them depends on the experience of the operation and maintenance personnel, so operation and maintenance efficiency depends on individual experience.
+
+  - Limited usability in online environments
+
+    Most OS diagnostic tools cannot stay resident in the system and rely on capturing diagnostic data at the fault site, which makes them of little use against random, transient faults. In scenarios such as a temporary sys CPU surge, logins may fail and commands may not execute, leaving the tools unusable. In addition, some tools require extra privileges or cannot easily be installed in online environments.
+
+### solution
+
+#### High-fidelity collection of application performance SLIs
+
+Google proposed the VALET method for evaluating cloud service SLIs, which assesses application performance across five dimensions. We draw on this idea and evaluate application performance from two perspectives, throughput (capacity) and latency (other dimensions may be added to the evaluation later).
+
+![](./png/VALET.png)
+
+To stay general (no strong language dependency, no SDK changes to the application, etc.), gala-gopher provides a fairly universal way to collect application performance SLIs: the data is collected from the OS kernel's TCP perspective, so in theory the method applies to any TCP-based application.
+
+- TCP-layer collection of application latency
+
+  The difficulty in collecting latency is reducing the error introduced into the statistics by factors such as network retransmission, interrupt delay, and scheduling delay. Referring to the figure below, gala-gopher records the timestamp at which a business request (access request [^3]) reaches the kernel soft interrupt (TS1) and the timestamp at which the application reads the request at the system call (TS2). When the cloud service application sends its response, the Write system call is recorded (TS3); the response produces a TCP data stream, and a TCP ACK is generated once that stream reaches the requester (TS4). From these four timestamps we obtain:
+
+  Application latency SLI: TS4 - TS1 [^1]
+
+  Application receive-direction delay: TS2 - TS1 [^2]
+
+  Application send-direction delay: TS4 - TS3 [^2]
+
+  In this way we obtain the latency SLI for most applications, as well as the latency of the individual processing stages (which helps demarcate problems).
+
+  ![](./png/tcp-1.png)
+
+[^1]: Released in openEuler 22.03 SP1.
+[^2]: To be released in openEuler 23.09.
+[^3]: It is assumed that the cloud service application processes access requests on a single TCP connection in first-in, first-out order. If this assumption does not hold, the latency collected above may contain errors.
+
+- TCP-layer collection of application throughput
+
+  The difficulty of throughput collection is identifying short-lived drops in throughput.
For example, in some scenarios, the 20ms periodic sliding window does not move during TCP data transmission, resulting in data transmission that usually completes in 1 to 3 seconds, and the performance deteriorates to more than 10 seconds to complete. . + + To give a vivid example, TCP throughput monitoring is like highway monitoring. It needs to continuously monitor whether there are idle highway resources per unit time on the highway. The smaller the unit time, the higher the monitoring accuracy. + + The noise floor and accuracy of this kind of monitoring data observation brings challenges. This part of the capability is planned to be launched in the innovative version of openEuler 23.09. + + Note: The application throughput collected by gala-gopher in openEuler 22.03 LTS SP1 version still comes from the application itself rather than the OS system level. + +#### Basic software low-level analysis + +According to the previous introduction, locating the root cause of the problem cannot be separated from the observation at the OS system level. In view of the limitations of existing tools, gala-gopher locates the OS system background service and provides all-round observation capabilities of the basic software. Based on eBPF technology, continuous, The low-noise method is to collect basic software runtime data (mainly Metrics type data). All collected performance Metrics data will carry application (i.e. process/thread) tags, enabling drill-down observation of system operating status from an application perspective. + +Example: + +``` + { + table_name: "tcp_abn", + entity_name: "tcp_link", + fields: + ( + { + description: "id of process", + type: "key", + name: "tgid", --> + }, + { + description: "role", + type: "key", + name: "role", --> + }, + { + description: "client ip", + type: "key", + name: "client_ip", --> tcp client IP + }, + { + description: "server ip", + type: "key", + name: "server_ip", --> tcp server IP + }, + { + description: "client port", --> + type: "key", + name: "client_port", + }, + { + description: "server port", + type: "key", + name: "server_port", + }, + { + description: "protocol", + type: "key", + name: "protocol", + }, + { + description: "comm", + type: "label", + name: "comm", --> + }, + { + description: "retrans packets", + type: "gauge", + name: "retran_packets", --> + }, + { + description: "drops caused by backlog queue full", + type: "gauge", + name: "backlog_drops", + }, + { + description: "sock drop counter", + type: "gauge", + name: "sk_drops", + }, + { + description: "tcp lost counter", + type: "gauge", + name: "lost_out", + }, + { + description: "tcp sacked out counter", + type: "gauge", + name: "sacked_out", + }, + { + description: "drops caused by socket filter", + type: "gauge", + name: "filter_drops", + }, + { + description: "counter of tcp link timeout", + type: "gauge", + name: "tmout_count", + }, + ..... + ) + } +``` + +The scope of data observation includes network, I/O, memory, scheduling, etc. For details, please refer to here. + +![](./png/basic-analysis.png) + +Combine application performance SLI and basic software low-level data observation to establish a large application performance model. The former is KPI and the latter is feature quantity. Online problem analysis is completed through relevant components in gala-ops to find the contribution value to application performance degradation. The largest feature amount (i.e. a certain basic software low-level Metrics) + +### Case presentation + +Databases are similar to DCS. 
They often encounter interference from I/O and network factors, causing fluctuations in application performance. Below we use openGauss as a demonstration case. + +Application Performance Diagnostics Video + +## System performance diagnostics + +### Feature background + +System performance diagnosis is mainly used to provide daily inspections for system maintenance SRE, and provide diagnostic capabilities including network (TCP), I/O and other performance degradation. Suitable for tracing random problems, such as network and I/O performance fluctuations, Socket queue overflow, DNS access failure, system call failure, system call timeout, process scheduling timeout, etc. + +Supported system performance diagnostic types can be found here. + +### solution + +System performance diagnosis is divided into two categories: + +- System error/insufficient resources + + This type of problem usually relies on some expert experience rules and user configuration thresholds to identify system problems. + + The specific support scope is as follows: + + - TCP related exception events: refer here. Specific contents include: tcp OOM, TCP packet loss/retransmission, TCP 0 window, TCP link establishment timeout, etc. + - Socket-related abnormal events: refer here. Specific contents include: listening queue overflow, Accept queue overflow, Syn queue overflow, active/passive link establishment failure, SYNACK retransmission, etc. + - Process-related exceptions: see here. Specific contents include: system call failure, DNS access failure, iowait timeout, BIO error, scheduling timeout, etc. + - I/O related exception events: refer here. Specific contents include: Request timeout, Request error, insufficient disk space, insufficient inode resources, etc. + +- System performance fluctuations + + This type of problem usually cannot be judged simply through some rules and thresholds to determine whether performance fluctuations occur. Therefore, some AI algorithms are needed for real-time detection. gala-ops establishes system performance KPIs and corresponding feature quantities, conducts offline training + online learning on the collected data, and implements online anomaly detection. For specific principles, please refer here. + + Abnormal events related to system performance fluctuations: refer here. Specifically, they include TCP link establishment performance fluctuations, TCP transmission latency performance fluctuations, system I/O latency performance fluctuations, process I/O latency performance fluctuations, and disk read and write latency performance fluctuations. + +### Case presentation + +System Performance Diagnostics Video + +### Interface introduction + +System performance diagnosis results can also be notified to the outside world in the form of kafka topics. The diagnosis results identify the specific observation entities and the reasons for the exceptions. + +- Example 1: Block observation entity exception in the host object: + + ``` + { + "Timestamp": 1586960586000000000, + "event_id": "1586xxx_xxxx" + "Attributes": { + "entity_id": "xx", + "event_id": "1586xxx_xxxx", + "event_type": "sys", + "data": [....], // optional + "duration": 30, // optional + "occurred count": 6,// optional + }, + "Resource": { + "metrics": "gala_gopher_block_err_code", + }, + "SeverityText": "WARN", + "SeverityNumber": 13, + "Body": "20200415T072306-0700 WARN Entity(xx) IO errors occured. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d)." 
+ } + ``` + + After users subscribe to abnormal events through Kafka, they can manage them in tabular form and present the management in the form of time periods, as follows: + + | Time | Exception event ID | Observation entity ID | Metrics | description | + | ----------------- | ------------------ | --------------------- | -------------------------- | ------------------------------------------------------------ | + | 11:23:54 CST 2022 | 1586xxx_xxxx | xxx_xxxx | gala_gopher_block_err_code | 20200415T072306-0700 WARN Entity(xx) IO errors occured. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d). 20200415T072306-0700 WARN Entity(xx) IO errors occurred. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d). | + +## System I/O full stack observation + +### Feature background + +Distributed storage (including block storage, object storage, etc.) is an important service provided by CSP vendors. Common distributed storage services include EVS and OBS. Almost all CSP vendors have these cloud services, and these cloud services are also provided by other cloud services. Storage backend provider. Therefore, the operation and maintenance efficiency of distributed storage will determine the operation and maintenance efficiency of the entire system of CSP manufacturers. + +At the same time, the complex structure of the distributed storage system, the diversity of software sources, and the clustered and distributed deployment of the system all bring challenges to the operation and maintenance of this scenario. Specifically in: + +- Different business teams use different diagnostic tools, and there is a lack of data connection between the tools, resulting in low operation and maintenance efficiency. +- There is a lack of monitoring platform for cluster operation status from the I/O perspective, and cluster I/O fault diagnosis capabilities are insufficient. +- Lack of historical problem tracing capabilities and insufficient random fault diagnosis capabilities. + +### solution + +Draw the distributed storage cluster topology in real time from the I/O data flow perspective, and observe the I/O data flow of the distributed storage system from the full-stack I/O perspective. + +![](./png/io.png) + + Remark: + +1. In view of the diversity of distributed storage system software sources, ceph is used as an example to explain here. Different software solutions have different observation points. But the main idea is basically the same. +2. The system I/O full stack released by openEuler 22.03 SP1 is mainly for ceph scenarios (without SPDK acceleration), and its distributed storage scenarios will continue to be updated in subsequent updates. + +### Case presentation + +Distributed storage I/O full stack diagnosis video + +## Refined performance profiling + +### Feature background + +Users often encounter problems such as zombie processes, memory leaks, and CPU surges during daily operation and maintenance. These problems are manifested at the system level, but the root causes are often at the application level. In order to allow system operation and maintenance SRE to quickly delimit the scope of problems, gala-ops provides refined performance profiling capabilities, which supports long-term, online collection of system/application performance data, and can quickly diagnose including CPU surges and memory leaks (or continuous growth) , system call exceptions, insufficient resources and other problems. 
+
+### solution
+
+By sampling system stack data at high frequency through eBPF and system perf events, the operating state of the faulty system can be reconstructed with high fidelity; stack data can also be sampled at hook points on system resource operations to reconstruct resource usage in real time.
+
+It covers most programming languages (including C/C++, Go, Rust, Java, etc.) and provides online, continuous, full-stack collection.
+
+![](./png/stack_trace.png)
+
+- Sampling load estimate: taking a 10 ms sampling period as an example, a single sampling operation is estimated at roughly 10,000 instructions, while a CPU core can execute on the order of 10 million instructions in 10 ms. Sampled instructions divided by instructions executed per unit time is 10,000 / 10,000,000 ≈ 0.1%, so the theoretical sampling load is about 0.1% per core.
+- Data storage estimate: the sampled data must be kept for a while so that it can periodically be converted into function symbols. Assuming a 10 ms sampling period and a 1-minute conversion period, the minimum number of samples to retain is 1 min / 10 ms, i.e. about 6,000 sampling points per core; with headroom, estimate roughly 12,000 sampling points per core.
+
+gala-ops supports the Grafana graphical interface to help customers understand profiling results, including flame graphs and timeline graphs.
+
+### Case presentation
+
+Refined performance profiling video
+
+## K8S Pod full-stack observability and diagnosis
+
+### Function description
+
+The GALA project will fully support fault diagnosis in K8S scenarios, providing application drill-down analysis, microservice & DB performance observability, cloud native network monitoring, cloud native performance profiling, process performance diagnosis, and other features.
+
+- Easy deployment in a K8S environment: gala-gopher is deployed as a daemonset, with one gala-gopher instance on each worker node; gala-spider and gala-anteater are deployed on the K8S management node as containers.
+- Application drill-down analysis: provides fault diagnosis for sub-health problems in cloud-native scenarios and demarcates problems between the application and the cloud platform within minutes.
+  - Full-stack monitoring: provides application-oriented, fine-grained monitoring across the software stack, covering the language runtime (JVM), glibc, system calls, and the kernel (TCP, I/O, scheduling, etc.), so the impact of system resources on applications can be viewed in real time.
+  - Full-link monitoring: provides network flow topology (TCP, RPC) and software deployment topology information, from which a 3D topology of the system is constructed to show precisely which resources an application depends on and to identify the fault radius quickly.
+  - GALA causal AI: provides visual root cause derivation and can narrow a problem down to a resource node within minutes.
+
+  ![](./png/k8s-monitor.png)
+
+- Microservice & DB performance observability: provides non-intrusive observability of microservice and DB access performance, including:
+
+- HTTP 1.x access performance is observable, including throughput, latency, error rate, etc.
It supports API refined observability and HTTP Trace capabilities to facilitate viewing of abnormal HTTP request processes. +- PGSQL access performance is observable, including throughput, latency, error rate, etc. It supports refined observation capabilities based on SQL access and slow SQL Trace capabilities, making it easy to view specific SQL statements of slow SQL. + + ![](./png/db-monitor.png) + +Cloud native application performance profiling: Provides non-intrusive, zero-modification cross-stack profiling analysis tools, and can be connected to the pyroscope industry's common UI front-end. Technical features include: + +- Low noise floor: In benchmark test scenarios, the interference to applications is <2%. +- Multi-language: Supports common development languages ​​C/C++, Go, Rust, and Java. +- Multi-instance: Supports monitoring of multiple processes or containers at the same time, and the UI front-end can comparatively analyze the cause of the problem. +- Fine-grained: Supports specifying profiling scope, including processes, containers, and Pods. +- Multi-dimensional: Provides application profiling in different dimensions of OnCPU, OffCPU, and MemAlloc. + +![](./png/Pyroscope-UI.png) + +- Cloud native network monitoring: For K8S scenarios, it provides TCP, Socket, and DNS monitoring capabilities, and has more refined network monitoring capabilities. + +![](./png/network-monitor.png) + +- Process performance diagnosis: Middleware for cloud native scenarios (such as MySQL, Redis, etc.) provides process-level performance problem diagnosis capabilities, while monitoring process performance KPIs and process-related system layer Metrics (such as I/O, memory, TCP, etc.), complete Process performance KPI anomaly detection and system layer Metrics that affect the KPI (reasons that affect process performance). + +### Application scenarios + +![](./png/k8s-deploy.png) + +Deployment method: gala-gopher provides daemonset deployment, and each Work Node deploys a gala-gopher instance; gala-spider and gala-anteater are deployed to the K8S management Node in container mode. + +The relevant usage methods are as follows: + + gala-gopher daemonset deployment introduction + +Introduction to REST configuration + +# Features to be released + +1. System hidden danger inspection +2. Thread performance profiling (solve thread deadlocks, I/O bottlenecks and other difficult problems) + +# Q&A + +## How to use quickly + +How to quickly deploy the system? + +The community provides two types of quick deployment methods, 1) online deployment; 2) offline deployment; the former requires the user's installation environment to access the openEuler community; the latter can require the user to download the software and copy it to the installation environment; + +## Version matching + +1. What OS versions does gala-gopher support? + + - openEuler 22.03 SP1 and subsequent LTS versions are the versions officially recommended for customers; + + - LTS versions before openEuler 22.03 SP1 will also receive technical support from the openEuler community, but the community does not recommend customers to use them; + - Commercial OSs of the openEuler series (such as Kirin V10) will also receive technical support from the openEuler community, but commercial OS manufacturers have not yet provided commercial maintenance for gala-gopher; + - Technically, gala-gopher can be installed and deployed on non-openEuler series OS (such as SUSE 12, CentOS, etc.), but in principle, community technical support is not available; + +2. 
+
+## Version matching
+
+1. Which OS versions does gala-gopher support?
+
+   - openEuler 22.03 SP1 and subsequent LTS versions are the versions officially recommended to customers.
+   - LTS versions earlier than openEuler 22.03 SP1 also receive technical support from the openEuler community, but the community does not recommend them to customers.
+   - Commercial OSs of the openEuler series (such as Kirin V10) also receive technical support from the openEuler community, but the commercial OS vendors do not yet provide commercial maintenance for gala-gopher.
+   - Technically, gala-gopher can be installed and deployed on non-openEuler OSs (such as SUSE 12 and CentOS), but in principle community technical support is not available for them.
+
+2. Which kernel versions do gala-gopher's observation capabilities support?
+
+   | Kernel version | Observation capability range |
+   | -------------- | ------------------------------------------------------------ |
+   | 4.12 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS |
+   | 4.18 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+   | 4.19 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+   | 5.10 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+
+3. Does gala-gopher support cross-kernel-version compatibility?
+
+   - In openEuler 22.03 SP1, gala-gopher does not support cross-kernel-version compatibility.
+   - It is planned that in openEuler 22.03 SP3, gala-gopher will support cross-release compatibility (that is, within the 5.10 kernel version range, different release versions can use the same gala-gopher component).
+   - It is planned that in openEuler 23.03 SP1, gala-gopher will support cross-release compatibility on lower kernel versions (4.18/4.19).
+   - It is planned that in 2024, gala-gopher will support compatibility across major kernel versions (for example, the 5.10 and 4.19 kernels will be able to use the same gala-gopher software version).
+
+## Performance Testing
+
+1. What is gala-gopher's noise floor?
+
+   Test conditions:
+
+   - Hardware environment: x86_64 architecture, 16 cores, 8 GB memory, virtual machine
+   - Software environment: openEuler-22.03-LTS operating system, 5.10.0 kernel
+   - Application observation scope: 8 processes in total were observed, including:
+     - A Kafka server and a benchmark client with background traffic (1000 records/sec).
+     - A Redis server and a benchmark client with background traffic (30,000 requests/sec).
+
+   The test results are as follows:
+
+   | Test item | Single-core CPU usage (reported every 5 seconds) |
+   | ------------------------------------------------------------ | ------------------------------------------------ |
+   | systeminfo probe started alone | Average 1.5% |
+   | proc probe started alone | Average <1% |
+   | tcp probe started alone | Average <1% |
+   | io probe started alone | Average 2% |
+   | endpoint probe started alone | Average <1% |
+   | jvm probe started alone | Average 2% |
+   | stackprobe probe started alone | Average <1%, peaking at 1.5% (every 30 s) |
+   | systeminfo, proc, tcp, io, endpoint, jvm, and stackprobe probes started together | Average 5% |
+
+2. How many resources does the gala system require?
+
+   | Component | Deployment location | Resource requirements |
+   | -------------------- | ------------------- | ------------------------------------------------------------ |
+   | gala-gopher | Production node | 0.2 core, 100 MB memory (with only TCP and I/O collection enabled) |
+   | gala-spider | Management node | |
+   | gala-anteater | Management node | |
+   | prometheus | Management node | |
+   | Grafana | Management node | |
+   | kafka | Management node | |
+   | elasticsearch | Management node | |
+   | logstash | Management node | |
+   | arangodb | Management node | Constraint: only the x86 architecture is supported |
+   | pyroscope (optional) | Management node | |
+
+## Supported languages & protocol range
+
+| Protocol | Support |
+| ------------ | ------------------------------------------------------------ |
+| HTTP 1.x | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| PostgreSQL | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| MySQL | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| Dubbo | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| Redis | TO BE |
+| Kafka | TO BE |
+| HTTP 2.0 | TO BE |
+| MongoDB | TO BE |
+
+## Support for virtual machine, container, and K8S environments
+
+1. Multiple container runtimes are supported, including containerd, docker, and isula; container instances can be monitored in all three container runtime scenarios.
+
+2. Does gala-gopher support deployment and monitoring in K8S?
+
+   gala-gopher supports deployment as a daemonset in K8S environments. For the daemonset yaml, the container image Dockerfiles, and the container startup entry script, see the links below (a minimal deployment sketch is given after this list):
+
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/k8s/daemonset.yaml.tmpl
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/Dockerfile_2003_sp1_x86_64
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/Dockerfile_2003_sp1_aarch64
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/entrypoint.sh
+
+3. The collected data and reported events carry system context labels as follows:
+
+   - Node labels: system ID (unique within the cluster), management IP.
+   - Device labels: device name (such as a network card or disk).
+   - Process labels: process ID, process name, cmdline.
+   - Network labels: TCP client/server IP, server port, role label.
+   - Container labels: container ID, container name.
+   - Pod labels: Pod ID, Pod IP, Pod Name, Pod Namespace.
+   - User-defined labels: labels set by the user when the collection task is issued. User-defined labels are set through the dynamic configuration interface provided by gala-gopher; for details, see the link: Configuring probe extension labels.
+
+4. Public label information such as container and Pod labels is automatically expanded for process-level and container-level metrics.
+
+   Based on the metric data reported by a probe, the gala-gopher framework identifies whether a metric is process-level or container-level and supplements it with container and Pod label information; the probe itself does not need to collect these additional labels.
+
+   - Process-level metric: If the metric data collected by the probe includes the `tgid` label, the metric is a process-level metric. For process-level metrics, the gala-gopher framework supplements the container public labels (container ID and container name) and the Pod public labels (Pod ID, Pod Name, and Pod Namespace).
+   - Container-level metric: If the metric data collected by the probe includes the `container_id` label, the metric is a container-level metric. For container-level metrics, the gala-gopher framework supplements the container public labels (container name) and the Pod public labels (Pod ID, Pod Name, and Pod Namespace).
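+
+A minimal sketch of the daemonset deployment described in item 2 above. It assumes the daemonset.yaml.tmpl template has been rendered to a local file and that a gala-gopher image built from one of the linked Dockerfiles is available to the cluster; the file and object names are placeholders:
+
+```
+# Minimal daemonset deployment sketch; file and object names are assumptions.
+# Render k8s/daemonset.yaml.tmpl and build the image from the linked Dockerfiles first.
+kubectl apply -f gala-gopher-daemonset.yaml
+
+# Verify that a gala-gopher Pod is scheduled on every worker node
+kubectl get daemonset | grep gala-gopher
+kubectl get pods -o wide | grep gala-gopher
+```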
+
+## Java environment support
+
+Java environment observability is comprehensively supported, including Java application performance profiling, observation of network access by Java applications that use encryption, and JVM data collection.
+
+## How to build a system topology
+
+Topology construction principle: gala-gopher provides L4 network flow, load-sharing flow, L7 network flow, software deployment topology, and other information, and the 3D system topology is built from this information.
+
+![](./png/topo_theory.png)
+
+System topology purpose: It is mainly used by GALA causal AI for root cause derivation, tracing the fault source along the TCP/RPC topology and completing application-oriented drill-down root cause location along the deployment topology.
+
+- Key question 1: How is the network flow topology established when network NAT would otherwise prevent it?
+
+  gala-gopher translates the collected TCP/RPC flow topology information (the pre-NAT IP/port information) using the Linux conntrack table and reports the post-NAT network flow topology. (The same mapping can be inspected manually; see the sketch after this list.)
+
+- Key question 2: How is the network flow topology of middleware established?
+
+  gala-gopher provides load-balancing flow monitoring for network load-balancing middleware such as Nginx and Haproxy, so a load-sharing topology can be built quickly from TCP information.
+
+  gala-gopher provides message-body (topic) flow monitoring for message-bus middleware such as Kafka, so a message producer/consumer topology can be built quickly.
+
+- Key question 3: How is the K8S cluster topology established?
+
+  To be added.
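+
+For key question 1, the conntrack table that gala-gopher reads can also be inspected manually to see the original (pre-NAT) and reply (post-NAT) tuples of a flow. A small illustration, assuming conntrack-tools is installed and using port 6379 purely as an example:
+
+```
+# Illustrative only: inspect the Linux conntrack table that gala-gopher uses to resolve pre-NAT addresses.
+# Assumes conntrack-tools is installed; 6379 is just an example destination port.
+conntrack -L -p tcp 2>/dev/null | grep 'dport=6379'
+# Each matching entry shows the original (pre-NAT) and reply (post-NAT) address/port tuples.
+```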
+
+## Middleware that gala depends on
+
+| Middleware | Function | Substitutability analysis |
+| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| prometheus | Stores gala-gopher metrics data as time series, feeds gala-ops data processing, and backs the grafana pages | Required and not replaceable when the topology map, anomaly detection, and root cause location functions are used. Not required if you only need gala-gopher metrics data: you can set the metrics output mode in the gala-gopher configuration file to logs and read the data from the local directory /var/log/gala-gopher/metrics. |
+| kafka | Stores gala-gopher sub-health inspection data, observation object metadata, anomaly detection output, root cause location results, and other data for users or gala-ops internal components to subscribe to | Required and not replaceable when the topology map, anomaly detection, and root cause location functions are used. Not required if you only use gala-gopher's sub-health inspection: you can set the event output mode in the gala-gopher configuration file to logs and read the events from the local directory /var/log/gala-gopher/event. |
+| elasticsearch | Stores gala-gopher sub-health inspection data, anomaly detection output, root cause location results, and topology map data and serves them to the grafana front end | Required and not replaceable for displaying sub-health inspection data, anomaly detection output, root cause location results, and topology map data on the grafana pages. |
+| logstash | Preprocesses kafka messages and stores them in elasticsearch | Required and not replaceable for displaying sub-health inspection data, anomaly detection output, root cause location results, and topology map data on the grafana pages. |
+| arangodb | Stores the real-time topology data generated by gala-spider | Required and not replaceable when the topology map function is used. |
+| pyroscope | Stores gala-gopher flame graph data as time series; its built-in front end provides real-time preview, filtering, and side-by-side comparison of flame graphs, and it connects to the grafana pages to display flame graphs | Not required; the flame graph files can be read directly from the local directory /var/log/gala-gopher/stacktrace. |
+
+## How to use performance flame graphs
+
+1. Performance flame graph startup command example (basic):
+
+```
+curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": {"probe": ["oncpu"] }, "snoopers": {"proc_name": [{ "comm": "cadvisor"}] }, "state": "running"}'
+```
+
+The command above is the simplest flame graph probe startup command. With default parameters, this probe samples the CPU usage of the cadvisor process and periodically generates an svg flame graph locally. Opening the generated flame graph file shows the Go and C call stacks of the cadvisor process, as in the figure below.
+
+2. Performance flame graph startup command example (advanced):
+
+A more complete startup command is shown below. The flame graph probe can be customized by setting its parameters manually. For the complete list of configurable flame graph probe parameters, see Probe Operation Parameters.
+
+```
+curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": { "check_cmd": "", "probe": ["oncpu", "offcpu", "mem"] }, "snoopers": { "proc_name": [{ "comm": "cadvisor", "cmdline": "", "debugging_dir": "" }, { "comm": "java", "cmdline": "", "debugging_dir": "" }] }, "params": { "perf_sample_period": 100, "svg_period": 300, "svg_dir": "/var/log/gala-gopher/stacktrace", "flame_dir": "/var/log/gala-gopher/flamegraph", "pyroscope_server": "localhost:4040", "multi_instance": 1, "native_stack": 0 }, "state": "running"}'
+```
+
+This command starts a flame graph probe that samples the cadvisor and java processes with a 100 ms sampling period and generates three svg flame graphs locally every 300 s: a CPU (oncpu) flame graph, an offcpu flame graph, and a memory flame graph. At the same time, the flame graph data is reported per process instance to the pyroscope server, where it can be viewed in real time; alternatively, you can configure the pyroscope data source in grafana and view the flame graphs there.
+
+3. Example of viewing a performance flame graph:
+
+Once the flame graph probe is started with the commands above, the flame graph can be viewed in real time with different filtering options. The picture below shows the memory flame graph of a k8s Pod running a Java workload.
+
+You can select machine_id at (1) to view the process flame graphs on different hosts.
+
+You can select the process ID at (2) to view the flame graphs of different processes.
+
+You can select the flame graph type (oncpu/offcpu/mem) at (3) to view different types of flame graphs.
+
+(6) is the label of the process, in the format [pid]comm. If the process runs in a k8s container, there are also [Pod]name and [Con]name labels, see (4) and (5).
+
+The time range of the flame graph can be selected at (7).
+
+4. Using performance flame graphs to locate performance issues:
+
+The way flame graphs are displayed can also be configured flexibly according to business needs. For example, when a cloud service rolls out a grayscale release, containers of both the old and the new version are deployed in the environment. By viewing and comparing the flame graphs of container instances of different versions over the same time range, you can quickly spot behavioral differences between the versions and locate performance issues.
+
+In the following practice, containers running the old and the new version of a Kafka client were deployed on one host. Testing found that the CPU usage of the new version's container was slightly higher than that of the old version's.
+
+So the flame graphs of the two containers were displayed side by side in grafana, with the time range of interest selected. The figure clearly shows that the new version (right) calls String.format more often than the old version (left), which indicates that serialization operations added to the new version's code caused the increase in CPU usage.
+
+![](./png/compare_fg.png)
+
+## How to diagnose network problems
+
+Reference here
+
+## How to diagnose I/O problems
+
+Reference here
+
+## How to diagnose Java OOM problems
+
+To be added
+
+# Partners
+
+![](./png/partner.png)
-- Gitee