From aa2b49848e2a945367853bd57b36223dc2e4eab9 Mon Sep 17 00:00:00 2001
From: lihengwei
Date: Tue, 12 Mar 2024 16:27:56 +0800
Subject: [PATCH] feature: adjust README.en.md

Signed-off-by: lihengwei
---
 README.en.md | 1021 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 997 insertions(+), 24 deletions(-)

diff --git a/README.en.md b/README.en.md
index 3ff995f..cb7073a 100644
--- a/README.en.md
+++ b/README.en.md
@@ -1,36 +1,1009 @@
 # gala-docs
-#### Description
-Handbook and requirements documentation
+#### Introduction
-#### Software Architecture
-Software architecture description
+![](./png/logo.png)
-#### Installation
+gala is an AI-assisted operating system sub-health diagnosis tool with a C/S architecture. Built on non-intrusive observation technologies (eBPF plus a Java agent) and assisted by AI, it diagnoses sub-health faults (such as performance jitter, increased error rates, and system freezes) within minutes and simplifies the operation and maintenance of IT infrastructure.
-1. xxxx
-2. xxxx
-3. xxxx
+# Background
-#### Instructions
+In recent years, with the adoption of cloud native, serverless, and other technologies, operating and maintaining cloud infrastructure has become increasingly complex. Sub-health problems in particular (intermittent occurrence, short duration, many problem types, wide impact, etc.) pose a major challenge to cloud infrastructure fault diagnosis, and the difficulties of sub-health diagnosis (observability, management of massive data, generalization of AI algorithms, etc.) are especially prominent in Linux scenarios. In the openEuler open source operating system, existing operation and maintenance methods are not sufficient to detect and locate sub-health problems in time: they lack online, continuous monitoring, fine-grained observation from the application perspective, and automation and AI analysis based on full-stack observation data. The key difficulties in diagnosing sub-health faults are:
-1. xxxx
-2. xxxx
-3. xxxx
+- Full-stack, non-intrusive observability.
+- Continuous, fine-grained, low-overhead monitoring.
+- Anomaly detection that adapts to different application scenarios, together with visual fault derivation.
-#### Contribution
+# Project Description
-1. Fork the repository
-2. Create Feat_xxx branch
-3. Commit your code
-4. Create Pull Request
+The overall architecture of gala is shown in the figure below; it is a C/S architecture. On each production node, gala-gopher runs as a Linux daemon that collects full-scenario, full-stack data (Metrics, Events, Tracing, etc.) and passes it to the management node through OpenTelemetry-compatible open interfaces (prometheus exporter, kafka client, etc.). The management node runs the gala-spider and gala-anteater components, which are responsible for cluster topology computation and visual root cause derivation respectively. The gala architecture relies on some open source middleware (prometheus, kafka, Elastic, etc.), but it can also connect to the middleware that already exists in the customer's IT system. gala is designed to be integrable into industry customers' IT operation and maintenance systems. It provides two types of integration methods:
+
+- Software ecosystem integration: use only the gala-gopher observability capability (obtaining data through the OpenTelemetry-style interfaces), or use all capabilities and obtain observation data, anomaly detection results, and visual derivation results through middleware such as prometheus, Elastic, and kafka (see the example below).
+- Tool integration: integrate the capabilities provided by gala into the customer's IT operation and maintenance system in the form of Grafana dashboards.
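+For example, with the software ecosystem integration method, any prometheus-compatible collector can scrape the data that gala-gopher exposes. A minimal manual check is sketched below, assuming a production node at the placeholder address 192.168.0.2 and the default gala-gopher exporter port 8888 (adjust the address, port, and metrics path to your deployment):
+
+```
+# Pull metrics from the gala-gopher exporter once by hand to confirm that data is flowing.
+# 192.168.0.2 is a placeholder production-node IP; 8888 is the default exporter port.
+curl -s http://192.168.0.2:8888 | head -n 20
+```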
-#### Gitee Feature
-1. You can use Readme\_XXX.md to support different languages, such as Readme\_en.md, Readme\_zh.md
-2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
-3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
-4. The most valuable open source project [GVP](https://gitee.com/gvp)
-5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
-6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
+![](./png/gala-arch.png)
+
+gala provides customers with the following operation and maintenance capabilities:
+
+- Online application performance jitter diagnosis: online performance diagnosis for applications such as databases, covering network problems (packet loss, retransmission, delay, TCP zero window, etc.), I/O problems (slow disks, I/O performance degradation, etc.), scheduling problems (sys CPU surges, deadlocks, etc.), memory problems (OOM, leaks, etc.), and more.
+- System performance bottleneck diagnosis: diagnosis of TCP and I/O performance jitter in general scenarios.
+- System hazard inspection: second-level inspection of kernel protocol stack packet loss, virtualized network packet loss, TCP exceptions, I/O delay exceptions, system call exceptions, resource leaks, JVM exceptions, application RPC exceptions (error rate, delay, etc. for 8 common protocols), hardware faults (UCE, disk media errors, etc.), and more.
+- System full-stack I/O observability: full-stack I/O observation for distributed storage scenarios, including process-level and block-layer I/O observation in the GuestOS, storage front-end I/O observation in the virtualization layer, and I/O observation in the distributed storage back end.
+- Refined performance profiling: multi-dimensional (system, process, container, Pod, etc.), high-precision (10 ms sampling period) flame graphs and timeline charts covering CPU usage, memory usage, resource usage, system calls, and so on, collected online, continuously, and in real time.
+- K8S Pod full-stack observability and diagnosis: real-time topology of Pod business flows from the K8S perspective, Pod performance observation, DNS observation, SQL observation, and more.
+
+The key technologies involved in gala include the following:
+
+- Integrated non-intrusive observation technology: combines the strengths of different observation technologies such as eBPF and Java agent to observe multiple languages (mainstream languages such as C/C++, Java, and Go) across the full software stack (the kernel, system calls, the glibc base library, the JVM runtime, base middleware such as Nginx/Haproxy, and so on).
+- Process topology: Based on time-series data (L4/L7 layer traffic, etc.), real-time calculation generates time-series topology structure and dynamically displays business cluster topology changes. +- Visualized root cause location: The statistical inference model combines the full process topology to achieve visual and minute-level problem root cause diagnosis. + +# Application scenarios + + gala is mainly oriented to scenarios in Linux environments such as openEuler, including databases, distributed storage, virtualization, cloud native and other scenarios. Help customers in finance, telecommunications, Internet and other industries to achieve minute-level diagnosis of sub-health faults based on full-stack observability. + +# Project code warehouse + +https://gitee.com/openeuler/gala-gopher + +https://gitee.com/openeuler/gala-spider + +https://gitee.com/openeuler/gala-anteater + +# Quick installation + +## architecture + +gala has a C/S architecture and can be deployed in a cluster or on a stand-alone basis. The entire architecture consists of gala-gopher and gala-ops. In cluster mode, gala-gopher is installed in the production node, and gala-ops is installed in the management node; in stand-alone mode, both are installed in the production node. + +Among them, gala-ops software includes gala-spider, gala-anteater, and gala-inference components. + +![](./png/csp_arch.png) + +## gala-gopher + +### positioning + +- Data collector: Provides low-level data collection of application granularity, including collection of system indicators in network, disk I/O, scheduling, memory, security, etc., and is also responsible for collection of application KPI data. Data types include logging, tracing, and metrics. +- System anomaly detection: Provides system anomaly detection capabilities, covering scenario system anomalies in network, disk I/O, scheduling, memory, etc. Users can set the upper and lower limits of exceptions through thresholds. +- Performance hotspot analysis: Provides CPU, memory, and IO flame graphs. + +### Principles and terminology + +gala-gopher software architecture reference here, it is a low-load probe framework based on eBPF technology. In addition to its own data collection, users can freely extend third-party probes. + +**terminology** + +- Probe: A program in gala-gopher that performs specific data collection tasks, including native and extend probes. The former starts the data collection task separately in thread mode, and the latter starts the data collection task in sub-process mode. gala-gopher can start some or all probes through configuration modification. +- Observation entity (entity_name): used to define the observation object in the system. All data collected by the probe will be attributed to a specific observation entity. Each observation entity is composed of key, label (optional), and metrics. For example, the key of the tcp_link observation entity includes information such as process number, IP quintuple, protocol family, etc., and metrics includes running status indicators such as tx, rx, rtt, etc. In addition to the natively supported observation entities, gala-gopher can also extend observation entities. +- Data table (table_name): The observation entity is composed of one or more data tables. Usually one data table is completed by one collection task. It can be seen that a single observation entity can be completed by multiple collection tasks. +- Meta file: Define observation entities (including internal data tables) through files. 
Meta files in the system must be unique and their definitions must not conflict; see the specification here.
+
+### Supported technologies
+
+Collection scope: refer here. Covers RED (Request, Error, Delay) observation of the kernel and basic software, including network, I/O, memory, network card, scheduling, Redis, kafka, Nginx, and more.
+
+System exception scope: refer here. Covers automatic inspection and reporting of more than 60 hidden system risk points, including TCP, Socket, process/thread, I/O, scheduling, etc.
+
+### Installation and use
+
+Refer here.
+
+### Expanding the scope of data collection
+
+Users who want to expand the scope of data collection only need to perform two steps: define observation entities and integrate data probes.
+
+- **Define observation entities**
+
+Define a new observation entity (or update an existing one) to carry the newly collected metrics data. The user defines the key, label (optional), and metrics of the observation entity through the meta file (refer here). Once defined, the meta file is placed in the probe directory.
+
+- **Integrate data probes**
+
+Users can wrap the data collection software in various programming languages (shell, python, java, etc.) and have the script output the collected data through a Linux pipe in the format defined by the meta file; refer here.
+
+See the cAdvisor third-party probe integration case for reference.
+
+## gala-spider
+
+### positioning
+
+- Topology map construction: provides OS-level topology map construction. It periodically obtains the data of all observation object instances collected by gala-gopher, calculates the topological relationships between them, and saves the generated topology map into the arangodb graph database.
+
+### Principles and terminology
+
+Refer here.
+
+### Supported technologies
+
+**Supported topological relationship types**
+
+There are often physical or logical relationships between OS observation entities; for example, threads belong to processes, and processes often have connections between them. gala-spider therefore defines some common topological relationship types. For details, see the gala-spider design document: Relationship type definition. Once the relationship types are defined, the topological relationships between observed entities can be defined and a topology diagram can be built.
+
+**List of supported entity relationships**
+
+gala-spider defines some topological relationships between observation entities by default. These relationships are configurable and extensible. For details, see the gala-spider design document: Supported topological relationships.
+
+### Installation and use
+
+Refer here.
+
+### Extending observation entities and relationships
+
+Refer here.
+
+## gala-anteater
+
+### positioning
+
+- Anomaly detection: provides minute-level anomaly detection for the operating system, promptly detecting system-level anomalies that may affect client latency and helping operation and maintenance personnel track down and resolve problems quickly.
+- Anomaly reporting: when abnormal behavior is discovered, it is reported to kafka in real time; operation and maintenance personnel only need to subscribe to the kafka message queue to know whether the current system is at risk (a consumption sketch follows this list).
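+As a sketch of what that subscription can look like from the command line, assume the management node's kafka broker is reachable at the placeholder address 192.168.0.100:9092 and that gala-anteater publishes anomaly events to a topic named, for illustration, gala_anteater_hybrid_model (substitute the broker address and the topic name configured in your deployment):
+
+```
+# Consume gala-anteater anomaly events with the console consumer that ships with
+# Apache Kafka. The install path, broker address, and topic name are placeholders.
+/opt/kafka/bin/kafka-console-consumer.sh \
+  --bootstrap-server 192.168.0.100:9092 \
+  --topic gala_anteater_hybrid_model \
+  --from-beginning
+```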
+
+### Principles and terminology
+
+gala-anteater is an AI-based anomaly detection platform for the operating system. It covers time-series data preprocessing, anomaly point discovery, and anomaly reporting. Based on offline pre-training plus incremental learning and online model updating, it adapts well to fault diagnosis over multi-dimensional, multi-modal data.
+
+- Fundamentals
+
+  By combining offline and online phases, online learning is used so that the model is trained offline, updated online, and then applied to online anomaly detection.
+
+  **Offline**: First, an offline historical KPI data set is preprocessed and feature-selected to obtain a training set; the training set is then used to train and tune an unsupervised neural network model (such as a variational autoencoder). Finally, the best model is selected with a manually labeled test set.
+
+  **Online**: The model trained offline is deployed online, real online data sets are used for online training and parameter tuning, and the trained model then performs real-time anomaly detection in the online environment.
+
+  ![](./png/anteater_arch.png)
+
+### Installation and use
+
+Refer here.
+
+## gala-inference
+
+### positioning
+
+- Root cause location: provides root cause location for abnormal KPIs. It takes the anomaly detection results and the topology map as input and outputs the root cause location results to kafka.
+
+### Principles and terminology
+
+Refer here.
+
+### Supported technologies
+
+**Expert rules**
+
+To improve the accuracy and interpretability of root cause location results, we analyzed real causal relationships between observed entities in the operating system domain and summarized a set of general expert rules that guide the subsequent root cause location algorithm. Details of these rules can be found in the gala-inference design document: Expert Rules.
+
+### Installation and use
+
+Refer here.
+
+## gala system integration
+
+gala also relies on some open source software, including kafka, arangodb, and prometheus. The figure below shows the gala system integration relationships: kafka transmits logging/tracing data to ES/logstash/jaeger, prometheus stores Metrics data, arangodb stores real-time topology data, and grafana provides the front-end page display.
+
+![](./png/system_integration.png)
+
+## gala system installation
+
+gala provides an integrated deployment tool, Gala-Deploy-Tools, that lets users quickly deploy the gala-gopher and gala-ops (gala-spider/gala-inference/gala-anteater) components, the kafka/prometheus/arangodb/es/logstash/pyroscope middleware, and the grafana front-end display components. Both offline and online deployment modes are supported.
+
+- kafka transmits gala software data
+- prometheus stores gopher metrics data
+- arangodb stores the real-time topology data generated by gala-spider
+- elasticsearch/logstash store gala data and support the grafana front-end display
+- pyroscope stores gopher flame graph data
+- grafana displays the gala front-end pages
+
+### Constraints
+
+1. Currently, this tool only supports the following OS versions: openEuler 20.03 LTS SP1, openEuler 22.03 LTS, openEuler 22.03 LTS SP1, Kylin V10 SP1 (x86), Kylin V10 SP3 (x86).
+2. In online deployment mode, the tool installs rpm packages from the openEuler repo source or downloads source code from the external network while it runs. In an internal network environment, a proxy therefore needs to be configured in advance so that the external network can be reached; it is recommended to remove the proxy after the tool has finished (a minimal sketch follows this list).
+3. In offline deployment mode, the offline installation packages and their dependencies still need to be downloaded from the external network. The same applies: configure a proxy in advance in an internal network environment, and remove it after using the tool.
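+A minimal sketch of the proxy handling described in constraints 2 and 3, assuming an ordinary HTTP proxy at the placeholder address proxy.example.com:8080 (this is standard Linux environment-variable configuration, not something specific to the deployment tool):
+
+```
+# Point downloads at the proxy before running the deployment tool ...
+export http_proxy=http://proxy.example.com:8080
+export https_proxy=http://proxy.example.com:8080
+export no_proxy=localhost,127.0.0.1
+
+# ... run download_offline_res.sh / deploy.sh here ...
+
+# ... and drop the proxy again once the tool has finished.
+unset http_proxy https_proxy no_proxy
+```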
+
+### Environment preparation
+
+Prepare at least two machines (physical machines or virtual machines) that meet the OS version and architecture requirements (see constraint 1), and make sure the machines can reach each other over the network (in online deployment mode, the external network must also be reachable).
+
+- Machine A: the production node, that is, the target node to be monitored and operated on. Business processes (such as databases, redis, and Java applications) generally run on it, and it is where the observation component gala-gopher is deployed.
+
+  ***Note: If there are multiple production nodes, gala-gopher needs to be deployed on each of them.***
+
+- Machine B: the management node, used to deploy middleware such as kafka as well as gala's anomaly detection and root cause location components. The placement of these components is flexible: several management nodes can be prepared and the components deployed separately, as long as the nodes can reach each other over the network.
+
+  ***Note: It is recommended that the management node have at least 8 CPU cores and 8 GB of memory (8U8G).***
+
+### Offline deployment
+
+The gala components depend on various middleware, so it is recommended to install and deploy in the following order: middleware -> gala-gopher/gala-ops -> grafana.
+
+#### Management node: deploy middleware
+
+The middleware currently involved comprises six components: kafka, prometheus, arangodb, elasticsearch, logstash, and pyroscope. elasticsearch and logstash depend on each other and need to be deployed together.
+
+1. Download the offline installation packages
+
+Before offline deployment, the six middleware installation packages need to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_middleware subdirectory of the current directory.
+
+```
+sh download_offline_res.sh middleware [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the installation packages for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+Note: Since kafka depends on java, java-1.8.0-openjdk and its dependent packages are also downloaded together with the kafka installation package; the arangodb component is downloaded as a container image tar package, so docker needs to be installed on the downloading machine.
+
+2. One-click tool deployment
+
+Upload everything under gala_deploy_middleware, together with the deployment script and the auxiliary script, to the target management node, then execute the following command to install, configure, and start the kafka/prometheus/elasticsearch/logstash/arangodb/pyroscope services. The -K/-P/-E/-A/-p options deploy the corresponding components individually, and the -S option specifies the directory where the offline installation packages are located.
+
+```
+sh deploy.sh middleware -K -P -E -A -p -S
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| --------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
+| -K\|--kafka | Use this option to deploy the kafka server and configure the specified listening IP address (generally the IP of the current management node). When this option is not used, the kafka service is not deployed. | Required when you need to deploy the kafka service |
+| -P\|--prometheus | Use this option to deploy the prometheus server and configure the list of scrape source addresses (that is, the production nodes where gala-gopher is deployed), separated by commas. An address can be followed by ":port number" to specify the scrape port; when not specified, the default port 8888 is used. You can also prepend "hostname-" to an address to label it, for example: -P 192.168.0.1,192.168.0.2:18001,vm01-192.168.0.3:18002. When this option is not used, the prometheus service is not deployed. | Required when you need to deploy the prometheus server |
+| -A\|--arangodb | Use this option to deploy and start the arangodb database service. The service listens on all IPs by default, so there is no need to specify a listening IP. | Required when you need to deploy arangodb |
+| -p\|--pyroscope | Use this option to deploy and start the pyroscope service. The service listens on all IPs by default, so there is no need to specify a listening IP. | Required when you need to deploy the pyroscope server |
+| -E\|--elastic | Use this option to deploy the elasticsearch and logstash services, and specify the address of the elasticsearch server from which logstash reads messages (generally the IP of the current management node). When this option is not used, the elasticsearch service is not deployed. | Required when you need to deploy elasticsearch/logstash |
+| -S\|--srcdir | Use this option when deploying offline to specify the directory where the offline installation packages are located. | Required for offline deployment |
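+For example, a management node at the placeholder address 192.168.0.100 that should run all six middleware components, scraping two gala-gopher production nodes, could be deployed with a single invocation along the following lines (all IP addresses are illustrative, and the offline packages are assumed to be in ./gala_deploy_middleware):
+
+```
+# Illustrative offline invocation: deploy kafka, prometheus, elasticsearch/logstash,
+# arangodb, and pyroscope on this management node. 192.168.0.100 is the management
+# node IP; 192.168.0.1 and 192.168.0.2 are production nodes running gala-gopher.
+sh deploy.sh middleware \
+  -K 192.168.0.100 \
+  -P 192.168.0.1,192.168.0.2:18001 \
+  -E 192.168.0.100 \
+  -A -p \
+  -S ./gala_deploy_middleware
+```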
+#### Production node: deploy gala-gopher
+
+This tool downloads, installs, and deploys gala-gopher as a container image, and also supports daemonset deployment in k8s clusters.
+
+The offline resource download script and the auxiliary script download the required resources with one click. Upload the two scripts to the machine and execute the command to download the relevant offline resources.
+
+1. Download the gala-gopher container image for the corresponding version
+
+```
+sh download_offline_res.sh gopher [os_version] [os_arch]
+```
+
+os_version and os_arch can be configured at the same time (or both left at their default values):
+
+- os_version: the operating system version of the gala-gopher container image to download. When not configured, the current system version is used. Supported versions: openEuler-22.03-LTS-SP1, openEuler-22.03-LTS, openEuler-20.03-LTS-SP1, kylin-v10-sp1, kylin-v10-sp3.
+- os_arch: the architecture of the gala-gopher container image to download. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+The downloaded container image tar package and other resources are stored in the gala_deploy_gopher directory. The tar package file name has the format `gala-gopher-[os_arch]:[os_tag].tar`. The downloaded content looks like this:
+
+```
+gala-gopher-aarch64:22.03-lts-sp1.tar
+daemonset.yaml.tmpl
+```
+
+2. One-click tool deployment
+
+Before deployment, upload everything in the gala_deploy_gopher directory, together with the deployment script and the auxiliary script, to the target node (for the daemonset method, upload to the master node of the k8s cluster). Execute the following commands to install, configure, and start the gala-gopher service; the -S option specifies the directory where the offline installation package is located.
+
+- Container image deployment (for a single node)
+
+```
+sh deploy.sh gopher -K -p -S
+```
+
+- K8S daemonset deployment (for clusters)
+
+```
+sh deploy.sh gopher -K -p -S --k8s
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| --------------- | ------------------------------------------------------------ | ------------------------------- |
+| -K\|--kafka | Specify the target kafka server address to which gala-gopher reports collected data (generally the IP of the management node). When this option is not configured, the kafka server address defaults to localhost. | NO |
+| -p\|--pyroscope | Specify the pyroscope server address to which flame graphs are uploaded once gala-gopher's flame graph function is enabled (used by the front-end display; generally the IP of the management node). When this option is not configured, the pyroscope server address defaults to localhost. | NO |
+| -S\|--srcdir | Use this option when deploying offline to specify the directory where gala-gopher and its dependent packages are located. | Required for offline deployment |
+| --k8s | Deploy gala-gopher in the k8s cluster in daemonset mode | NO |
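+After the script finishes, a quick sanity check is to confirm that gala-gopher is actually running. A hedged sketch, assuming the docker runtime in the single-node case and kubectl access on the k8s master in the daemonset case (names may differ slightly depending on the image tag that was deployed):
+
+```
+# Single node: the gala-gopher container should be up.
+docker ps | grep gala-gopher
+
+# K8S daemonset deployment: a gala-gopher pod should be scheduled on every worker node.
+kubectl get daemonset --all-namespaces | grep gala-gopher
+kubectl get pods --all-namespaces -o wide | grep gala-gopher
+```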
+#### Management node: deploy gala-ops
+
+1. Download the gala-ops container images
+
+Before offline deployment, the gala-ops (gala-anteater/gala-spider/gala-inference) container image tar packages need to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_ops subdirectory of the current directory.
+
+```
+sh download_offline_res.sh ops [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the container images for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+2. One-click tool deployment
+
+Upload everything in the gala_deploy_ops directory, together with the deployment script and the auxiliary script, to the target management node, and execute the following command to install, configure, and start the gala-ops services. Use the -S option to specify the directory where the container image tar packages are located.
+
+```
+sh deploy.sh ops -K -P -A -S
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| ---------------- | ------------------------------------------------------------ | ------------------------------- |
+| -K\|--kafka | Specify the kafka server address from which gala-ops reads messages (generally the IP of the management node). When this option is not configured, the kafka server address defaults to localhost. | NO |
+| -P\|--prometheus | Specify the prometheus server address from which gala-ops reads data (generally the IP of the management node). When this option is not configured, the prometheus server address defaults to localhost. | NO |
+| -A\|--arangodb | Specify the arangodb server address where gala-ops stores relationship graph data (generally the IP of the management node). When this option is not configured, the arangodb server address defaults to localhost. | NO |
+| -S\|--srcdir | Use this option during offline deployment to specify the directory where the gala-ops container image tar packages are located. | Required for offline deployment |
+
+#### Management node: deploy grafana
+
+1. Download the grafana container image and the dependent python libraries
+
+Before offline deployment, the grafana container image needs to be downloaded on a machine that can reach the external network. This tool provides an offline resource download script and an auxiliary script for one-click download. After uploading the two scripts to that machine, execute the following command to download the relevant offline resources. The downloaded content is stored in the gala_deploy_grafana subdirectory of the current directory:
+
+```
+sh download_offline_res.sh grafana [os_arch]
+```
+
+Optional options:
+
+- os_arch: download the container image for this architecture. When not configured, the current system architecture is used. Supported architectures: aarch64, x86_64.
+
+2. One-click tool deployment
+
+Upload everything under gala_deploy_grafana, together with the deployment script and the auxiliary script, to the target management node and execute the following command to complete the deployment. Grafana runs as a container instance.
+
+```
+sh deploy.sh grafana -P -p -E -S --grafana_addr --grafana_addr_server_port
+```
+
+Option details:
+
+| options | Parameter Description | Is it required? |
+| -------------------------- | ------------------------------------------------------------ | ------------------------------- |
+| -P\|--prometheus | Specify the prometheus data source address in grafana (generally the IP of the management node). When this option is not configured, the prometheus data source defaults to localhost. | NO |
+| -p\|--pyroscope | Specify the pyroscope data source address from which grafana reads flame graphs (generally the IP of the management node). When this option is not configured, the pyroscope data source defaults to localhost. | NO |
+| -E\|--elastic | Specify the elasticsearch data source address (generally the IP of the management node) from which grafana reads anomaly detection, topology map, and root cause location results. When this option is not used, the elasticsearch data source uses localhost | NO |
+| --grafana_addr | Specify the grafana front-end address to facilitate external access to the grafana front-end page.
When this option is not used, the default value is "http://localhost:3000" | NO | +| --grafana_addr_server_port | There is a grafana address server in the deployed container, which is used to obtain the grafana front-end address externally through the http interface. When this option is not used, the default value is 3010. For example, if the user executes `curl -X GET localhost:3010` , the command returns the grafana front-end address - "http://localost:3000" | NO | +| -S\|--srcdir | Use this option when deploying offline to specify the directory where the grafana installation package is located. | Required for offline deployment | + +### Online deployment + +#### Get deployment script + +Downloading separate deployment scripts and auxiliary scripts does not require downloading the entire tool. You can directly download them to the machine to be deployed through the following commands: + +``` +wget https://gitee.com/openeuler/gala-docs/raw/master/deploy/deploy.sh --no-check-certificate +wget https://gitee.com/openeuler/gala-docs/raw/master/deploy/comm.sh --no-check-certificate +``` + +#### Management node: deploy middleware + +Execute the following commands to install, configure, and start the kafka/prometheus/elasticsearch/logstash/arangodb/pyroscope service. The -K/-P/-E/-A/-p option supports separate deployment of corresponding components. -P is used for configuration. The prometheus server grabs a list of source addresses of messages (i.e., the production nodes where gala-gopher is deployed), and each address is separated by an English comma; due to dependencies, elasticsearch/logstash is uniformly controlled and bound to the installation through the -E option. + +``` +sh deploy.sh middleware -K -P -E -A -p +``` + +#### Production node: deploy gala-gopher + +Use the following commands to install, configure, and start the gala-gopher service: + +1. Container mirroring method (applicable to single node) + +``` +sh deploy.sh gopher -K -p +``` + +2. K8S daemonset deployment (applicable to clusters) + +``` +sh deploy.sh gopher -K -p --k8s +``` + +**Note: The daemonset method needs to be executed on the master node of the k8s cluster.** + +#### Management node: deploy gala-ops + +The gala-ops component supports rpm and container image deployment methods. You need to specify the kafka, prometheus, and arangodb server addresses when deploying. If not specified, the addresses of these middlewares use localhost by default. + +1. rpm mode (only supports openEuler 22.03 LTS SP1) + +``` +sh deploy.sh ops -K -P -A +``` + +2. Container image method: + +``` +sh deploy.sh ops -K -P -A --docker +``` + +#### Management node: deploy grafana + +Execute the following command to complete the front-end page deployment. Grafana will run as a container instance. + +``` +sh deploy.sh grafana -P -E +``` + +The gala-ops deployment demonstration video takes the openEuler 22.03 LTS version as an example to demonstrate the process of using the deployment tool to complete the deployment of gala-gopher on the generation node and the gala-ops component on the management node. + +After completing the above deployment actions, you can access "http://[deployment node IP]:3000" through the browser and log in to grafana to use A-Ops. The default login username and password are admin. The overall introduction video of A-Ops combines the grafana front-end display page to demonstrate the overall function of A-Ops. + +# Project roadmap + +A-Ops mainly selected 8 main scenarios and implemented related solutions in stages. 
gala-ops follows that scenario roadmap and defines its own feature delivery plan; the figure below shows the related scenario roadmap and the features implemented in each stage:
+
+![](./png/roadmap.png)
+
+# Feature introduction
+
+## Online application performance diagnosis
+
+### Feature background
+
+In a cloud environment, application performance is affected above all by environmental factors such as load and resources. Such factors cannot be reproduced in a laboratory, so online diagnosis capability is particularly important. Application performance diagnosis has two difficulties: 1) application performance degradation cannot be identified; 2) the root cause of the problem cannot be determined.
+
+- Unable to identify application performance degradation
+
+  For CSP vendors, this issue is as important as locating the root cause, because the services CSP vendors provide carry SLA commitments. Proactively identifying cloud service SLI degradation lets problems be discovered in advance and customer complaints be avoided, turning passive operation and maintenance into active operation and maintenance.
+
+  We use a DCS scenario that is common among CSP vendors to explain why it is difficult for them to detect cloud service SLI degradation.
+
+  > Distributed Cache Service (DCS) provides tenants with online distributed caching capabilities; common applications include Redis and Memcached. It is usually used to meet business requirements for high concurrency and fast data access, with common usage scenarios including e-commerce, live video, gaming applications, and social apps.
+
+  CSP vendors currently use two common methods to monitor DCS SLI performance: 1) dial testing that simulates tenant access; 2) performance management inside the DCS application software.
+
+  - Dial testing that simulates tenant access
+
+![](./png/DCS-1.png)
+
+The DCS SLI obtained by dial testing differs from the SLI that real tenants experience when accessing DCS; the differences include the network path, the access method, and the access frequency. This difference distorts the DCS performance picture obtained with this method.
+
+- Performance management inside the DCS application software
+
+![](./png/DCS-2.png)
+
+Measuring performance directly inside the DCS service application (such as Redis) looks like a good choice, but in practice it falls short. Tenants often complain that the DCS service misses its SLA while application-layer monitoring still shows nothing wrong. The reason is that application-layer statistics do not cover the system-level factors that affect application performance, such as TCP packet loss/congestion, network delay, block device I/O delay, and process scheduling delay.
+
+- Unable to determine the root cause of the problem
+
+  Still taking the DCS scenario as an example, all cloud services provided by CSP vendors are accessed by tenants over the network, so network factors are crucial to cloud service performance. Besides the network, the biggest influences on applications include I/O latency, scheduling latency, and memory allocation latency. Today these problems are mainly demarcated and located with OS diagnostic tools.
However, OS diagnostic tools have several problems:
+
+  - Tool fragmentation
+
+    OS diagnostic tools are fragmented and heterogeneous, and new tools keep appearing (BCC, blktrace, iostat, netstat, etc.). Knowing when, where, and how to use them depends on the experience of the operation and maintenance personnel, so operation and maintenance efficiency depends on individual experience.
+
+  - Limited usability in online environments
+
+    Most OS diagnostic tools cannot stay resident in the system and rely on capturing diagnostic data at the fault site, which makes them of little use against random, transient faults. In scenarios such as a temporary sys CPU surge, logins may fail and commands may not execute, leaving the tools unusable. In addition, some tools require extra privileges or cannot easily be installed in online environments.
+
+### solution
+
+#### High-fidelity collection of application performance SLIs
+
+Google proposed the VALET method for evaluating cloud service SLIs, which assesses application performance across five dimensions. We draw on this idea and evaluate application performance from two perspectives, throughput (capacity) and latency (other dimensions may be added to the evaluation later).
+
+![](./png/VALET.png)
+
+To stay general (no strong language dependency, no SDK changes to the application, etc.), gala-gopher provides a fairly universal way to collect application performance SLIs: the data is collected from the OS kernel's TCP perspective, so in theory the method applies to any TCP-based application.
+
+- TCP-layer collection of application latency
+
+  The difficulty in collecting latency is reducing the error introduced into the statistics by factors such as network retransmission, interrupt delay, and scheduling delay. Referring to the figure below, gala-gopher records the timestamp at which a business request (access request [^3]) reaches the kernel soft interrupt (TS1) and the timestamp at which the application reads the request at the system call (TS2). When the cloud service application sends its response, the Write system call is recorded (TS3); the response produces a TCP data stream, and a TCP ACK is generated once that stream reaches the requester (TS4). From these four timestamps we obtain:
+
+  Application latency SLI: TS4 - TS1 [^1]
+
+  Application receive-direction delay: TS2 - TS1 [^2]
+
+  Application send-direction delay: TS4 - TS3 [^2]
+
+  In this way we obtain the latency SLI for most applications, as well as the latency of the individual processing stages (which helps demarcate problems).
+
+  ![](./png/tcp-1.png)
+
+[^1]: Released in openEuler 22.03 SP1.
+[^2]: To be released in openEuler 23.09.
+[^3]: It is assumed that the cloud service application processes access requests on a single TCP connection in first-in, first-out order. If this assumption does not hold, the latency collected above may contain errors.
+
+- TCP-layer collection of application throughput
+
+  The difficulty of throughput collection is identifying short-lived drops in throughput.
For example, in some scenarios, the 20ms periodic sliding window does not move during TCP data transmission, resulting in data transmission that usually completes in 1 to 3 seconds, and the performance deteriorates to more than 10 seconds to complete. . + + To give a vivid example, TCP throughput monitoring is like highway monitoring. It needs to continuously monitor whether there are idle highway resources per unit time on the highway. The smaller the unit time, the higher the monitoring accuracy. + + The noise floor and accuracy of this kind of monitoring data observation brings challenges. This part of the capability is planned to be launched in the innovative version of openEuler 23.09. + + Note: The application throughput collected by gala-gopher in openEuler 22.03 LTS SP1 version still comes from the application itself rather than the OS system level. + +#### Basic software low-level analysis + +According to the previous introduction, locating the root cause of the problem cannot be separated from the observation at the OS system level. In view of the limitations of existing tools, gala-gopher locates the OS system background service and provides all-round observation capabilities of the basic software. Based on eBPF technology, continuous, The low-noise method is to collect basic software runtime data (mainly Metrics type data). All collected performance Metrics data will carry application (i.e. process/thread) tags, enabling drill-down observation of system operating status from an application perspective. + +Example: + +``` + { + table_name: "tcp_abn", + entity_name: "tcp_link", + fields: + ( + { + description: "id of process", + type: "key", + name: "tgid", --> + }, + { + description: "role", + type: "key", + name: "role", --> + }, + { + description: "client ip", + type: "key", + name: "client_ip", --> tcp client IP + }, + { + description: "server ip", + type: "key", + name: "server_ip", --> tcp server IP + }, + { + description: "client port", --> + type: "key", + name: "client_port", + }, + { + description: "server port", + type: "key", + name: "server_port", + }, + { + description: "protocol", + type: "key", + name: "protocol", + }, + { + description: "comm", + type: "label", + name: "comm", --> + }, + { + description: "retrans packets", + type: "gauge", + name: "retran_packets", --> + }, + { + description: "drops caused by backlog queue full", + type: "gauge", + name: "backlog_drops", + }, + { + description: "sock drop counter", + type: "gauge", + name: "sk_drops", + }, + { + description: "tcp lost counter", + type: "gauge", + name: "lost_out", + }, + { + description: "tcp sacked out counter", + type: "gauge", + name: "sacked_out", + }, + { + description: "drops caused by socket filter", + type: "gauge", + name: "filter_drops", + }, + { + description: "counter of tcp link timeout", + type: "gauge", + name: "tmout_count", + }, + ..... + ) + } +``` + +The scope of data observation includes network, I/O, memory, scheduling, etc. For details, please refer to here. + +![](./png/basic-analysis.png) + +Combine application performance SLI and basic software low-level data observation to establish a large application performance model. The former is KPI and the latter is feature quantity. Online problem analysis is completed through relevant components in gala-ops to find the contribution value to application performance degradation. The largest feature amount (i.e. a certain basic software low-level Metrics) + +### Case presentation + +Databases are similar to DCS. 
They often encounter interference from I/O and network factors, causing fluctuations in application performance. Below we use openGauss as a demonstration case. + +Application Performance Diagnostics Video + +## System performance diagnostics + +### Feature background + +System performance diagnosis is mainly used to provide daily inspections for system maintenance SRE, and provide diagnostic capabilities including network (TCP), I/O and other performance degradation. Suitable for tracing random problems, such as network and I/O performance fluctuations, Socket queue overflow, DNS access failure, system call failure, system call timeout, process scheduling timeout, etc. + +Supported system performance diagnostic types can be found here. + +### solution + +System performance diagnosis is divided into two categories: + +- System error/insufficient resources + + This type of problem usually relies on some expert experience rules and user configuration thresholds to identify system problems. + + The specific support scope is as follows: + + - TCP related exception events: refer here. Specific contents include: tcp OOM, TCP packet loss/retransmission, TCP 0 window, TCP link establishment timeout, etc. + - Socket-related abnormal events: refer here. Specific contents include: listening queue overflow, Accept queue overflow, Syn queue overflow, active/passive link establishment failure, SYNACK retransmission, etc. + - Process-related exceptions: see here. Specific contents include: system call failure, DNS access failure, iowait timeout, BIO error, scheduling timeout, etc. + - I/O related exception events: refer here. Specific contents include: Request timeout, Request error, insufficient disk space, insufficient inode resources, etc. + +- System performance fluctuations + + This type of problem usually cannot be judged simply through some rules and thresholds to determine whether performance fluctuations occur. Therefore, some AI algorithms are needed for real-time detection. gala-ops establishes system performance KPIs and corresponding feature quantities, conducts offline training + online learning on the collected data, and implements online anomaly detection. For specific principles, please refer here. + + Abnormal events related to system performance fluctuations: refer here. Specifically, they include TCP link establishment performance fluctuations, TCP transmission latency performance fluctuations, system I/O latency performance fluctuations, process I/O latency performance fluctuations, and disk read and write latency performance fluctuations. + +### Case presentation + +System Performance Diagnostics Video + +### Interface introduction + +System performance diagnosis results can also be notified to the outside world in the form of kafka topics. The diagnosis results identify the specific observation entities and the reasons for the exceptions. + +- Example 1: Block observation entity exception in the host object: + + ``` + { + "Timestamp": 1586960586000000000, + "event_id": "1586xxx_xxxx" + "Attributes": { + "entity_id": "xx", + "event_id": "1586xxx_xxxx", + "event_type": "sys", + "data": [....], // optional + "duration": 30, // optional + "occurred count": 6,// optional + }, + "Resource": { + "metrics": "gala_gopher_block_err_code", + }, + "SeverityText": "WARN", + "SeverityNumber": 13, + "Body": "20200415T072306-0700 WARN Entity(xx) IO errors occured. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d)." 
+ } + ``` + + After users subscribe to abnormal events through Kafka, they can manage them in tabular form and present the management in the form of time periods, as follows: + + | Time | Exception event ID | Observation entity ID | Metrics | description | + | ----------------- | ------------------ | --------------------- | -------------------------- | ------------------------------------------------------------ | + | 11:23:54 CST 2022 | 1586xxx_xxxx | xxx_xxxx | gala_gopher_block_err_code | 20200415T072306-0700 WARN Entity(xx) IO errors occured. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d). 20200415T072306-0700 WARN Entity(xx) IO errors occurred. (Block %d:%d, COMM %s, PID %u, op: %s, datalen %u, err_code %d, scsi_err %d, scsi_tmout %d). | + +## System I/O full stack observation + +### Feature background + +Distributed storage (including block storage, object storage, etc.) is an important service provided by CSP vendors. Common distributed storage services include EVS and OBS. Almost all CSP vendors have these cloud services, and these cloud services are also provided by other cloud services. Storage backend provider. Therefore, the operation and maintenance efficiency of distributed storage will determine the operation and maintenance efficiency of the entire system of CSP manufacturers. + +At the same time, the complex structure of the distributed storage system, the diversity of software sources, and the clustered and distributed deployment of the system all bring challenges to the operation and maintenance of this scenario. Specifically in: + +- Different business teams use different diagnostic tools, and there is a lack of data connection between the tools, resulting in low operation and maintenance efficiency. +- There is a lack of monitoring platform for cluster operation status from the I/O perspective, and cluster I/O fault diagnosis capabilities are insufficient. +- Lack of historical problem tracing capabilities and insufficient random fault diagnosis capabilities. + +### solution + +Draw the distributed storage cluster topology in real time from the I/O data flow perspective, and observe the I/O data flow of the distributed storage system from the full-stack I/O perspective. + +![](./png/io.png) + + Remark: + +1. In view of the diversity of distributed storage system software sources, ceph is used as an example to explain here. Different software solutions have different observation points. But the main idea is basically the same. +2. The system I/O full stack released by openEuler 22.03 SP1 is mainly for ceph scenarios (without SPDK acceleration), and its distributed storage scenarios will continue to be updated in subsequent updates. + +### Case presentation + +Distributed storage I/O full stack diagnosis video + +## Refined performance profiling + +### Feature background + +Users often encounter problems such as zombie processes, memory leaks, and CPU surges during daily operation and maintenance. These problems are manifested at the system level, but the root causes are often at the application level. In order to allow system operation and maintenance SRE to quickly delimit the scope of problems, gala-ops provides refined performance profiling capabilities, which supports long-term, online collection of system/application performance data, and can quickly diagnose including CPU surges and memory leaks (or continuous growth) , system call exceptions, insufficient resources and other problems. 
+
+### solution
+
+By sampling system stack data at high frequency through eBPF and system perf events, the operating state of the faulty system can be reconstructed with high fidelity; stack data can also be sampled at hook points on system resource operations to reconstruct resource usage in real time.
+
+It covers most programming languages (including C/C++, Go, Rust, Java, etc.) and provides online, continuous, full-stack collection.
+
+![](./png/stack_trace.png)
+
+- Sampling load estimate: taking a 10 ms sampling period as an example, a single sampling operation is estimated at roughly 10,000 instructions, while a CPU core can execute on the order of 10 million instructions in 10 ms. Sampled instructions divided by instructions executed per unit time is 10,000 / 10,000,000 ≈ 0.1%, so the theoretical sampling load is about 0.1% per core.
+- Data storage estimate: the sampled data must be kept for a while so that it can periodically be converted into function symbols. Assuming a 10 ms sampling period and a 1-minute conversion period, the minimum number of samples to retain is 1 min / 10 ms, i.e. about 6,000 sampling points per core; with headroom, estimate roughly 12,000 sampling points per core.
+
+gala-ops supports the Grafana graphical interface to help customers understand profiling results, including flame graphs and timeline graphs.
+
+### Case presentation
+
+Refined performance profiling video
+
+## K8S Pod full-stack observability and diagnosis
+
+### Function description
+
+The GALA project will fully support fault diagnosis in K8S scenarios, providing application drill-down analysis, microservice & DB performance observability, cloud native network monitoring, cloud native performance profiling, process performance diagnosis, and other features.
+
+- Easy deployment in a K8S environment: gala-gopher is deployed as a daemonset, with one gala-gopher instance on each worker node; gala-spider and gala-anteater are deployed on the K8S management node as containers.
+- Application drill-down analysis: provides fault diagnosis for sub-health problems in cloud-native scenarios and demarcates problems between the application and the cloud platform within minutes.
+  - Full-stack monitoring: provides application-oriented, fine-grained monitoring across the software stack, covering the language runtime (JVM), glibc, system calls, and the kernel (TCP, I/O, scheduling, etc.), so the impact of system resources on applications can be viewed in real time.
+  - Full-link monitoring: provides network flow topology (TCP, RPC) and software deployment topology information, from which a 3D topology of the system is constructed to show precisely which resources an application depends on and to identify the fault radius quickly.
+  - GALA causal AI: provides visual root cause derivation and can narrow a problem down to a resource node within minutes.
+
+  ![](./png/k8s-monitor.png)
+
+- Microservice & DB performance observability: provides non-intrusive observability of microservice and DB access performance, including:
+
+- HTTP 1.x access performance is observable, including throughput, latency, error rate, etc.
It supports API refined observability and HTTP Trace capabilities to facilitate viewing of abnormal HTTP request processes. +- PGSQL access performance is observable, including throughput, latency, error rate, etc. It supports refined observation capabilities based on SQL access and slow SQL Trace capabilities, making it easy to view specific SQL statements of slow SQL. + + ![](./png/db-monitor.png) + +Cloud native application performance profiling: Provides non-intrusive, zero-modification cross-stack profiling analysis tools, and can be connected to the pyroscope industry's common UI front-end. Technical features include: + +- Low noise floor: In benchmark test scenarios, the interference to applications is <2%. +- Multi-language: Supports common development languages ​​C/C++, Go, Rust, and Java. +- Multi-instance: Supports monitoring of multiple processes or containers at the same time, and the UI front-end can comparatively analyze the cause of the problem. +- Fine-grained: Supports specifying profiling scope, including processes, containers, and Pods. +- Multi-dimensional: Provides application profiling in different dimensions of OnCPU, OffCPU, and MemAlloc. + +![](./png/Pyroscope-UI.png) + +- Cloud native network monitoring: For K8S scenarios, it provides TCP, Socket, and DNS monitoring capabilities, and has more refined network monitoring capabilities. + +![](./png/network-monitor.png) + +- Process performance diagnosis: Middleware for cloud native scenarios (such as MySQL, Redis, etc.) provides process-level performance problem diagnosis capabilities, while monitoring process performance KPIs and process-related system layer Metrics (such as I/O, memory, TCP, etc.), complete Process performance KPI anomaly detection and system layer Metrics that affect the KPI (reasons that affect process performance). + +### Application scenarios + +![](./png/k8s-deploy.png) + +Deployment method: gala-gopher provides daemonset deployment, and each Work Node deploys a gala-gopher instance; gala-spider and gala-anteater are deployed to the K8S management Node in container mode. + +The relevant usage methods are as follows: + + gala-gopher daemonset deployment introduction + +Introduction to REST configuration + +# Features to be released + +1. System hidden danger inspection +2. Thread performance profiling (solve thread deadlocks, I/O bottlenecks and other difficult problems) + +# Q&A + +## How to use quickly + +How to quickly deploy the system? + +The community provides two types of quick deployment methods, 1) online deployment; 2) offline deployment; the former requires the user's installation environment to access the openEuler community; the latter can require the user to download the software and copy it to the installation environment; + +## Version matching + +1. What OS versions does gala-gopher support? + + - openEuler 22.03 SP1 and subsequent LTS versions are the versions officially recommended for customers; + + - LTS versions before openEuler 22.03 SP1 will also receive technical support from the openEuler community, but the community does not recommend customers to use them; + - Commercial OSs of the openEuler series (such as Kirin V10) will also receive technical support from the openEuler community, but commercial OS manufacturers have not yet provided commercial maintenance for gala-gopher; + - Technically, gala-gopher can be installed and deployed on non-openEuler series OS (such as SUSE 12, CentOS, etc.), but in principle, community technical support is not available; + +2. 
+
+## Version matching
+
+1. Which OS versions does gala-gopher support?
+
+   - openEuler 22.03 SP1 and subsequent LTS versions are the versions officially recommended to customers.
+   - LTS versions earlier than openEuler 22.03 SP1 also receive technical support from the openEuler community, but the community does not recommend them to customers.
+   - Commercial OSs of the openEuler series (such as Kirin V10) also receive technical support from the openEuler community, but the commercial OS vendors do not yet provide commercial maintenance for gala-gopher.
+   - Technically, gala-gopher can be installed and deployed on non-openEuler OSs (such as SUSE 12 and CentOS), but in principle community technical support is not available for them.
+
+2. Which kernel versions do gala-gopher's observation capabilities support?
+
+   | Kernel version | Observation capability range |
+   | -------------- | ------------------------------------------------------------ |
+   | 4.12 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS |
+   | 4.18 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+   | 4.19 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+   | 5.10 | Online performance flame graph, TCP, I/O, application L7 traffic, process, DNS, Redis (server side) latency, PG DB (server side) latency, openGauss (server side) latency |
+
+3. Does gala-gopher support cross-kernel-version compatibility?
+
+   - In openEuler 22.03 SP1, gala-gopher does not support cross-kernel-version compatibility.
+   - It is planned that in openEuler 22.03 SP3, gala-gopher will support cross-release compatibility (that is, within the 5.10 kernel version range, different release versions can use the same gala-gopher component).
+   - It is planned that in openEuler 23.03 SP1, gala-gopher will support cross-release compatibility on lower kernel versions (4.18/4.19).
+   - It is planned that in 2024, gala-gopher will support compatibility across major kernel versions (for example, the 5.10 and 4.19 kernels will be able to use the same gala-gopher software version).
+
+## Performance Testing
+
+1. What is gala-gopher's noise floor?
+
+   Test conditions:
+
+   - Hardware environment: x86_64 architecture, 16 cores, 8 GB memory, virtual machine
+   - Software environment: openEuler-22.03-LTS operating system, 5.10.0 kernel
+   - Application observation scope: 8 processes in total were observed, including:
+     - A Kafka server and a benchmark client with background traffic (1000 records/sec).
+     - A Redis server and a benchmark client with background traffic (30,000 requests/sec).
+
+   The test results are as follows:
+
+   | Test item | Single-core CPU usage (reported every 5 seconds) |
+   | ------------------------------------------------------------ | ------------------------------------------------ |
+   | systeminfo probe started alone | Average 1.5% |
+   | proc probe started alone | Average <1% |
+   | tcp probe started alone | Average <1% |
+   | io probe started alone | Average 2% |
+   | endpoint probe started alone | Average <1% |
+   | jvm probe started alone | Average 2% |
+   | stackprobe probe started alone | Average <1%, peaking at 1.5% (every 30 s) |
+   | systeminfo, proc, tcp, io, endpoint, jvm, and stackprobe probes started together | Average 5% |
+
+2. How many resources does the gala system require?
+
+   | Component | Deployment location | Resource requirements |
+   | -------------------- | ------------------- | ------------------------------------------------------------ |
+   | gala-gopher | Production node | 0.2 core, 100 MB memory (with only TCP and I/O collection enabled) |
+   | gala-spider | Management node | |
+   | gala-anteater | Management node | |
+   | prometheus | Management node | |
+   | Grafana | Management node | |
+   | kafka | Management node | |
+   | elasticsearch | Management node | |
+   | logstash | Management node | |
+   | arangodb | Management node | Constraint: only the x86 architecture is supported |
+   | pyroscope (optional) | Management node | |
+
+## Supported languages & protocol range
+
+| Protocol | Support |
+| ------------ | ------------------------------------------------------------ |
+| HTTP 1.x | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| PostgreSQL | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| MySQL | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| Dubbo | Languages: C/C++, Java, Go (TO BE), Rust (TO BE); encryption libraries: openSSL, JSSE, GoSSL (TO BE), Rustls (TO BE) |
+| Redis | TO BE |
+| Kafka | TO BE |
+| HTTP 2.0 | TO BE |
+| MongoDB | TO BE |
+
+## Support for virtual machine, container, and K8S environments
+
+1. Multiple container runtimes are supported, including containerd, docker, and isula; container instances can be monitored in all three container runtime scenarios.
+
+2. Does gala-gopher support deployment and monitoring in K8S?
+
+   gala-gopher supports deployment as a daemonset in K8S environments. For the daemonset yaml, the container image Dockerfiles, and the container startup entry script, see the links below (a minimal deployment sketch is given after this list):
+
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/k8s/daemonset.yaml.tmpl
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/Dockerfile_2003_sp1_x86_64
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/Dockerfile_2003_sp1_aarch64
+   - https://gitee.com/openeuler/gala-gopher/blob/dev/build/entrypoint.sh
+
+3. The collected data and reported events carry system context labels as follows:
+
+   - Node labels: system ID (unique within the cluster), management IP.
+   - Device labels: device name (such as a network card or disk).
+   - Process labels: process ID, process name, cmdline.
+   - Network labels: TCP client/server IP, server port, role label.
+   - Container labels: container ID, container name.
+   - Pod labels: Pod ID, Pod IP, Pod Name, Pod Namespace.
+   - User-defined labels: labels set by the user when the collection task is issued. User-defined labels are set through the dynamic configuration interface provided by gala-gopher; for details, see the link: Configuring probe extension labels.
+
+4. Public label information such as container and Pod labels is automatically expanded for process-level and container-level metrics.
+
+   Based on the metric data reported by a probe, the gala-gopher framework identifies whether a metric is process-level or container-level and supplements it with container and Pod label information; the probe itself does not need to collect these additional labels.
+
+   - Process-level metric: If the metric data collected by the probe includes the `tgid` label, the metric is a process-level metric. For process-level metrics, the gala-gopher framework supplements the container public labels (container ID and container name) and the Pod public labels (Pod ID, Pod Name, and Pod Namespace).
+   - Container-level metric: If the metric data collected by the probe includes the `container_id` label, the metric is a container-level metric. For container-level metrics, the gala-gopher framework supplements the container public labels (container name) and the Pod public labels (Pod ID, Pod Name, and Pod Namespace).
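+
+A minimal sketch of the daemonset deployment described in item 2 above. It assumes the daemonset.yaml.tmpl template has been rendered to a local file and that a gala-gopher image built from one of the linked Dockerfiles is available to the cluster; the file and object names are placeholders:
+
+```
+# Minimal daemonset deployment sketch; file and object names are assumptions.
+# Render k8s/daemonset.yaml.tmpl and build the image from the linked Dockerfiles first.
+kubectl apply -f gala-gopher-daemonset.yaml
+
+# Verify that a gala-gopher Pod is scheduled on every worker node
+kubectl get daemonset | grep gala-gopher
+kubectl get pods -o wide | grep gala-gopher
+```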
+
+## Java environment support
+
+Java environment observability is comprehensively supported, including Java application performance profiling, observation of network access by Java applications that use encryption, and JVM data collection.
+
+## How to build a system topology
+
+Topology construction principle: gala-gopher provides L4 network flow, load-sharing flow, L7 network flow, software deployment topology, and other information, and the 3D system topology is built from this information.
+
+![](./png/topo_theory.png)
+
+System topology purpose: It is mainly used by GALA causal AI for root cause derivation, tracing the fault source along the TCP/RPC topology and completing application-oriented drill-down root cause location along the deployment topology.
+
+- Key question 1: How is the network flow topology established when network NAT would otherwise prevent it?
+
+  gala-gopher translates the collected TCP/RPC flow topology information (the pre-NAT IP/port information) using the Linux conntrack table and reports the post-NAT network flow topology. (The same mapping can be inspected manually; see the sketch after this list.)
+
+- Key question 2: How is the network flow topology of middleware established?
+
+  gala-gopher provides load-balancing flow monitoring for network load-balancing middleware such as Nginx and Haproxy, so a load-sharing topology can be built quickly from TCP information.
+
+  gala-gopher provides message-body (topic) flow monitoring for message-bus middleware such as Kafka, so a message producer/consumer topology can be built quickly.
+
+- Key question 3: How is the K8S cluster topology established?
+
+  To be added.
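+
+For key question 1, the conntrack table that gala-gopher reads can also be inspected manually to see the original (pre-NAT) and reply (post-NAT) tuples of a flow. A small illustration, assuming conntrack-tools is installed and using port 6379 purely as an example:
+
+```
+# Illustrative only: inspect the Linux conntrack table that gala-gopher uses to resolve pre-NAT addresses.
+# Assumes conntrack-tools is installed; 6379 is just an example destination port.
+conntrack -L -p tcp 2>/dev/null | grep 'dport=6379'
+# Each matching entry shows the original (pre-NAT) and reply (post-NAT) address/port tuples.
+```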
+
+## Middleware that gala depends on
+
+| Middleware | Function | Substitutability analysis |
+| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| prometheus | Stores gala-gopher metrics data as time series, feeds gala-ops data processing, and backs the grafana pages | Required and not replaceable when the topology map, anomaly detection, and root cause location functions are used. Not required if you only need gala-gopher metrics data: you can set the metrics output mode in the gala-gopher configuration file to logs and read the data from the local directory /var/log/gala-gopher/metrics. |
+| kafka | Stores gala-gopher sub-health inspection data, observation object metadata, anomaly detection output, root cause location results, and other data for users or gala-ops internal components to subscribe to | Required and not replaceable when the topology map, anomaly detection, and root cause location functions are used. Not required if you only use gala-gopher's sub-health inspection: you can set the event output mode in the gala-gopher configuration file to logs and read the events from the local directory /var/log/gala-gopher/event. |
+| elasticsearch | Stores gala-gopher sub-health inspection data, anomaly detection output, root cause location results, and topology map data and serves them to the grafana front end | Required and not replaceable for displaying sub-health inspection data, anomaly detection output, root cause location results, and topology map data on the grafana pages. |
+| logstash | Preprocesses kafka messages and stores them in elasticsearch | Required and not replaceable for displaying sub-health inspection data, anomaly detection output, root cause location results, and topology map data on the grafana pages. |
+| arangodb | Stores the real-time topology data generated by gala-spider | Required and not replaceable when the topology map function is used. |
+| pyroscope | Stores gala-gopher flame graph data as time series; its built-in front end provides real-time preview, filtering, and side-by-side comparison of flame graphs, and it connects to the grafana pages to display flame graphs | Not required; the flame graph files can be read directly from the local directory /var/log/gala-gopher/stacktrace. |
+
+## How to use performance flame graphs
+
+1. Performance flame graph startup command example (basic):
+
+```
+curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": {"probe": ["oncpu"] }, "snoopers": {"proc_name": [{ "comm": "cadvisor"}] }, "state": "running"}'
+```
+
+The command above is the simplest flame graph probe startup command. With default parameters, this probe samples the CPU usage of the cadvisor process and periodically generates an svg flame graph locally. Opening the generated flame graph file shows the Go and C call stacks of the cadvisor process, as in the figure below.
+
+2. Performance flame graph startup command example (advanced):
+
+A more complete startup command is shown below. The flame graph probe can be customized by setting its parameters manually. For the complete list of configurable flame graph probe parameters, see Probe Operation Parameters.
+
+```
+curl -X PUT http://localhost:9999/flamegraph -d json='{ "cmd": { "check_cmd": "", "probe": ["oncpu", "offcpu", "mem"] }, "snoopers": { "proc_name": [{ "comm": "cadvisor", "cmdline": "", "debugging_dir": "" }, { "comm": "java", "cmdline": "", "debugging_dir": "" }] }, "params": { "perf_sample_period": 100, "svg_period": 300, "svg_dir": "/var/log/gala-gopher/stacktrace", "flame_dir": "/var/log/gala-gopher/flamegraph", "pyroscope_server": "localhost:4040", "multi_instance": 1, "native_stack": 0 }, "state": "running"}'
+```
+
+This command starts a flame graph probe that samples the cadvisor and java processes with a 100 ms sampling period and generates three svg flame graphs locally every 300 s: a CPU (oncpu) flame graph, an offcpu flame graph, and a memory flame graph. At the same time, the flame graph data is reported per process instance to the pyroscope server, where it can be viewed in real time; alternatively, you can configure the pyroscope data source in grafana and view the flame graphs there.
+
+3. Example of viewing a performance flame graph:
+
+Once the flame graph probe is started with the commands above, the flame graph can be viewed in real time with different filtering options. The picture below shows the memory flame graph of a k8s Pod running a Java workload.
+
+You can select machine_id at (1) to view the process flame graphs on different hosts.
+
+You can select the process ID at (2) to view the flame graphs of different processes.
+
+You can select the flame graph type (oncpu/offcpu/mem) at (3) to view different types of flame graphs.
+
+(6) is the label of the process, in the format [pid]comm. If the process runs in a k8s container, there are also [Pod]name and [Con]name labels, see (4) and (5).
+
+The time range of the flame graph can be selected at (7).
+
+4. Using performance flame graphs to locate performance issues:
+
+The way flame graphs are displayed can also be configured flexibly according to business needs. For example, when a cloud service rolls out a grayscale release, containers of both the old and the new version are deployed in the environment. By viewing and comparing the flame graphs of container instances of different versions over the same time range, you can quickly spot behavioral differences between the versions and locate performance issues.
+
+In the following practice, containers running the old and the new version of a Kafka client were deployed on one host. Testing found that the CPU usage of the new version's container was slightly higher than that of the old version's.
+
+So the flame graphs of the two containers were displayed side by side in grafana, with the time range of interest selected. The figure clearly shows that the new version (right) calls String.format more often than the old version (left), which indicates that serialization operations added to the new version's code caused the increase in CPU usage.
+
+![](./png/compare_fg.png)
+
+## How to diagnose network problems
+
+Reference here
+
+## How to diagnose I/O problems
+
+Reference here
+
+## How to diagnose Java OOM problems
+
+To be added
+
+# Partners
+
+![](./png/partner.png)
-- Gitee