# MsErrAnalyzer

**Repository Path**: adrian0711/ms-err-analyzer

## Basic Information

- **Project Name**: MsErrAnalyzer
- **Description**: Distributed log analysis tool for MindSpore/MindFormers
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 1
- **Created**: 2024-11-07
- **Last Updated**: 2024-11-22

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# Mindformers Distributed Log Analyzer

[Project page on Gitee](https://gitee.com/adrian0711/ms-err-analyzer)

## Introduction

This tool analyzes the logs generated while running distributed training tasks on the **Mindformers** framework and helps locate the cause of failures. It detects and summarizes errors, analyzes worker logs, and examines scheduler issues, making it easier to debug distributed model training.

---

## Features

1. **Minority Error Detection**: Identifies errors reported by ≤10% of workers, or by no more than 8 workers, and categorizes them as minority issues. Such errors are often reported by the root-cause node of a failure or performance problem (see the sketch after this list).

2. **Scheduler Log Analysis**: Extracts and analyzes warnings and errors from the scheduler log, especially heartbeat-loss messages, and correlates them with the corresponding worker logs to locate the first disconnected node.

3. **Timestamp Sorting Analysis**: Sorts the error lines of all worker logs by timestamp and reports the earliest ones. It can calibrate timestamps based on the last shared training step (enabled by default).

4. **Last Line Analysis**: Analyzes the last line of each worker log, groups similar lines, and reports minority and majority patterns. Workers whose last line differs from the others are often related to the root cause.

5. **Worker IP Analysis**: Lists all worker numbers and their corresponding TCP IP addresses, and counts how many times each scheduler IP appears in the worker logs (only when `detail_level` is set to 2).
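The minority rule in item 1 can be expressed as a simple threshold check. The sketch below is hypothetical: the function `find_minority_errors`, its inputs, and the sample error messages are assumptions for illustration, not the tool's actual code; the only grounded fact is the "≤10% of workers or no more than 8 workers" cutoff described above.

```python
def find_minority_errors(error_workers, total_workers, ratio=0.10, max_workers=8):
    """Return errors reported by <= `ratio` of all workers or by no more
    than `max_workers` workers (illustrative threshold only)."""
    threshold = max(total_workers * ratio, max_workers)
    return {
        message: sorted(workers)
        for message, workers in error_workers.items()
        if len(workers) <= threshold
    }

# Hypothetical input: error message -> set of worker ids that reported it.
errors = {
    "RuntimeError: HCCL communication timeout": {12, 47, 63},
    "WARNING: retrying TCP connection": set(range(100)),
}
print(find_minority_errors(errors, total_workers=100))
# Only the first message is kept: 3 of 100 workers is below both cutoffs.
```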
---

## Usage

### Command-Line Arguments

- `-d`, `--log_base_dir`: Directory containing the worker logs (`worker_*.log` or `rank_*.log`) and the scheduler log (`scheduler.log` or `sched.log`). Default: the current directory (`./`).
- `-o`, `--output_dir`: Directory where the analysis report (`log_report.txt`) and the collected error log (`all_errors.txt`) are saved. Default: `./analyze_output`.
- `-n`, `--report_filename`: Name of the output report file. Default: `log_report.txt`.
- `-l`, `--detail_level`: Integer controlling how detailed the report is (default: 1). Choices: 0 (less detail), 1 (default), 2 (more detail).
- `-f`, `--not_use_fuzz`: Do **not** use `rapidfuzz` for token comparison. If set, the script falls back to Python's standard library. Default: False (i.e. `rapidfuzz` is used if it is installed).
- `-s`, `--skip_steps`: List of analysis steps to skip: `1` for rare error analysis, `2` for scheduler analysis, `3` for timestamp sorting, `4` for last line analysis, and `5` for worker IP analysis.
- `-c`, `--calibrate_time`: Disable the timestamp calibration (based on the last shared training step) used in the timestamp sorting analysis. Calibration is enabled by default; specifying `-c` disables it.
- `-e`, `--throw_exception`: By default, exceptions are caught, written to the report, and execution continues. Specifying `-e` makes the script raise exceptions and stop as soon as an error is encountered.

---

### Example Commands

1. **Basic usage**: place the script in the same directory as the `worker_*.log` or `rank_*.log` files and run:

   ```bash
   python analyze_dist.py
   ```

2. **Full usage**: specify the log directory, output directory, and other parameters:

   ```bash
   python analyze_dist.py -d /path/to/logs -o /path/to/output -n report.txt -l 2 -s 2 5
   ```

   This sets the log directory to `/path/to/logs`, the output directory to `/path/to/output`, the report filename to `report.txt`, and the detail level to `2`, and skips step 2 (scheduler analysis) and step 5 (worker IP analysis).

---

## Output

The tool generates a report file with the following sections:

1. **Rare Error Analysis**: lists rare errors found in the worker logs, i.e. errors reported by ≤10% of workers or by no more than 8 workers.
2. **Scheduler Log Analysis**: shows warnings and errors from the scheduler log and correlates them with the corresponding worker logs.
3. **Timestamp Sorting Analysis**: reports the earliest error lines across all worker logs, sorted by timestamp. By default the tool calibrates timestamps based on the last shared training step; this can be disabled with `-c`.
4. **Last Line Analysis**: groups the last lines of the worker logs, highlighting minority and majority patterns among workers (a grouping sketch follows this section).
5. **Worker IP Analysis**: lists all worker numbers and their corresponding TCP IP addresses, and counts how many times each scheduler IP appears in the worker logs (only when `detail_level` is set to 2).

In addition, the tool saves all error lines from the worker logs into `all_errors.txt` in the output directory.
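The last line grouping (item 4) and the `rapidfuzz` fallback behaviour can be illustrated with a small sketch. Everything below is an assumption for illustration: the `similarity` helper, the `group_similar_lines` function, the 0.85 threshold, and the sample lines are not taken from the tool's source; the only grounded facts are that `rapidfuzz` is used when installed and that the standard library is the fallback.

```python
import difflib

try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a, b):
        return fuzz.ratio(a, b) / 100.0
except ImportError:
    def similarity(a, b):
        # Standard-library fallback; slower and less precise on long lines.
        return difflib.SequenceMatcher(None, a, b).ratio()

def group_similar_lines(lines, threshold=0.85):
    """Group lines whose similarity to a group's first member exceeds `threshold`."""
    groups = []  # each group is a list of (index, line)
    for idx, line in enumerate(lines):
        for group in groups:
            if similarity(line, group[0][1]) >= threshold:
                group.append((idx, line))
                break
        else:
            groups.append([(idx, line)])
    return groups

last_lines = [
    "Epoch 3 step 120 loss 2.31",
    "Epoch 3 step 120 loss 2.29",
    "RuntimeError: socket closed by peer",
]
for group in group_similar_lines(last_lines):
    print(len(group), "worker(s):", group[0][1])
```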
---

## Installation

To install the optional dependency `rapidfuzz` (for faster and more accurate string-similarity calculations):

```bash
pip install rapidfuzz
```

---

**WARNING**: If `rapidfuzz` is not installed, the script falls back to Python's built-in similarity methods, which may be slower and less accurate.

---

## FAQ

### What is the purpose of `detail_level`?

- The `detail_level` parameter controls the amount of detail in the report. A higher value includes more information.
- Possible values:
  - `0`: less detail.
  - `1`: the default level of detail.
  - `2`: the most detailed report (includes the Worker IP Analysis).

### How are rare errors defined?

- Rare errors are errors reported by 10% or fewer of the workers, or by no more than 8 workers. These errors are often root causes, especially in distributed systems where clocks may not be synchronized across nodes.

### What does the `--calibrate_time` option do?

- By default, the tool calibrates timestamps across worker logs based on the last shared training step, which helps align events across different workers. Specifying `--calibrate_time` (`-c`) disables this calibration. An illustrative calibration sketch appears at the end of this README.

### What happens if I don't install `rapidfuzz`?

- If `rapidfuzz` is not installed, the script uses Python's standard-library methods for similarity calculation. The script still runs, but performance and accuracy when grouping similar error messages may decrease.

---

## License

This project is licensed under the Apache License 2.0.

---

Feel free to modify the repository and extend the tool for your own distributed log analysis needs!

---
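As a closing illustration of the timestamp calibration discussed in the FAQ above, here is a minimal, hypothetical sketch of aligning worker clocks to the last training step they all report before sorting error lines. The data structures, the `calibrate_and_sort` helper, and the sample timestamps are assumptions for illustration and do not reflect the tool's actual parsing or calibration code.

```python
from datetime import datetime

def calibrate_and_sort(step_times, error_lines):
    """Shift each worker's timestamps so the last shared training step aligns,
    then sort error lines by the calibrated time (illustrative only).

    step_times: {worker_id: {step_number: timestamp_of_that_step}}
    error_lines: list of (worker_id, timestamp, text)
    """
    shared_steps = set.intersection(*(set(s) for s in step_times.values()))
    if not shared_steps:
        return sorted(error_lines, key=lambda e: e[1])  # no calibration possible
    last_shared = max(shared_steps)

    # Use one worker's clock as the reference and shift the others onto it.
    reference = min(step_times)
    ref_time = step_times[reference][last_shared]
    offsets = {w: ref_time - times[last_shared] for w, times in step_times.items()}

    calibrated = [(w, ts + offsets[w], text) for w, ts, text in error_lines]
    return sorted(calibrated, key=lambda e: e[1])

# Hypothetical data: two workers whose clocks differ by 5 seconds.
step_times = {
    0: {99: datetime(2024, 11, 7, 10, 0, 0), 100: datetime(2024, 11, 7, 10, 0, 30)},
    1: {99: datetime(2024, 11, 7, 10, 0, 5), 100: datetime(2024, 11, 7, 10, 0, 35)},
}
errors = [
    (1, datetime(2024, 11, 7, 10, 1, 0), "RuntimeError: send failed"),
    (0, datetime(2024, 11, 7, 10, 0, 58), "RuntimeError: recv timeout"),
]
for worker, ts, text in calibrate_and_sort(step_times, errors):
    print(worker, ts, text)  # worker 1's error sorts first after calibration
```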