diff --git a/tb_plugins/profiling/README.md b/tb_plugins/profiling/README.md
index 0353b1b4df0003778aebe890bf7a17e4d556c000..3a18f4c6239f353c10362c9e0ba5aae052cb2c07 100644
--- a/tb_plugins/profiling/README.md
+++ b/tb_plugins/profiling/README.md
@@ -15,7 +15,7 @@ The PyTorch Profiler TensorBoard plugin provides powerful and intuitive visualiz
 ## Libkineto
 Libkineto is an in-process profiling library integrated with the PyTorch Profiler. Please refer to the [README](libkineto/README.md) file in the `libkineto` folder as well as documentation on the [new PyTorch Profiler API](https://pytorch.org/docs/master/profiler.html).

-## PyTorch TensorBoard Profiler
+## PyTorch TensorBoard Profiler NPU Plugin
 The goal of the PyTorch TensorBoard Profiler is to provide a seamless and intuitive end-to-end profiling experience, including straightforward collection from PyTorch and insightful visualizations and recommendations in the TensorBoard UI.
 Please refer to the [README](tb_plugin/README.md) file in the `tb_plugin` folder.

diff --git a/tb_plugins/profiling/tb_plugin/README.md b/tb_plugins/profiling/tb_plugin/README.md
index e3c00875d5df3d43b541ea0ce35faae6d7da58c9..91c11d7b800549bd6eb5e84aec14cce94de067c8 100644
--- a/tb_plugins/profiling/tb_plugin/README.md
+++ b/tb_plugins/profiling/tb_plugin/README.md
@@ -1,478 +1,199 @@
-# PyTorch Profiler TensorBoard Plugin
+# PyTorch Profiler TensorBoard NPU Plugin

-This is a TensorBoard Plugin that provides visualization of PyTorch profiling.
-It can parse, process and visualize the PyTorch Profiler's dumped profiling result,
-and give optimization recommendations.
+This tool is a TensorBoard plugin for visualizing PyTorch profiling data.
+It renders PyTorch profiling data collected and parsed on the Ascend platform, and it is also compatible with data collected and parsed on GPU.

-### Quick Installation Instructions
+### Quick Installation Instructions
+1. Install as a wheel package

-* Install from pypi
+* Installation dependencies:
+  pandas >= 1.0.0, tensorboard >= 1.15, != 2.1.0
+* Runtime dependencies:
+  torch >= 1.8,
+  torchvision >= 0.8

-  `pip install torch-tb-profiler`
+* The plugin ships as a wheel package; install it with:

-* Or you can install from source
+  `pip install torch-tb-profiler_npu_0.4.0_py3_none_any.whl`

-  Clone the git repository:
+2. Install from source

-  `git clone https://github.com/pytorch/kineto.git`
+* Clone the repository:

-  Navigate to the `kineto/tb_plugin` directory.
+  `git clone https://gitee.com/ascend/amtt`

-  Install with command:
+* Navigate to the `/tb_plugins/profiling/tb_plugin` directory.

-  `pip install .`
+* Run the install command:

-* Build the wheel
+  `pip install .`
+* Build the wheel

-  `python setup.py build_fe sdist bdist_wheel` \
-  **_Note_**: the build_fe step need setup yarn and Node.js
+  `python setup.py build_fe sdist bdist_wheel`
+
+  Note: the build_fe step requires yarn and Node.js to be set up

-  `python setup.py sdist bdist_wheel`

-### Quick Start Instructions
-
-* Prepare profiling data
-
-  We have prepared some sample profiling data at [kineto/tb_plugin/samples](./samples)
-  You can download it directly.
-  Or you can generate these profiling samples yourself by running
-  [kineto/tb_plugin/examples/resnet50_profiler_api.py](./examples/resnet50_profiler_api.py).
-  Also you can learn how to profile your model and generate profiling data from [PyTorch Profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html?highlight=tensorboard).
-
-  Note: The recommended way to produce profiling data is assigning `torch.profiler.tensorboard_trace_handler`
-  to `on_trace_ready` on creation of `torch.profiler.profile`.
-
-* Start TensorBoard
-
-  Specify the profiling data folder to `logdir` in TensorBoard.
-  If you use the above samples data, start TensorBoard with:
+  Take the wheel package from the `/tb_plugins/profiling/tb_plugin/dist` directory and install it following method 1 above.
+### Quick Start Instructions
+
+* Prepare profiling data
+
+  Place profiling data in the expected format under the directory to be read. The format is a three-level directory structure: the run level is the outermost directory (one complete set of profiling data is treated as one run for visualization); its subdirectories form the worker_span level (named `{worker}_{span}`); and the next level is a directory with the fixed name ASCEND_PROFILER_OUTPUT, which contains the data files this plugin loads and displays, such as trace_view.json, kernel_details.csv and operator_details.csv. The directory structure is as follows:
+* E.g. there are 2 runs: run1, run2 \
+  `run1` \
+  `--[worker1]_[span1]` \
+  `----ASCEND_PROFILER_OUTPUT` \
+  `------trace_view.json` \
+  `------kernel_details.csv` \
+  `--[worker2]_[span1]` \
+  `----ASCEND_PROFILER_OUTPUT` \
+  `------trace_view.json` \
+  `------operator_details.csv` \
+  `run2` \
+  `--[worker1]_[span1]` \
+  `----ASCEND_PROFILER_OUTPUT` \
+  `------memory_record.csv` \
+  `------operator_memory.csv`
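+
+  As a quick reference, data in this layout is typically produced by the Ascend PyTorch adapter's profiler. The snippet below is only a sketch: it assumes a `torch_npu` build whose `torch_npu.profiler` mirrors the stock `torch.profiler` API (`profile`, `schedule`, `tensorboard_trace_handler`), and `train_one_step` stands in for your own training step; check the documentation of your installed version for the exact interface.
+
+  ```python
+  import torch
+  import torch_npu  # Ascend PyTorch adapter (assumed available)
+
+  def train_one_step():
+      ...  # forward / backward / optimizer step of your model
+
+  # Assumed API, mirroring torch.profiler: results end up under ./samples
+  # in the three-level layout described above.
+  with torch_npu.profiler.profile(
+          schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
+          on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./samples")
+  ) as prof:
+      for _ in range(6):
+          train_one_step()
+          prof.step()  # marks a step boundary for the scheduler
+  ```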
+
+* Start TensorBoard

   `tensorboard --logdir=./samples`

-  If your web browser is not in the same machine that you start TensorBoard,
-  you can add `--bind_all` option, such as:
+  If your web browser is not on the same machine as the one where you start TensorBoard, append the `--bind_all` option, for example:

   `tensorboard --logdir=./samples --bind_all`

-  Note: Make sure the default port 6006 is open to the browser's host.
-
-* Open TensorBoard in Chrome browser
-
-  Open URL `http://localhost:6006` in the browser.
-  If you use `--bind_all` in tensorboard start command, the hostname may not be 'localhost'. You may find it in the log printed after the cmd.
-
-* Navigate to the PYTORCH_PROFILER tab
-
-  If the files under `--logdir` are too big or too many,
-  please wait a while and refresh the browser to check latest loaded result.
-
-* Loading profiling data from the cloud
-  * AWS S3 (S3://)
-
-    Install `boto3`. Set environment variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`. Optionally, `S3_ENDPOINT` can be set as well.\
-    For minio, the S3 url should start with the bucket name `s3:////` instead of minio prefix `s3://minio//`. At the same time, the `S3_ENDPOINT` is needed as well. \
-    Follow these guides to get set-up with minio:
-    * Server: https://docs.min.io/docs/minio-quickstart-guide.html
-    * MC Client: https://docs.min.io/docs/minio-client-quickstart-guide.html
+  Note: Make sure the default port 6006 is open to the browser's host.

-    For example, the following commands can be used to create minio storage:
-    ```bash
-    ./mc alias set s3 http://10.150.148.189:9000 minioadmin minioadmin
-    ./mc mb s3/profiler --region=us-east-1
-    ./mc cp ~/notebook/version_2 s3/profiler/ --recursive
-    export AWS_ACCESS_KEY_ID=minioadmin
-    export AWS_SECRET_ACCESS_KEY=minioadmin
-    export AWS_REGION=us-east-1
-    export S3_USE_HTTPS=0
-    export S3_VERIFY_SSL=0
-    export S3_ENDPOINT=http://localhost:9000
-    tensorboard --logdir=s3://profiler/version_2/ --bind_all
-    ```
+  To use a different port, append the desired port number, e.g. `--port=6007`:

-  * Azure Blob (https://\.blob.core.windows.net)
+  `tensorboard --logdir=./samples --port=6007`

-    Install `azure-storage-blob`. Optionally, set environment variable `AZURE_STORAGE_CONNECTION_STRING`.
+* Open TensorBoard in the browser

-  * Google Cloud (GS://)
+  Open the URL `http://localhost:6006` in the browser.
+  If TensorBoard was started with `--bind_all`, the hostname is not `localhost` but the bound host IP, which you can find in the log printed after the command.

-    Install `google-cloud-storage`.
+* The PYTORCH_PROFILER tab

-  ---
-  > **_NOTES:_** For AWS S3, Google Cloud and Azure Blob, the trace files need to be put on a top level folder under bucket/container.
-  ---
+  If the files under `--logdir` are too large or too numerous, please wait a while and refresh the browser to check the latest loaded result.

-  We prepared some sample data in blob, you can also access it using the command
+### Page Overview

-      tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/ --bind_all
-
-  and open TensorBoard your browser to see all the views described below.
-
-  Note: for accessing data in Azure Blob, you need to install torch-tb-profiler with `pip install torch-tb-profiler[blob]`
-### Quick Usage Instructions
-
-We regard each running with profiler enabled as a "run".
-In most cases a run is a single process. If DDP is enabled, then a run includes multiple processes.
-We name each process a "worker".
-
-Each run corresponds to a sub-folder under the folder specified by "--logdir".
-Each sub-folder contains one or more chrome trace files, one for each process.
-The kineto/tb_plugin/samples is an example of how the files are organized.
-
-You can select the run and worker on the left control panel.
+  Once the page has loaded, the left-hand panel looks like the figure below. Each run corresponds to a subfolder of the folder specified by `--logdir` (run1, run2 and so on, the first of the three directory levels).
+  Each subfolder contains one or more profiling data folders.

![Alt text](./docs/images/control_panel.PNG)

-Runs: Select a run. Each run is one execution of a PyTorch application with profiling enabled.
-
-Views: We organize the profiling result into multiple views,
-from coarse-grained (overview-level) to fine-grained (kernel-level).
-
-Workers: Select a worker. Each worker is a process. There could be multiple workers when DDP is used.
-
-Span: There may be multiple profiling trace files of different spans to be generated when using [torch.profiler.schedule](https://github.com/pytorch/pytorch/blob/master/torch/profiler/profiler.py#L24) as schedule of torch.profiler.
-You can select them with this selection box.
-
-Currently we have the following performance diagnosis views:
-- Overall View
-- Operator View
-- Kernel View
-- Trace View
-- Memory View
-- Distributed View
-
-We describe each of these views below.
-
-* Overall View
+Runs: All of the data under `--logdir`, organized in the three-level directory structure.

-  The overall view is a top level view of the process in your profiling run.
-  It shows an overview of time cost, including both host and GPU devices.
-  You can select the current worker in the left panel's "Workers" dropdown menu.
+Views: Multiple views of the analyzed data, including the Operator, NPU Kernel, Trace and Memory views.

-  An example of overall view:
-  ![Alt text](./docs/images/overall_view.PNG)
-
-  The 'GPU Summary' panel shows GPU information and usage metrics of this run, include name, global memory, compute capability of this GPU.
-  The 'GPU Utilization', 'Est. SM Efficiency' and 'Est. Achieved Occupancy' shows GPU usage efficiency of this run at different levels.
-  The 'Kernel Time using Tensor Cores' shows percent of the time Tensor Core kernels are active.
-  The detailed information about the above four metrics can be found at [gpu_utilization](./docs/gpu_utilization.md).
-
-  The 'Step Time Breakdown' panel shows the performance summary. We regard each iteration (usually a mini-batch) as a step.
-  The time spent on each step is broken down into multiple categories as follows:
-
-  1. Kernel: Kernels execution time on GPU device;
-
-  2. Memcpy: GPU involved memory copy time (either D2D, D2H or H2D);
-
-  3. Memset: GPU involved memory set time;
-
-  4. Communication: Communication time only appear in DDP case;
-
-  5. Runtime: CUDA runtime execution time on host side;
-     Such as cudaLaunchKernel, cudaMemcpyAsync, cudaStreamSynchronize, ...
-
-  6. DataLoader: The data loading time spent in PyTorch DataLoader object;
-
-  7. CPU Exec: Host compute time, including every PyTorch operator running time;
-
-  8. Other: The time not included in any of the above.
-
-  Note: The summary of all the above categories is end-to-end wall-clock time.
-
-  The above list is ranked by priority from high to low. We count time in priority order.
-  The time cost with highest priority category(Kernel) is counted first,
-  then Memcpy, then Memset, ..., and Other is counted last.
-  In the following example, the "Kernel" is counted first as 7-2=5 seconds;
-  Then the "Memcpy" is counted as 0 seconds, because it is fully hidden by "Kernel";
-  Then "CPU Exec" is counted as 2-1=1 seconds, because the [2,3] interval is hidden by "Kernel", only [1,2] interval is counted.
-
-  In this way, summarization of all the 7 categories' counted time in a step
-  will be the same with this step's total wall clock time.
-
-  ![Alt text](./docs/images/time_breakdown_priority.PNG)
-
-  Performance Recommendation: Leverage the profiling result to automatically highlight likely bottlenecks,
-  and give users actionable optimization suggestions.
+Workers-Spans: In multi-threaded scenarios, a profiling session may contain multiple sets of data. Use the Workers and Spans dropdown boxes to select results produced by different threads and by data collected at different times.

* Operator View

-  This view displays the performance of every PyTorch operator that is executed either on the host or device.
+  The Operator View shows detailed information about the PyTorch operators and compute operators that run on the host side and the device side.

![Alt text](./docs/images/operator_view.PNG)

-  Each table row is a PyTorch operator, which is a computation operator implemented by C++,
-  such as "aten::relu_", "aten::convolution".
-
-  Calls: How many times the operator is called in this run.
+  Calls: How many times the operator is called during the run.
+
+  Input Shapes: Shapes of the operator's inputs.

-  Device Self Duration: The accumulated time spent on GPU, not including this operator’s child operators.
+  Device Self Duration: The operator's time on the device side, excluding its child operators.

-  Device Total Duration: The accumulated time spent on GPU, including this operator’s child operators.
+  Device Total Duration: The operator's time on the device side, including its child operators.

-  Host Self Duration: The accumulated time spent on Host, not including this operator’s child operators.
+  Host Self Duration: The operator's time on the host side, excluding its child operators.

-  Host Total Duration: The accumulated time spent on Host, including this operator’s child operators.
+  Host Total Duration: The operator's time on the host side, including its child operators.

-  Tensor Cores Eligible: Whether this operator is eligible to use Tensor Cores.
+  AI Cores Eligible: Whether this operator runs on the AI Core.

-  Tensor Cores Self (%): Time of self-kernels with Tensor Cores / Time of self-kernels.
-  Self-kernels don't include kernels launched by this operator’s child operators.
+  AI Cores Self (%): The operator's time on the AI Core (excluding child operators) / Device Self Duration.

-  Tensor Cores Total (%): Time of kernels with Tensor Cores / Time of kernels.
+  AI Cores Total (%): The operator's time on the AI Core / Device Total Duration.

-  CallStack: All call stacks of this operator if it has been recorded in profiling trace file.
-  To dump this call stack information, you should set the 'with_stack' parameter in torch.profiler API.
-  The TensorBoard has integrated to VSCode, if you launch TensorBoard in VSCode, clicking this CallStack will forward to corresponding line of source code as below:
+  CallStack: All call stacks of this operator.
+  Note: Since some operators are related as parent and child (shown as containment in the trace), Self means the time excluding child operators, while Total means the time including all child operators; a small worked example follows at the end of this section.

![Alt text](./docs/images/vscode_stack.PNG)

-  Note: Each above duration means wall-clock time. It doesn't mean the GPU or CPU during this period is fully utilized.
-
-  The top 4 pie charts are visualizations of the above 4 columns of durations.
-  They make the breakdowns visible at a glance.
-  Only the top N operators sorted by duration (configurable in the text box) will be shown in the pie charts.
-
-  The search box enables searching operators by name.
-
-  "Group By" could choose between "Operator" and "Operator + Input Shape".
-  The "Input Shape" is shapes of tensors in this operator’s input argument list.
-  The empty "[]" means argument with scalar type.
-  For example, "[[32, 256, 14, 14], [1024, 256, 1, 1], [], [], [], [], [], [], []]"
-  means this operator has 9 input arguments,
-  1st is a tensor of size 32\*256\*14\*14,
-  2nd is a tensor of size 1024\*256\*1\*1,
-  the following 7 ones are scalar variables.
+  The page shows four pie charts and two tables; use the Group By control to switch between them. When Group By is set to Operator, the table is organized by operator name; after clicking an operator's View CallStack, the operator's information is broken down by call stack, and clicking View call frames shows the operator's call frames.
+  When Group By is switched to Operator + Input Shape, operators are organized by name and Input Shape.

![Alt text](./docs/images/operator_view_group_by_inputshape.PNG)
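+
+  As a small worked illustration of the Self/Total distinction (not plugin output), consider a hypothetical parent operator whose host total time is 5 ms and whose two child operators take 2 ms and 1 ms:
+
+  ```python
+  # Total covers an operator plus everything it calls; Self excludes the children.
+  child_total_ms = [2.0, 1.0]   # Host Total Duration of the two child operators
+  parent_total_ms = 5.0         # Host Total Duration of the parent operator
+
+  parent_self_ms = parent_total_ms - sum(child_total_ms)
+  print(parent_self_ms)         # 2.0, reported as the parent's Host Self Duration
+  ```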

* Kernel View

-  This view shows all kernels’ time spent on GPU.
-  The time is calculated by subtracting the kernel's start time from the end time.
-
-  Note: This view does not include cudaMemcpy or cudaMemset. Because they are not kernels.
+  The Kernel View shows detailed information about the operators that run on the accelerator cores.

![Alt text](./docs/images/kernel_view.PNG)

-  * Tensor Cores Used: Whether this kernel uses Tensor Cores.
-
-  * Total Duration: The accumulated time of all calls of this kernel.
-
-  * Mean Duration: The average time duration of all calls. That's "Total Duration" divided by "Calls".
-
-  * Max Duration: The maximum time duration among all calls.
-
-  * Min Duration: The minimum time duration among all calls.
-
-  Note: These durations only include a kernel's elapsed time on GPU device.
-  It does not mean the GPU is fully busy executing instructions during this time interval.
-  Some of the GPU cores may be idle due to reasons such as memory access latency or insufficient parallelism.
-  For example, there may be insufficient number of available warps per SM for the GPU to effectively
-  hide memory access latencies, or some SMs may be entirely idle due to an insufficient number of blocks.
-  Please refer to [Nvidia's best-practices guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html).
-  To investigate efficiency for each kernel, we calculate and show the 'Mean Blocks Per SM' and 'Mean Est. Achieved Occupancy' in the last two column.
-
-  * Mean Blocks Per SM: Blocks per SM = Blocks of this kernel / SM number of this GPU. If this number is less than 1, it indicates the GPU multiprocessors are not fully utilized. "Mean Blocks per SM" is weighted average of all runs of this kernel name, using each run’s duration as weight.
+  * Calls: How many times the kernel is dispatched.
+
+  * Accelerator Core: The compute core the kernel runs on.
+
+  * Block Dim: The number of slices the task is split into, which corresponds to the number of cores used when the task runs.

-  * Mean Est. Achieved Occupancy: The definition of Est. Achieved Occupancy can refer to [gpu_utilization](./docs/gpu_utilization.md), It is weighted average of all runs of this kernel name, using each run’s duration as weight.
+  ![Alt text](./docs/images/kernel_view_group_by_statistic.PNG)

-  The top left pie chart is a visualization of "Total Duration" column.
-  It makes the breakdowns visible at a glance.
-  Only the top N kernels sorted by accumulated time (configurable in the text box) will be shown in the pie chart.
+  * Operator: Name of the operator running on the NPU.

-  The top right pie chart is percent of the kernel time using and without using Tensor Cores.
+  * Accelerator Core Utilization: Percentage of the operator's execution time spent on each type of core.

-  The search box enables searching kernels by name.
-
-  "Group By" could choose between "Kernel Name" and "Kernel Properties + Op Name".
-
-  "Kernel Name" will group kernels by kernel name.
-
-  "Kernel Properties + Op Name" will group kernels by combination of kernel name, launching operator name,
-  grid, block, registers per thread, and shared memory.
-
-  ![Alt text](./docs/images/trace_view.PNG)
-
-  * Operator: The name of PyTorch operator which launches this kernel.
-
-  * Grid: Grid size of this kernel.
-
-  * Block: Block size of this kernel.
-
-  * Register Per Thread: Number of registers required for each thread executing the kernel.
-
-  * Shared Memory: Sum of dynamic shared memory reserved, and static shared memory allocated for this kernel.
+  * Total Duration, Max Duration, Avg Duration, Min Duration: The kernel's total, maximum, average and minimum time across all of its calls.
+
+  This view contains two pie charts and two tables; use Group By to switch the table data between the detailed kernel table and the statistics table. A sketch of reproducing the statistics offline follows below.
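+
+  As a rough sketch (not part of the plugin), the statistics table can be approximated offline with pandas, which is already a dependency. The path and column names below are assumptions; check the header of your own kernel_details.csv:
+
+  ```python
+  import pandas as pd
+
+  # Aggregate per-kernel statistics from an exported kernel_details.csv.
+  df = pd.read_csv("run1/worker1_span1/ASCEND_PROFILER_OUTPUT/kernel_details.csv")
+  stats = df.groupby("Name")["Duration(us)"].agg(
+      Calls="count", Total="sum", Max="max", Avg="mean", Min="min")
+  print(stats.sort_values("Total", ascending=False).head(10))
+  ```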

* Trace View

-  This view shows timeline using the chrome tracing plugin. Each horizontal area represents a thread or a CUDA stream.
-  Each colored rectangle represents an operator, or a CUDA runtime, or a GPU op which executes on GPU
-  (such as a kernel, a CUDA memory copy, a CUDA memory set, ...)
+  This view uses the Chrome tracing plugin to display the timeline of the whole training process.

![Alt text](./docs/images/trace_view.PNG)

-  In the above example:
-
-  The "thread 25772" is the CPU thread that do "backward" of neural network.
-
-  The "thread 25738" is the main CPU thread, which mainly do data loading, forward of neural network, and model update.
-
-  The "stream 7" is a CUDA stream, which shows all kernels of this stream.
-
-  You can see there are 6 "ProfilerStep" at the top of "thread 1". Each "ProfilerStep" represents a mini-batch step.
-
-  The suspended toolbar has functionalities to help view the trace line.
-  For example, when the up-down arrow is enabled,
-  you can zoom in by dragging the mouse up and keeping mouse's left button pushed down.
+  The Trace View mainly consists of three levels and, under each level, the chronological layout of the operators executed on each thread.

![Alt text](./docs/images/trace_view_one_step.PNG)

-  The "Optimizer.step#SGD.step" and "enumerate(DataLoader)#_SingleProcessDataLoaderIter.\__next\__"
-  are high-level python side functions.
+  The three levels currently are PTA, CANN and Ascend Hardware. Use the Processes selector to choose which levels to display.

-  When you select the top-right corner's "Flow events" to "async",
-  you can see the relationship between an operator and its launched kernels.
![Alt text](./docs/images/trace_view_launch.PNG)

+  Select only async_npu to view the correlation between framework-side operators and the operators executed on the Ascend hardware.

-  You can also view the gpu utilization and Est. SM Efficiency in the trace view. They are drawn alongside the timeline:

-  ![Alt text](./docs/images/trace_view_gpu_utilization.PNG)
+  ![Alt text](./docs/images/trace_view_npu_utilization.PNG)

-  When you select the top-right corner's "Flow events" to "fwd_bwd_correlation",
-  you can see the relationship between forward operator and its launched backward operator.
-  Note: Only the backward operator's direct launching forward operator will be connected by line,
-  its ancestor operators which call this operator as child will not be connected.
![Alt text](./docs/images/trace_view_fwd_bwd_correlation.PNG)

* Memory View

-  The Pytorch profiler records all memory allocation/release events and allocator's internal state during profiling. For
-  each operator, the plugin aggregates all the events inside its lifespan.
+  The Memory View shows operator-level memory allocation and release information recorded while the PyTorch profiler runs.

![Alt text](./docs/images/memory_view.PNG)
+![Alt text](./docs/images/memory_view_component.PNG)

-  The memory kind could be selected in 'Device' selection box.
For example, 'GPU0' means the following plot and tables only shows each - operator's memory usage on GPU 0, not including CPU or other GPUs. - - * Memory Curve - - Memory curve shows the memory usage trends. It helps the user get an overview on memory consumption. The 'Allocated' plot is the - total memory requested from the allocator, for example, used by tensors. The 'Reserved' plot only makes sense if the underlying - allocator make use of caching mechanism. It represents the total memory that is allocated from the operating system by the allocator. - - User can select on the memory curve plot and zoom into the selected range by pressing left mouse button and dragging on the curve. - Right click will reset the plot to the initial state. The selection will affect 'Memory Events' table and 'Memory Statistics' table - as mentioned in the following sections. - - * Memory Events - - Memory events table shows the memory allocation and release event pairs. Definition of each field in the table: - - * Operator: The immediate operator causing allocation from allocator. In pytorch, some operators such as - `aten::empty` is widely used as an API for tensor creation, in this case, we show it as ` ()`. - - * Size: The allocated memory size. - - * Allocation Time: Memory allocation time point relative to profiler start. It maybe missing from the table if the allocation event - is not included in the selected range. - - * Release Time: Memory deallocation time point relative to profiler start. It maybe missing from the table if the release event is - not included in the selected range. Notice, released memory block might still be cached by the underlying allocator. - - * Duration: The life duration of the allocated memory. It maybe missing from the table if Allocation Time or Release Time is absent. - - * Memory Statistics - - Definition of each field in the table: - - * Calls: How many times this operator is called. - - * Size Increase: The memory increase size includes all children operators. It sums up all allocation bytes and minus all the memory release bytes. - - * Self Size Increase: The memory increase size associated with the operator itself excluding that of its children. It sums up all allocation bytes and minus all the memory release bytes. - - * Allocation Count: The allocation count including all children operators. - - * Self Allocation Count: The allocation count belonging to the operator itself excluding its children. - - * Allocation Size: The allocation size including all children operators. It sums up all allocation bytes without considering the memory free. - - * Self Allocation Size: The allocation size belonging to the operator itself. It sums up all allocation bytes without considering the memory free. - - -* Distributed View - - This view will appear automatically only for DDP jobs that use nccl for communication. - There are four panels in this view: - - ![Alt text](./docs/images/distributed_view.PNG) - - * The top panel shows the information about nodes/processes/GPU hierarchy of this job. - - * The left panel in the middle is 'Computation/Communication Overview'. Definition of each legend: - * Computation: the sum of kernel time on GPU minus the overlapping time. - * Overlapping: the overlapping time of computation and communication. More overlapping represents better parallelism between computation and communication. Ideally the communication would be totally overlapped with computation. - * Communication: the total communication time minus the overlapping time. 
-  * Other: step time minus computation and communication time. Maybe includes initialization, data loader, CPU computation, and so on.
-
-  From this view, you can know computation-to-communication ratio of each worker and load balance between workers. For example, if the computation + overlapping time of
-one worker is much larger than others, there may be a problem of loading balance or this worker may be a straggler.
-
-  * The right panel in the middle is 'Synchronizing/Communication Overview'. Definition of each legend:
-    * Data Transfer Time: part in the total communication time for actual data exchanging.
-    * Synchronizing Time: part in the total communication time for waiting and synchronizing with other workers.
-
-  From this view, you can know the efficiency of communication (how much ratio of total communication time is really used for exchanging data and how much is just waiting for data from other workers)
-
-  * The 'Communication Operations Stats' summarizes the detailed statistics of all communication ops in each worker. Definition of each field:
-    * Calls: How many times this operator is called in this run.
-    * Total Size (bytes): Total data size transferred in operators of this type.
-    * Avg Size (bytes): Average data size transferred in each operator of this type.
-    * Total Latency (us): Total latency of all operators of this type.
-    * Avg Latency (us): Average latency of each operator of this type.
-    * Data Transfer Time (us): Total time actually used for data transfer in operator of this type.
-    * Ave Data Transfer Time (us): Average time actually used for data transfer in each operator of this type.
-
-* Module View
-
-  If the torch.nn.Module information is dumped into the result Chrome tracing file by Pytorch profiler, the plugin could display the nn.Module hierarchy and summary.
-
-  ![Alt text](./docs/images/module_view.png)
-
-  * The top table shows each torch.nn.Module statistics information including:
-    * Occurrences: how many times the module is called in the training process.
-    * Operators: how many operators the module invokes.
-    * Host Total Time: The accumulated time spent on Host, including the child submodule.
-    * Host Self Time: The accumulated time spent on Host, not including the child submodule.
-    * Device Total Time: The accumulated time spent on GPU of the operators contained in the module, including the child submodule.
-    * Device Self Time: The accumulated time spent on GPU of the operators contained in the module, not including the child submodule.
-
-  * The middle flamegraph shows the torch.nn.Module hierarchy information
-  * The bottom graph shows the main thread operators tree.
+  The Memory View mainly consists of two line charts and two tables. Use the 'Device' dropdown box to choose which NPU card's memory usage to display. Group By switches between charts of total memory usage and per-component memory usage.

-* Lightning View
+  * Operator

-  If the Chrome tracing file is from PytorchLightning job, the plugin will show a Lightning View which is customized for Pytorch Lightning.
-  All the data of this view is from PytorchLightning framework.
+    A summary of memory usage over the whole inference run.

-  ![Alt text](./docs/images/lightning_view.png)
+    Meaning of the table columns:

-  * The top table shows the model structure. The meaning of metrics in the table is same as Module View.
+    * Name: Name of the operator on the component side (PTA, etc.).
-  * The middle flamegraph shows the model hierarchy information.
-  * The bottom graph shows the call tree of all hooks in PytorchLightning.

-* Diff Run View
+    * Size: Size of the allocated memory.

-  The diff run feature helps to compare two run by logical timeline. The key comparision operators include backward, dataloader, torch.nn.Module, optimizer. If each operator contains these sub-operators internally, the diff run could be zoom in by click the bar.
+    * Allocation Time: Time at which the memory was allocated.

-  ![Alt text](./docs/images/diff_view.png)
+    * Release Time: Time at which the memory was released.

-  * The top bar chart shows each operator type and trend comparision result.
-  * The middle line chart shows the delta and accumulated execution time difference against each operator type.
-  * The bottom table show the operators difference for the following categories:
-    * Host Total Duration: The accumulated time spent on Host, including this operator’s child operators.
-    * Host Self Duration: The accumulated time spent on Host, not including this operator’s child operators.
-    * Device Total Duration: The accumulated time spent on GPU, including this operator’s child operators.
-    * Device Self Duration: The accumulated time spent on GPU, not including this operator’s child operators.
+    * Duration: How long the memory was held.

-### PyTorch Profiler TensorBoard Plugin 0.2 Release Notes
+  * Component

-Known Issues: This software does not support Python 3.9.0, 3.9.1, 3.9.2.
-If the TensorBoard launching reports error message "ImportError" and "circular import",
-please update your Python to higher version.
+    The charts show the memory usage of the PTA and GE components; the table shows each component's peak memory usage. A sketch of inspecting the exported CSVs directly follows below.
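+
+    As a rough sketch (not part of the plugin), the exported memory CSVs can also be inspected directly with pandas. The paths and column names below are assumptions; check the headers of your own files:
+
+    ```python
+    import pandas as pd
+
+    # Per-allocation records: Duration is the span between release and allocation.
+    mem = pd.read_csv("run2/worker1_span1/ASCEND_PROFILER_OUTPUT/operator_memory.csv")
+    mem["Duration(us)"] = mem["Release Time(us)"] - mem["Allocation Time(us)"]
+    print(mem.nlargest(10, "Size(KB)"))  # the ten largest allocations
+
+    # Record stream: the per-component maximum gives the peak usage shown in the table.
+    rec = pd.read_csv("run2/worker1_span1/ASCEND_PROFILER_OUTPUT/memory_record.csv")
+    print(rec.groupby("Component")["Total Allocated(MB)"].max())
+    ```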
\ No newline at end of file
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/control_panel.PNG b/tb_plugins/profiling/tb_plugin/docs/images/control_panel.PNG
index 31bd12d9ce7c0d5efa17056ea870de5e835a5031..148b396476c01e62a4af4561625dea6ab2337d58 100644
Binary files a/tb_plugins/profiling/tb_plugin/docs/images/control_panel.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/control_panel.PNG differ
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/kernel_view.PNG b/tb_plugins/profiling/tb_plugin/docs/images/kernel_view.PNG
index 53d0c57ae5ed36130db58d4e93fd2392a4ed8760..03e19c260e1a62af5cdd1805066242ec74b53702 100644
Binary files a/tb_plugins/profiling/tb_plugin/docs/images/kernel_view.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/kernel_view.PNG differ
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/kernel_view_group_by_statistic.PNG b/tb_plugins/profiling/tb_plugin/docs/images/kernel_view_group_by_statistic.PNG
new file mode 100644
index 0000000000000000000000000000000000000000..474e380f5b44ede3864bb60751feb347eb0625b3
Binary files /dev/null and b/tb_plugins/profiling/tb_plugin/docs/images/kernel_view_group_by_statistic.PNG differ
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/memory_view.PNG b/tb_plugins/profiling/tb_plugin/docs/images/memory_view.PNG
index 1b1446dfac2b0b3f9ac9ba0d2033659d6a248ca6..9bae55c09d1a8213fdad705255f73f99d5835a4a 100644
Binary files a/tb_plugins/profiling/tb_plugin/docs/images/memory_view.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/memory_view.PNG differ
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/memory_view_component.PNG b/tb_plugins/profiling/tb_plugin/docs/images/memory_view_component.PNG
new file mode 100644
index 0000000000000000000000000000000000000000..47687b924a3952460da004504da380adbf6e8781
Binary files /dev/null and b/tb_plugins/profiling/tb_plugin/docs/images/memory_view_component.PNG differ
diff --git a/tb_plugins/profiling/tb_plugin/docs/images/operator_view.PNG b/tb_plugins/profiling/tb_plugin/docs/images/operator_view.PNG
index 351c69883aed74573e53869b66fa21d2ed285c62..b375e2ca508e2d2dfcb7451b442fdb4461b71653 100644
Binary files a/tb_plugins/profiling/tb_plugin/docs/images/operator_view.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/operator_view.PNG differ
diff --git 
a/tb_plugins/profiling/tb_plugin/docs/images/operator_view_group_by_inputshape.PNG b/tb_plugins/profiling/tb_plugin/docs/images/operator_view_group_by_inputshape.PNG index cccdd1cefb37ab7e113cb5be2ed3ec7de3ffedaf..68b4784f1a54497219c32bcdaa0128adde5f5163 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/operator_view_group_by_inputshape.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/operator_view_group_by_inputshape.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view.PNG index aa1ced94750c0c449e9136c513b39638d4d520aa..3ee90b6fbdef5de1883ab616697ec846a169f5b4 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/trace_view.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/trace_view.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_fwd_bwd_correlation.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_fwd_bwd_correlation.PNG index c6536ac18d64694299e5389738006e408ca4931e..7926cca790f300688b495ff8db6aad16f8dbeb21 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_fwd_bwd_correlation.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_fwd_bwd_correlation.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_gpu_utilization.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_gpu_utilization.PNG deleted file mode 100644 index 4c8bbb0f54ebe6589b40d500cb33f72bca13b64c..0000000000000000000000000000000000000000 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_gpu_utilization.PNG and /dev/null differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_launch.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_launch.PNG index ec37f3a84ea009f26fd95f1a96161f55cc11a41f..d8bcaef2dd304d386c42c4c2b9c2fec202aaf83f 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_launch.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_launch.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_npu_utilization.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_npu_utilization.PNG new file mode 100644 index 0000000000000000000000000000000000000000..149362d3c3b11f43736dc0118910229e1d869ce4 Binary files /dev/null and b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_npu_utilization.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_one_step.PNG b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_one_step.PNG index 49690e3f594bf1ae6be1ab3f4079a43c863b74a5..97d004105746830763c7c8d280ac74dbd7000e06 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/trace_view_one_step.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/trace_view_one_step.PNG differ diff --git a/tb_plugins/profiling/tb_plugin/docs/images/vscode_stack.PNG b/tb_plugins/profiling/tb_plugin/docs/images/vscode_stack.PNG index afb99f06937642b207cce36db715be9f9ec78334..a109c97a1aaf0011c817e07c0ecb3224fae986f6 100644 Binary files a/tb_plugins/profiling/tb_plugin/docs/images/vscode_stack.PNG and b/tb_plugins/profiling/tb_plugin/docs/images/vscode_stack.PNG differ