From bc776d9ed2687926be8d7535b5772ca47b46f648 Mon Sep 17 00:00:00 2001 From: xinyanhe Date: Wed, 15 Nov 2023 15:02:42 +0800 Subject: [PATCH 1/5] [built-in][PyTorch][ReFormer for PyTorch]Submit ReFormer NPU adaption and performance improvement code --- .../nlp/ReFormer_for_PyTorch/README.md | 180 + .../public_address_statement.md | 48 +- .../nlp/ReFormer_for_PyTorch/requirements.txt | 12 +- .../nlp/ReFormer_for_PyTorch/run_mlm.py | 24 +- .../nlp/ReFormer_for_PyTorch/test/env_npu.sh | 47 + .../test/train_full_1p.sh | 132 + .../test/train_full_8p.sh | 136 + .../test/train_performance_1p.sh | 127 + .../test/train_performance_8p.sh | 131 + .../transformers_modify/modeling_reformer.py | 2685 ++++++++++++ .../transformers_modify/trainer.py | 3892 +++++++++++++++++ .../transformers_modify/trainer_pt_utils.py | 1126 +++++ .../transformers_modify/training_args.py | 2458 +++++++++++ 13 files changed, 10985 insertions(+), 13 deletions(-) create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/env_npu.sh create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_1p.sh create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_8p.sh create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_1p.sh create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_8p.sh create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/modeling_reformer.py create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer.py create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer_pt_utils.py create mode 100644 PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/training_args.py diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md new file mode 100644 index 0000000000..3d038b4a48 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md @@ -0,0 +1,180 @@ +# ReFormer for PyTorch + +- [概述](概述.md) +- [准备训练环境](准备训练环境.md) +- [开始训练](开始训练.md) +- [训练结果展示](训练结果展示.md) +- [版本说明](版本说明.md) + + + +# 概述 + +## 简述 + +Reformer: The Efficient Transformer是发表在ICLR 2020的一项工作,致力于用最小的内存训练Transformer网络。本文有两位共同一作,Nikita Kitaev和Łukasz Kaiser,后者是Attention is all you need作者之一。Reformer主要涉及了四处创新点:1. 使用Axial Positional Embedding来减小位置编码矩。2. 提出了基于局部敏感性哈希算法(Locality Sensitive Hashing, LSH)的self-attention机制。3. 提出可以使用分段的方式来处理全连接层(FFN)的计算过程,每次只计算一部分,从而不需要将整个序列加载到内存。4. 
使用可逆(Reversible)残差连接来代替传统残差连接,好处是在前向过程中前N-1层的中间值和输出都不需要存储了,只保留最后一层的输出。本文档描述的Reformer是基于Pytorch实现的版本。 + + +- 参考实现: + + ``` + url=https://github.com/huggingface/transformers.git + commit_id=7378726df60b9cf399aacfe372fea629c1c4c7d3 + ``` + +- 适配昇腾 AI 处理器的实现: + + ``` + url=https://gitee.com/ascend/ModelZoo-PyTorch.git + code_path=PyTorch/built-in/nlp/ + ``` + + +# 准备训练环境 + +## 准备环境 + +- 当前模型支持的 PyTorch 版本和已知三方库依赖如下表所示。 + + **表 1** 版本支持表 + + | Torch_Version | 三方库依赖版本 | + | :--------: | :----------------------------------------------------------: | + | PyTorch 1.11 | transformers==4.28.1 | + + +- 环境准备指导。 + + 请参考《[Pytorch框架训练环境准备](https://www.hiascend.com/document/detail/zh/ModelZoo/pytorchframework/ptes)》。 + +- 安装依赖。 + + 在模型源码包根目录下执行命令,安装模型对应PyTorch版本需要的依赖。 + ``` + pip install -r requirements.txt + ``` + +- 替换transformers库中相关文件 + + 将源码包根目录下transformers_modify文件夹中的各个文件分别替换到transformers库安装目录下的对应位置(基于**transformers 4.28.1**版本): + + ```shell + training_args.py -> transformers/training_args.py + trainer.py -> transformers/trainer.py + trainer_pt_utils.py -> transformers/trainer_pt_utils.py + modeling_reformer.py -> transformers/models/reformer/modeling_reformer.py + ``` + +## 获取预训练模型 +用户自行下载预训练模型至目录/checkpoint/reformer-crime-and-punishment下。Reformer预训练模型目录结构参考如下所示。 + ``` + ├── checkpoint + ├──reformer-crime-and-punishment + ├──config.json + ├──generation_config.json + ├──config.json + ├──generation_config_for_text_generation.json + ├──pytorch_model.bin + ├──rust_model.ot + ├──spiece.model + ├──tokenizer.json + ``` + +## 准备数据集 + +1. 获取数据集。 + + 用户自行获取数据集放至目录corpus下。 + + Reformer数据集目录结构参考如下所示。 + + ``` + ├── corpus + ├──train_corpus.txt + ├──test_corpus.txt + ``` + + > **说明:** + >该数据集的训练过程脚本只作为一种参考示例。 + +2. 获取词表。 + 用户自行获取词表,并放在目录lf_token下,目录结构如下。 + ``` + ├── lf_token + ├──config.json + ├──merges.txt + ├──vocab.json + ``` + + +# 开始训练 + +## 训练模型 + +1. 进入解压后的源码包根目录。 + + ``` + cd /${模型文件夹名称} + ``` + +2. 运行训练脚本。 + + 该模型支持单机单卡训练和单机8卡训练。 + + - 单机单卡训练 + + 启动单卡训练。 + + ``` + bash ./test/train_full_1p.sh # 单卡精度 + + bash ./test/train_performance_1p.sh # 单卡性能 + ``` + + - 单机8卡训练 + + 启动8卡训练。 + + ``` + bash ./test/train_full_8p.sh # 8卡精度 + + bash ./test/train_performance_8p.sh # 8卡性能 + ``` + + 训练完成后,权重文件保存在训练脚本指定的路径下,并输出模型训练精度和性能信息。 + +# 训练结果展示 + +**表 2** 训练结果展示表 +| NAME | Acc@1 | FPS | Epochs | Torch_Version | +|:-:|:-:|:-:|:-:|:-:| +| 1p-竞品A | - | 60.428 | 1 | 1.11 | +| 8p-竞品A | 0.1934 | 467.132 | 10 | 1.11 | +| 1p-NPU | - | 30.6212 | 1 | 1.11 | +| 8p-NPU | 0.1931 | 233.93 | 10 | 1.11 | + +# 公网地址说明 + +代码涉及公网地址参考 public_address_statement.md + +# 版本说明 + +## 变更 + +2023.11.15:首次发布。 + +## FAQ + +1. 在conda环境下运行报错:ImportError: libblas.so.3: cannot open shared object file: No such file or directory +``` +conda install openblas blas-devel + +conda install -c conda-forge blas +``` + +2. 
报错ImportError: /home/miniconda3/envs/reformer/lib/python3.7/site-packages/sklearn/__check_build/../../scikit_learn.libs/libgomp-d22c30c5.so.1.0.0: cannot allocate memory in static TLS block +``` +find / -name "libgomp-d22c30c5.so.1.0.0" + +export LD_PRELOAD=$LD_PRELOAD:/home/miniconda3/envs/reformer/lib/python3.7/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0 +``` diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/public_address_statement.md b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/public_address_statement.md index 46fc7fd544..ec486cc2d8 100644 --- a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/public_address_statement.md +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/public_address_statement.md @@ -1,8 +1,54 @@ | 类型 | 开源代码地址 | 文件名 | 公网IP地址/公网URL地址/域名/邮箱 | 用途说明 | |:------:|:-------------------------:|:---------------------------------------------------------------------------------------------:|:--------------------:|:-----------------:| -| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | | 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | https://huggingface.co/models?filter=fill-mask | checkpoints列表 | | 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | https://huggingface.co/datasets/ | 公共数据集地址 | | 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | https://huggingface.co/docs/datasets/loading_datasets.html. 
| 加载数据集的指导 | | 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.25.1/examples/pytorch/language-modeling/run_mlm.py | ./run_mlm.py | https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map | multiprocessing的map方法指导 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer_pt_utils.py | ./transformers_modify/trainer_pt_utils.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer_pt_utils.py | ./transformers_modify/trainer_pt_utils.py | https://github.com/numpy/numpy/blob/a47ecdea856986cd60eabbd53265c2ca5916ad5d/doc/source/user/basics.types.rst | numpy1.21.4不支持bf16的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer_pt_utils.py | ./transformers_modify/trainer_pt_utils.py | https://github.com/pytorch/pytorch/issues/16266 | pytorch存在的issue | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://huggingface.co/docs/transformers/model_doc/auto | 适合训练的模型列表 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://github.com/huggingface/peft | 使用peft库适配器的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://www.github.com/nvidia/apex | 安装APEX的教程 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://github.com/pytorch/torchdistx | 安装torchdistx的链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://github.com/intel/intel-extension-for-pytorch | 安装IPEX的教程 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://github.com/huggingface/transformers/pull/4659#issuecomment-643356021 | find_unused_parameters的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://github.com/pytorch/pytorch/issues/82963 | FSDP 错误的解决方法 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.create_study.html | optuna.create_study的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://docs.ray.io/en/latest/tune/api_docs/execution.html#tune-run | tune.run的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/trainer.py | ./transformers_modify/trainer.py | https://app.sigopt.com/docs/endpoints/experiments/create | sigopt的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | +| 开源代码引入 | 
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/pytorch/xla/pull/3609 | torchrun支撑文档 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/huggingface/optimum-neuron | 使用TrainiumTrainer的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://docs.python.org/3/library/argparse#module-argparse | 使用argparse的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/huggingface/transformers/tree/main/examples | 脚本参数说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://www.tensorflow.org/tensorboard | TensorBoard使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://huggingface.co/docs/safetensors | safetensor使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/intel/intel-extension-for-pytorch | IPEX安装说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://nvidia.github.io/apex/amp | Apex使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://huggingface.co/docs/transformers/performance#tf32 | TF32模式使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://www.wandb.com/ | wandb官网 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://www.mlflow.org/ | mlflow官网 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/facebookresearch/fairscale | FairScale使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py | xla选项说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/microsoft/deepspeed | deepspeed使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/huggingface/transformers/tree/main/examples | transformers训练示例脚本 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial | Ray说明文档 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | 
./transformers_modify/training_args.py | https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group | torch.distributed.init_process_group说明文档 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://pytorch.org/get-started/pytorch-2.0/ | torch.compile说明文档 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://pytorch.org/docs/2.0/generated/torch.compile.html?highlight=torch+compile#torch.compile | torch.compile最好的默认配置 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://nvidia.github.io/apex/amp.html | AMP optimization level说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/huggingface/transformers/issues/10628 | 扩展output_dir的说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/huggingface/safetensors! | Safetensors使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/training_args.py | ./transformers_modify/training_args.py | https://github.com/pytorch/pytorch/issues/82707 | 基于transformer的models的评价指标 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | http://www.apache.org/licenses/LICENSE-2.0 | apache.org/licenses引用的公网来源说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://huggingface.co/models?filter=reformer | reformer模型列表 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://arxiv.org/pdf/1509.02897.pdf | Locality-Sensitive Hashing论文 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://arxiv.org/pdf/2001.04451.pdf | reformer模型论文 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0 | RevNet模型应用 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py | reformer代码灵感来源 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://github.com/pytorch/pytorch/pull/5617 | truncated_normal使用说明 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/reformer/modeling_reformer.py | ./transformers_modify/modeling_reformer.py | https://pytorch.org/docs/stable/nn.html#torch.nn.Module | torch.nn文档 | diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/requirements.txt 
b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/requirements.txt index 2883377b70..4d0cc6fcc5 100644 --- a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/requirements.txt +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/requirements.txt @@ -1,6 +1,8 @@ -datasets==2.13.2 -deepspeed==0.9.2 -evaluate==0.4.1 -numpy==1.21.6 -setuptools==68.2.2 +datasets +evaluate +numpy +sympy +scikit-learn +setuptools +accelerate >= 0.12.0 transformers==4.28.1 diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/run_mlm.py b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/run_mlm.py index 06a3c76696..10522d0b53 100644 --- a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/run_mlm.py +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/run_mlm.py @@ -1,6 +1,7 @@ #!/usr/bin/env python # coding=utf-8 # Copyright 2020 The HuggingFace Team All rights reserved. +# Copyright 2023 Huawei Technologies Co., Ltd # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -21,6 +22,10 @@ https://huggingface.co/models?filter=fill-mask """ # You can also adapt this script on your own masked language modeling task. Pointers for this are left as comments. +import torch +import torch_npu +from torch_npu.contrib import transfer_to_npu + import logging import math import os @@ -141,6 +146,13 @@ class DataTrainingArguments: default=None, metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."}, ) + debug_file: Optional[str] = field( + default=None, + metadata={"help": "An optional input debug data file to evaluate the perplexity on (a text file)."}, + ) + debug_mode: bool = field( + default=False, + ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} ) @@ -304,10 +316,10 @@ def main(): else: data_files = {} if data_args.train_file is not None: - data_files["train"] = data_args.train_file + data_files["train"] = data_args.train_file if not data_args.debug_mode else data_args.debug_file extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: - data_files["validation"] = data_args.validation_file + data_files["validation"] = data_args.validation_file if not data_args.debug_mode else data_args.debug_file extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" @@ -359,6 +371,9 @@ def main(): logger.info(f"Overriding config: {model_args.config_overrides}") config.update_from_string(model_args.config_overrides) logger.info(f"New config: {config}") + + config.is_decoder = False + config.axial_pos_embds = False tokenizer_kwargs = { "cache_dir": model_args.cache_dir, @@ -602,11 +617,6 @@ def main(): else: kwargs["dataset"] = data_args.dataset_name - if training_args.push_to_hub: - trainer.push_to_hub(**kwargs) - else: - trainer.create_model_card(**kwargs) - def _mp_fn(index): # For xla_spawn (TPUs) diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/env_npu.sh b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/env_npu.sh new file mode 100644 index 0000000000..4b217451c6 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/env_npu.sh @@ -0,0 +1,47 @@ +#!/bin/bash +CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info' + +if [ -f $CANN_INSTALL_PATH_CONF ]; then + CANN_INSTALL_PATH=$(cat $CANN_INSTALL_PATH_CONF | grep Install_Path | cut -d "=" -f 2) +else + CANN_INSTALL_PATH="/usr/local/Ascend" +fi + +if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then + source 
${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh +else + source ${CANN_INSTALL_PATH}/nnae/set_env.sh +fi +#设置device侧日志登记为error +msnpureport -g error -d 0 +msnpureport -g error -d 1 +msnpureport -g error -d 2 +msnpureport -g error -d 3 +msnpureport -g error -d 4 +msnpureport -g error -d 5 +msnpureport -g error -d 6 +msnpureport -g error -d 7 + +#将Host日志输出到串口,0-关闭/1-开启 +export ASCEND_SLOG_PRINT_TO_STDOUT=0 +#设置默认日志级别,0-debug/1-info/2-warning/3-error +export ASCEND_GLOBAL_LOG_LEVEL=3 +#设置Event日志开启标志,0-关闭/1-开启 +export ASCEND_GLOBAL_EVENT_ENABLE=0 +#设置是否开启taskque,0-关闭/1-开启 +export TASK_QUEUE_ENABLE=1 +#设置是否开启PTCopy,0-关闭/1-开启 +export PTCOPY_ENABLE=1 +#设置是否开启combined标志,0-关闭/1-开启 +export COMBINED_ENABLE=1 +#设置特殊场景是否需要重新编译,不需要修改 +export DYNAMIC_OP="ADD#MUL" +#HCCL白名单开关,1-关闭/0-开启 +export HCCL_WHITELIST_DISABLE=1 +#HCCL默认超时时间120s较少,修改为1800s对齐PyTorch默认设置 +export HCCL_CONNECT_TIMEOUT=1800 +export HCCL_EXEC_TIMEOUT=1200 +export HCCL_IF_IP=$(hostname -I |awk '{print $1}') + +ulimit -SHn 512000 + diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_1p.sh b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_1p.sh new file mode 100644 index 0000000000..e0fbbbc8c5 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_1p.sh @@ -0,0 +1,132 @@ +#!/bin/bash + +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +export RANK_SIZE=1 + +#基础参数,需要模型审视修改 +#网络名称,同目录名称 +Network="ReFormer_for_PyTorch" +#训练epoch +train_epochs=10 +#训练batch_size +batch_size=16 +#学习率 +learning_rate=5e-5 + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + elif [[ $para == --batch_size* ]];then + batch_size=`echo ${para#*=}` + fi +done + +###############指定训练脚本执行路径############### +# cd到与test文件夹同层级目录下执行脚本,提高兼容性;test_path_dir为包含test文件夹的路径 +cur_path=`pwd` +cur_path_last_diename=${cur_path##*/} +if [ x"${cur_path_last_diename}" == x"test" ];then + test_path_dir=${cur_path} + cd .. 
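+    # the working directory is now the source package root (where run_mlm.py, corpus/ and checkpoint/ live);
+    # record it in cur_path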
+ cur_path=`pwd` +else + test_path_dir=${cur_path}/test +fi + +#################创建日志输出目录,不需要修改################# +ASCEND_DEVICE_ID=0 +if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then + rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID} + mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID +else + mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID +fi + +#################启动训练脚本################# + +# 非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${test_path_dir}/env_npu.sh +fi + +TOKENIZERS_PARALLELISM=false +OMP_NUM_THREADS=1 +WANDB_DISABLED=true +NUM_GPUS_PER_WORKER=1 +MASTER_PORT=12345 + +#训练开始时间,不需要修改 +start_time=$(date +%s) + +python -m torch.distributed.launch --nproc_per_node $NUM_GPUS_PER_WORKER --master_port $MASTER_PORT ./run_mlm.py \ + --model_name_or_path ./checkpoint/reformer-crime-and-punishment \ + --tokenizer_name ./lf_token/ \ + --train_file ./corpus/train_corpus.txt \ + --validation_file ./corpus/test_corpus.txt \ + --output_dir ./checkpoint/output \ + --num_train_epochs $train_epochs \ + --save_steps 50000 \ + --per_device_train_batch_size $batch_size \ + --learning_rate $learning_rate \ + --max_seq_length 5120 \ + --pad_to_max_length \ + --line_by_line True \ + --do_train \ + --do_eval \ + --overwrite_output_dir \ + --optim adamw_torch_npu_fused \ + --dataloader_num_workers 16 \ + --dataloader_pin_memory \ + --ddp_find_unused_parameters=False > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +e2e_time=$(( $end_time - $start_time )) + +##################获取训练数据################ +#结果打印,不需要修改 +echo "------------------ Final result ------------------" +#输出性能FPS,需要模型审视修改 +FPS=`grep -a 'time cost:' ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F 'time cost:' '{print $2}'|awk 'NR>10'|awk '{sum+=$1} END {print RANK_SIZE*batch_size*NR/sum}' RANK_SIZE=$RANK_SIZE batch_size=$batch_size` +#打印,不需要修改 +echo "Final Performance images/sec : $FPS" + +#输出训练精度,需要模型审视修改 +train_accuracy=`grep -a 'eval_accuracy' ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F '=' '{print $2}'` +#打印,不需要修改 +echo "Final Train Accuracy : ${train_accuracy}" +echo "E2E Training Duration sec : $e2e_time" + +#性能看护结果汇总 +#训练用例信息,不需要修改 +BatchSize=${batch_size} +DeviceType=`uname -m` +CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'perf' + +#获取性能数据,不需要修改 +#吞吐量 +ActualFPS=${FPS} + +#从train_$ASCEND_DEVICE_ID.log提取Loss到train_${CaseName}_loss.txt中,需要模型审视修改 +grep -a 'train_loss' ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_$ASCEND_DEVICE_ID.log|awk -F '=' '{print $2}' > ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt +#最后一个迭代loss值,不需要修改 +ActualLoss=`awk 'END {print}' ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt` + +#关键信息打印到${CaseName}.log中,不需要修改 +echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualLoss = 
${ActualLoss}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_8p.sh b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_8p.sh new file mode 100644 index 0000000000..bf091bf03d --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_full_8p.sh @@ -0,0 +1,136 @@ +#!/bin/bash + +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +export RANK_SIZE=8 + +#基础参数,需要模型审视修改 +#网络名称,同目录名称 +Network="ReFormer_for_PyTorch" +#训练epoch +train_epochs=10 +#训练batch_size +batch_size=16 +#学习率 +learning_rate=5e-5 + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + elif [[ $para == --batch_size* ]];then + batch_size=`echo ${para#*=}` + fi +done + +###############指定训练脚本执行路径############### +# cd到与test文件夹同层级目录下执行脚本,提高兼容性;test_path_dir为包含test文件夹的路径 +cur_path=`pwd` +cur_path_last_diename=${cur_path##*/} +if [ x"${cur_path_last_diename}" == x"test" ];then + test_path_dir=${cur_path} + cd .. + cur_path=`pwd` +else + test_path_dir=${cur_path}/test +fi + +#################创建日志输出目录,不需要修改################# +ASCEND_DEVICE_ID=0 +if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then + rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID} + mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID +else + mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID +fi + +#################启动训练脚本################# + +# 非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${test_path_dir}/env_npu.sh +fi + +TOKENIZERS_PARALLELISM=false +OMP_NUM_THREADS=1 +WANDB_DISABLED=true +NUM_GPUS_PER_WORKER=8 +MASTER_PORT=12345 + +#训练开始时间,不需要修改 +start_time=$(date +%s) + +KERNEL_NUM=$(($(nproc))) +PID_START=0 +PID_END=$((KERNEL_NUM-1)) + +taskset -c $PID_START-$PID_END python -m torch.distributed.launch --nproc_per_node $NUM_GPUS_PER_WORKER --master_port $MASTER_PORT ./run_mlm.py \ + --model_name_or_path ./checkpoint/reformer-crime-and-punishment \ + --tokenizer_name ./lf_token/ \ + --train_file ./corpus/train_corpus.txt \ + --validation_file ./corpus/test_corpus.txt \ + --output_dir ./checkpoint/output \ + --num_train_epochs $train_epochs \ + --save_steps 50000 \ + --per_device_train_batch_size $batch_size \ + --learning_rate $learning_rate \ + --max_seq_length 5120 \ + --pad_to_max_length \ + --line_by_line True \ + --do_train \ + --do_eval \ + --overwrite_output_dir \ + --optim adamw_torch_npu_fused \ + --dataloader_num_workers 16 \ + --dataloader_pin_memory \ + --ddp_find_unused_parameters=False > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +e2e_time=$(( $end_time - $start_time )) + +##################获取训练数据################ +#结果打印,不需要修改 +echo "------------------ Final result ------------------" +#输出性能FPS,需要模型审视修改 +FPS=`grep -a 'time cost:' ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F 'time cost:' '{print $2}'|awk 'NR>10'|awk '{sum+=$1} END {print RANK_SIZE*batch_size*NR/sum}' RANK_SIZE=$RANK_SIZE batch_size=$batch_size` +#打印,不需要修改 +echo "Final Performance images/sec : $FPS" + +#输出训练精度,需要模型审视修改 +train_accuracy=`grep -a 'eval_accuracy' ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F '=' '{print $2}'` +#打印,不需要修改 +echo "Final Train Accuracy : ${train_accuracy}" 
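>">
+# e2e_time is measured around the whole launch above, i.e. training plus the --do_eval evaluation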
+echo "E2E Training Duration sec : $e2e_time" + +#性能看护结果汇总 +#训练用例信息,不需要修改 +BatchSize=${batch_size} +DeviceType=`uname -m` +CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'perf' + +#获取性能数据,不需要修改 +#吞吐量 +ActualFPS=${FPS} + +#从train_$ASCEND_DEVICE_ID.log提取Loss到train_${CaseName}_loss.txt中,需要模型审视修改 +grep -a 'train_loss' ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_$ASCEND_DEVICE_ID.log|awk -F '=' '{print $2}' > ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt +#最后一个迭代loss值,不需要修改 +ActualLoss=`awk 'END {print}' ${test_path_dir}/output/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt` + +#关键信息打印到${CaseName}.log中,不需要修改 +echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualLoss = ${ActualLoss}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log +echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_1p.sh b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_1p.sh new file mode 100644 index 0000000000..622cadeb96 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_1p.sh @@ -0,0 +1,127 @@ +#!/bin/bash + +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +export RANK_SIZE=1 + +#基础参数,需要模型审视修改 +#网络名称,同目录名称 +Network="ReFormer_for_PyTorch" +#训练epoch +train_epochs=1 +#训练batch_size +batch_size=16 +#学习率 +learning_rate=5e-5 + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + elif [[ $para == --batch_size* ]];then + batch_size=`echo ${para#*=}` + fi +done + +###############指定训练脚本执行路径############### +# cd到与test文件夹同层级目录下执行脚本,提高兼容性;test_path_dir为包含test文件夹的路径 +cur_path=`pwd` +cur_path_last_diename=${cur_path##*/} +if [ x"${cur_path_last_diename}" == x"test" ];then + test_path_dir=${cur_path} + cd .. 
+ cur_path=`pwd` +else + test_path_dir=${cur_path}/test +fi + +#################创建日志输出目录,不需要修改################# +ASCEND_DEVICE_ID=0 +if [ -d ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID} ];then + rm -rf ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID} + mkdir -p ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID +else + mkdir -p ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID +fi + +#################启动训练脚本################# + +# 非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${test_path_dir}/env_npu.sh +fi + +TOKENIZERS_PARALLELISM=false +OMP_NUM_THREADS=1 +WANDB_DISABLED=true +NUM_GPUS_PER_WORKER=1 +MASTER_PORT=12345 + +#训练开始时间,不需要修改 +start_time=$(date +%s) + +python -m torch.distributed.launch --nproc_per_node $NUM_GPUS_PER_WORKER --master_port $MASTER_PORT ./run_mlm.py \ + --model_name_or_path ./checkpoint/reformer-crime-and-punishment \ + --tokenizer_name ./lf_token/ \ + --train_file ./corpus/train_corpus.txt \ + --validation_file ./corpus/test_corpus.txt \ + --output_dir ./checkpoint/output \ + --num_train_epochs $train_epochs \ + --save_steps 50000 \ + --per_device_train_batch_size $batch_size \ + --learning_rate $learning_rate \ + --max_seq_length 5120 \ + --pad_to_max_length \ + --line_by_line True \ + --do_train \ + --overwrite_output_dir \ + --optim adamw_torch_npu_fused \ + --dataloader_num_workers 16 \ + --dataloader_pin_memory \ + --ddp_find_unused_parameters=False > ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +e2e_time=$(( $end_time - $start_time )) + +##################获取训练数据################ +#结果打印,不需要修改 +echo "------------------ Final result ------------------" +#输出性能FPS,需要模型审视修改 +FPS=`grep -a 'time cost:' ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F 'time cost:' '{print $2}'|awk 'NR>10'|awk '{sum+=$1} END {print RANK_SIZE*batch_size*NR/sum}' RANK_SIZE=$RANK_SIZE batch_size=$batch_size` +#打印,不需要修改 +echo "Final Performance images/sec : $FPS" + +echo "E2E Training Duration sec : $e2e_time" + +#性能看护结果汇总 +#训练用例信息,不需要修改 +BatchSize=${batch_size} +DeviceType=`uname -m` +CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'perf' + +#获取性能数据,不需要修改 +#吞吐量 +ActualFPS=${FPS} + +#从train_$ASCEND_DEVICE_ID.log提取Loss到train_${CaseName}_loss.txt中,需要模型审视修改 +grep -a 'train_loss' ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_$ASCEND_DEVICE_ID.log|awk -F '=' '{print $2}' > ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt +#最后一个迭代loss值,不需要修改 +ActualLoss=`awk 'END {print}' ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt` + +#关键信息打印到${CaseName}.log中,不需要修改 +echo "Network = ${Network}" > ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "CaseName = ${CaseName}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualLoss = ${ActualLoss}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "E2ETrainingTime = ${e2e_time}" >> 
${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_8p.sh b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_8p.sh new file mode 100644 index 0000000000..543fb25351 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/test/train_performance_8p.sh @@ -0,0 +1,131 @@ +#!/bin/bash + +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +export RANK_SIZE=8 + +#基础参数,需要模型审视修改 +#网络名称,同目录名称 +Network="ReFormer_for_PyTorch" +#训练epoch +train_epochs=1 +#训练batch_size +batch_size=16 +#学习率 +learning_rate=5e-5 + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + elif [[ $para == --batch_size* ]];then + batch_size=`echo ${para#*=}` + fi +done + +###############指定训练脚本执行路径############### +# cd到与test文件夹同层级目录下执行脚本,提高兼容性;test_path_dir为包含test文件夹的路径 +cur_path=`pwd` +cur_path_last_diename=${cur_path##*/} +if [ x"${cur_path_last_diename}" == x"test" ];then + test_path_dir=${cur_path} + cd .. + cur_path=`pwd` +else + test_path_dir=${cur_path}/test +fi + +#################创建日志输出目录,不需要修改################# +ASCEND_DEVICE_ID=0 +if [ -d ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID} ];then + rm -rf ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID} + mkdir -p ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID +else + mkdir -p ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID +fi + +#################启动训练脚本################# + +# 非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${test_path_dir}/env_npu.sh +fi + +TOKENIZERS_PARALLELISM=false +OMP_NUM_THREADS=1 +WANDB_DISABLED=true +NUM_GPUS_PER_WORKER=8 +MASTER_PORT=12345 + +#训练开始时间,不需要修改 +start_time=$(date +%s) + +KERNEL_NUM=$(($(nproc))) +PID_START=0 +PID_END=$((KERNEL_NUM-1)) + +taskset -c $PID_START-$PID_END python -m torch.distributed.launch --nproc_per_node $NUM_GPUS_PER_WORKER --master_port $MASTER_PORT ./run_mlm.py \ + --model_name_or_path ./checkpoint/reformer-crime-and-punishment \ + --tokenizer_name ./lf_token/ \ + --train_file ./corpus/train_corpus.txt \ + --validation_file ./corpus/test_corpus.txt \ + --output_dir ./checkpoint/output \ + --num_train_epochs $train_epochs \ + --save_steps 50000 \ + --per_device_train_batch_size $batch_size \ + --learning_rate $learning_rate \ + --max_seq_length 5120 \ + --pad_to_max_length \ + --line_by_line True \ + --do_train \ + --overwrite_output_dir \ + --optim adamw_torch_npu_fused \ + --dataloader_num_workers 16 \ + --dataloader_pin_memory \ + --ddp_find_unused_parameters=False > ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +e2e_time=$(( $end_time - $start_time )) + +##################获取训练数据################ +#结果打印,不需要修改 +echo "------------------ Final result ------------------" +#输出性能FPS,需要模型审视修改 +FPS=`grep -a 'time cost:' ${test_path_dir}/output_perf/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log|awk -F 'time cost:' '{print $2}'|awk 'NR>10'|awk '{sum+=$1} END {print RANK_SIZE*batch_size*NR/sum}' RANK_SIZE=$RANK_SIZE batch_size=$batch_size` +#打印,不需要修改 +echo "Final Performance images/sec : $FPS" + +echo "E2E Training Duration sec : $e2e_time" + +#性能看护结果汇总 +#训练用例信息,不需要修改 +BatchSize=${batch_size} +DeviceType=`uname -m` +CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'perf' + +#获取性能数据,不需要修改 +#吞吐量 +ActualFPS=${FPS} + +#从train_$ASCEND_DEVICE_ID.log提取Loss到train_${CaseName}_loss.txt中,需要模型审视修改 
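+# (the value after '=' on every 'train_loss' line is kept; the last entry becomes ActualLoss below)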
+grep -a 'train_loss' ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_$ASCEND_DEVICE_ID.log|awk -F '=' '{print $2}' > ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt +#最后一个迭代loss值,不需要修改 +ActualLoss=`awk 'END {print}' ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt` + +#关键信息打印到${CaseName}.log中,不需要修改 +echo "Network = ${Network}" > ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "CaseName = ${CaseName}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualLoss = ${ActualLoss}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log +echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output_perf/$ASCEND_DEVICE_ID/${CaseName}.log diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/modeling_reformer.py b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/modeling_reformer.py new file mode 100644 index 0000000000..26db1e736b --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/modeling_reformer.py @@ -0,0 +1,2685 @@ +# coding=utf-8 +# Copyright 2020 The Trax Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# Copyright 2023 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
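+#
+# NOTE: as described in the README of this model directory, this file is intended to replace
+# transformers/models/reformer/modeling_reformer.py in a transformers==4.28.1 installation and
+# carries the Ascend NPU adaption of the Reformer model.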
+"""PyTorch REFORMER model.""" + +import sys +from collections import namedtuple +from dataclasses import dataclass +from functools import reduce +from operator import mul +from typing import List, Optional, Tuple, Union + +import numpy as np +import torch +from torch import nn +from torch.autograd.function import Function +from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss + +from ...activations import ACT2FN +from ...modeling_outputs import CausalLMOutput, MaskedLMOutput, QuestionAnsweringModelOutput, SequenceClassifierOutput +from ...modeling_utils import PreTrainedModel +from ...pytorch_utils import apply_chunking_to_forward +from ...utils import ( + DUMMY_INPUTS, + DUMMY_MASK, + ModelOutput, + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_reformer import ReformerConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "google/reformer-crime-and-punishment" +_CONFIG_FOR_DOC = "ReformerConfig" + +REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "google/reformer-crime-and-punishment", + "google/reformer-enwik8", + # See all Reformer models at https://huggingface.co/models?filter=reformer +] + + +# Define named tuples for nn.Modules here +LSHSelfAttentionOutput = namedtuple("LSHSelfAttentionOutput", ["hidden_states", "attention_probs", "buckets"]) +LocalSelfAttentionOutput = namedtuple("LocalSelfAttentionOutput", ["hidden_states", "attention_probs"]) +AttentionOutput = namedtuple("AttentionOutput", ["hidden_states", "attention_probs", "buckets"]) +ReformerOutput = namedtuple("ReformerOutput", ["hidden_states", "attn_output", "attention_probs", "buckets"]) +ReformerBackwardOutput = namedtuple( + "ReformerBackwardOutput", ["attn_output", "hidden_states", "grad_attn_output", "grad_hidden_states"] +) +ReformerEncoderOutput = namedtuple( + "ReformerEncoderOutput", + ["hidden_states", "all_hidden_states", "all_attentions", "past_buckets_states"], +) + + +def _stable_argsort(vector, dim): + # this function scales the vector so that torch.argsort is stable. + # torch.argsort is not stable on its own + scale_offset = torch.arange(vector.shape[dim], device=vector.device).view(1, 1, -1) + scale_offset = scale_offset.expand(vector.shape) + scaled_vector = vector.shape[dim] * vector + (scale_offset % vector.shape[dim]) + scaled_vector = scaled_vector.float() + return torch.argsort(scaled_vector, dim=dim) + + +def _get_least_common_mult_chunk_len(config): + attn_types = config.attn_layers + attn_types_set = set(attn_types) + if len(attn_types_set) == 1 and attn_types[0] == "lsh": + return config.lsh_attn_chunk_length + elif len(attn_types_set) == 1 and attn_types[0] == "local": + return config.local_attn_chunk_length + elif len(attn_types_set) == 2 and attn_types_set == {"lsh", "local"}: + return np.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length) + else: + raise NotImplementedError( + f"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {config.attn_layers}. Select " + "attn layer types from ['lsh', 'local'] only." 
+ ) + + +def _get_min_chunk_len(config): + attn_types = config.attn_layers + attn_types_set = set(attn_types) + if len(attn_types_set) == 1 and attn_types[0] == "lsh": + return config.lsh_attn_chunk_length + elif len(attn_types_set) == 1 and attn_types[0] == "local": + return config.local_attn_chunk_length + elif len(attn_types_set) == 2 and attn_types_set == {"lsh", "local"}: + return min(config.lsh_attn_chunk_length, config.local_attn_chunk_length) + else: + raise NotImplementedError( + f"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {config.attn_layers}. Select " + "attn layer types from ['lsh', 'local'] only." + ) + + +class AxialPositionEmbeddings(nn.Module): + """ + Constructs axial position embeddings. Useful for very long input sequences to save memory and time. + """ + + def __init__(self, config): + super().__init__() + self.axial_pos_shape = config.axial_pos_shape + self.axial_pos_embds_dim = config.axial_pos_embds_dim + self.dropout = config.hidden_dropout_prob + + self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config) + self.weights = nn.ParameterList() + + if sum(self.axial_pos_embds_dim) != config.hidden_size: + raise ValueError( + f"Make sure that config.axial_pos_embds factors: {self.axial_pos_embds_dim} sum to " + f"config.hidden_size: {config.hidden_size}" + ) + + # create weights + for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim): + # create expanded shapes + ax_shape = [1] * len(self.axial_pos_shape) + ax_shape[axis] = self.axial_pos_shape[axis] + ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,) + + # create tensor and init + self.weights.append(nn.Parameter(torch.ones(ax_shape, dtype=torch.float32))) + + def forward(self, position_ids): + # broadcast weights to correct shape + batch_size = position_ids.shape[0] + sequence_length = position_ids.shape[1] + + broadcasted_weights = [ + weight.expand((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights + ] + + if self.training is True: + if reduce(mul, self.axial_pos_shape) != sequence_length: + raise ValueError( + f"If training, make sure that config.axial_pos_shape factors: {self.axial_pos_shape} multiply to " + f"sequence length. Got prod({self.axial_pos_shape}) != sequence_length: {sequence_length}. " + f"You might want to consider padding your sequence length to {reduce(mul, self.axial_pos_shape)} " + "or changing config.axial_pos_shape." + ) + + if self.dropout > 0: + weights = torch.cat(broadcasted_weights, dim=-1) + # permute weights so that 2D correctly drops dims 1 and 2 + transposed_weights = weights.transpose(2, 1) + # drop entire matrix of last two dims (prev dims 1 and 2) + dropped_transposed_weights = nn.functional.dropout2d( + transposed_weights, p=self.dropout, training=self.training + ) + dropped_weights = dropped_transposed_weights.transpose(2, 1) + + position_encodings = torch.reshape(dropped_weights, (batch_size, sequence_length, -1)) + + else: + position_encodings = torch.cat( + [torch.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights], + dim=-1, + ) + + else: + if reduce(mul, self.axial_pos_shape) < sequence_length: + raise ValueError( + f"Make sure that config.axial_pos_shape factors: {self.axial_pos_shape} multiply at least to " + f"max(sequence_length, least_common_mult_chunk_length): max({sequence_length}, " + f"{self.least_common_mult_chunk_length})." 
+ ) + + # compute how many columns are needed + max_position_id = position_ids.max().item() + required_pos_encodings_columns = -(-(max_position_id + 1) // self.axial_pos_shape[1]) + + # cut to columns that are needed + position_encodings = torch.cat( + [weight[:, :required_pos_encodings_columns] for weight in broadcasted_weights], dim=-1 + ) + position_encodings = torch.reshape(position_encodings, (batch_size, -1, position_encodings.shape[-1])) + + # select correct position encodings + position_encodings = torch.cat( + [ + torch.index_select(position_encodings[i], 0, position_ids[i]).unsqueeze(0) + for i in range(batch_size) + ], + dim=0, + ) + + return position_encodings + + +class PositionEmbeddings(nn.Module): + """Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`.""" + + def __init__(self, config): + super().__init__() + self.dropout = config.hidden_dropout_prob + self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size) + + def forward(self, position_ids): + position_embeddings = self.embedding(position_ids) + position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training) + return position_embeddings + + +class ReformerEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.max_position_embeddings = config.max_position_embeddings + self.dropout = config.hidden_dropout_prob + + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size) + self.position_embeddings = ( + AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config) + ) + + def forward(self, input_ids=None, position_ids=None, inputs_embeds=None, start_idx_pos_encodings=0): + if input_ids is not None: + input_shape = input_ids.size() + device = input_ids.device + else: + input_shape = inputs_embeds.size()[:-1] + device = inputs_embeds.device + + seq_length = input_shape[1] + if position_ids is None: + position_ids = torch.arange( + start_idx_pos_encodings, start_idx_pos_encodings + seq_length, dtype=torch.long, device=device + ) + position_ids = position_ids.unsqueeze(0).expand(input_shape) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + + if position_ids.shape[-1] > self.max_position_embeddings: + raise ValueError( + f"Sequence Length: {position_ids.shape[-1]} has to be less or equal than " + f"config.max_position_embeddings {self.max_position_embeddings}." + ) + + # dropout + embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training) + + # add positional embeddings + position_embeddings = self.position_embeddings(position_ids) + embeddings = embeddings + position_embeddings + return embeddings + + +class EfficientAttentionMixin: + """ + A few utilities for nn.Modules in Reformer, to be used as a mixin. + """ + + def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after): + """ + Used to implement attention between consecutive chunks. + + Args: + vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...] + num_chunks_before: chunks before current chunk to include in attention + num_chunks_after: chunks after current chunk to include in attention + + Returns: + tensor of shape [num_chunks, N * chunk_length, ...], where N = (1 + num_chunks_before + num_chunks_after). 
+ """ + if num_chunks_before == 0 and num_chunks_after == 0: + return vectors + + slices = [] + for i in range(-num_chunks_before, num_chunks_after + 1): + if i == 0: + slices.append(vectors) + else: + slices.append(torch.cat([vectors[:, :, i:, ...], vectors[:, :, :i, ...]], dim=2)) + return torch.cat(slices, dim=3) + + def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size): + """ + splits hidden_size dim into attn_head_size and num_attn_heads + """ + new_x_shape = x.size()[:-1] + (num_attn_heads, attn_head_size) + x = x.view(*new_x_shape) + return x.transpose(2, 1) + + def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size): + """ + merges attn_head_size dim and num_attn_heads dim into hidden_size + """ + x = x.permute(0, 2, 1, 3) + return torch.reshape(x, (x.size()[0], -1, num_attn_heads * attn_head_size)) + + def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None): + """ + splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims + """ + batch_size = vectors.shape[0] + split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2) + + if len(vectors.shape) == 4: + return torch.reshape(vectors, split_dim_shape + (attn_head_size,)) + elif len(vectors.shape) == 3: + return torch.reshape(vectors, split_dim_shape) + else: + raise ValueError(f"Input vector rank should be one of [3, 4], but is: {len(vectors.shape)}") + + +class LSHSelfAttention(nn.Module, EfficientAttentionMixin): + def __init__(self, config): + super().__init__() + self.config = config + + self.chunk_length = config.lsh_attn_chunk_length + self.num_hashes = config.num_hashes + self.num_buckets = config.num_buckets + self.num_chunks_before = config.lsh_num_chunks_before + self.num_chunks_after = config.lsh_num_chunks_after + self.hash_seed = config.hash_seed + self.is_decoder = config.is_decoder + self.max_position_embeddings = config.max_position_embeddings + + self.dropout = config.lsh_attention_probs_dropout_prob + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = config.attention_head_size + self.all_head_size = self.num_attention_heads * self.attention_head_size + self.hidden_size = config.hidden_size + + # projection matrices + self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False) + self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False) + + # save mask value here. Need fp32 and fp16 mask values + self.register_buffer("self_mask_value_float16", torch.tensor(-1e3)) + self.register_buffer("self_mask_value_float32", torch.tensor(-1e5)) + self.register_buffer("mask_value_float16", torch.tensor(-1e4)) + self.register_buffer("mask_value_float32", torch.tensor(-1e9)) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + num_hashes=None, + buckets=None, + past_buckets_states=None, + use_cache=False, + output_attentions=False, + **kwargs, + ): + sequence_length = hidden_states.shape[1] + batch_size = hidden_states.shape[0] + + # num hashes can optionally be overwritten by user + num_hashes = num_hashes if num_hashes is not None else self.num_hashes + + do_cached_attention = use_cache and past_buckets_states[1] is not None + + # check if cache shall be used and that hidden states are already cached + if do_cached_attention: + assert sequence_length == 1, ( + "At the moment, auto-regressive language generation is only possible one word at a time. 
Make sure" + f" that input sequence length {sequence_length} equals 1, when `past_buckets_states` is passed." + ) + past_buckets = past_buckets_states[0] + past_states = past_buckets_states[1] + + # get query vector + query_vectors = self.query_key(hidden_states) + query_vectors = self._split_hidden_size_dim( + query_vectors, self.num_attention_heads, self.attention_head_size + ) + + if past_buckets is not None: + key_value_hidden_states, sorted_bucket_idx, buckets = self._get_relevant_hid_states_and_buckets( + query_vectors=query_vectors, + attention_mask=attention_mask, + num_hashes=num_hashes, + hidden_states=hidden_states, + past_states=past_states, + past_buckets=past_buckets, + ) + + query_key_vectors = self._query_per_attn_head(key_value_hidden_states) + value_vectors = self._value_per_attn_head(key_value_hidden_states) + + # split key & value vectors by num hashes to apply + # self attention on each separately + query_key_vectors = self._split_seq_length_dim_to( + query_key_vectors, + num_hashes, + -1, + self.num_attention_heads, + self.attention_head_size, + ) + value_vectors = self._split_seq_length_dim_to( + value_vectors, + num_hashes, + -1, + self.num_attention_heads, + self.attention_head_size, + ) + # repeat query vectors across hash dimension + query_vectors = query_vectors.unsqueeze(2).repeat(1, 1, num_hashes, 1, 1) + else: + key_value_hidden_states = torch.cat([past_states, hidden_states], dim=1) + + query_key_vectors = self.query_key(key_value_hidden_states) + value_vectors = self.value(key_value_hidden_states) + + else: + # project hidden_states to query_key and value + query_vectors = None + query_key_vectors = self.query_key(hidden_states) + value_vectors = self.value(hidden_states) + + # if query key is not already split + if not do_cached_attention or past_buckets is None: + query_key_vectors = self._split_hidden_size_dim( + query_key_vectors, self.num_attention_heads, self.attention_head_size + ) + value_vectors = self._split_hidden_size_dim( + value_vectors, self.num_attention_heads, self.attention_head_size + ) + + # cache buckets for next incremental decoding + if do_cached_attention and past_buckets is None and key_value_hidden_states.shape[1] >= self.chunk_length: + buckets = self._hash_vectors(query_key_vectors, num_hashes, attention_mask) + + # free memory + del hidden_states + + assert ( + query_key_vectors.shape[-1] == self.attention_head_size + ), f"last dim of query_key_vectors is {query_key_vectors.shape[-1]} but should be {self.attention_head_size}." + assert ( + value_vectors.shape[-1] == self.attention_head_size + ), f"last dim of value_vectors is {value_vectors.shape[-1]} but should be {self.attention_head_size}." 
+ + do_standard_self_attention = (sequence_length <= self.chunk_length) or ( + use_cache and past_buckets_states[1] is not None + ) + # LSH attention only makes sense if chunked attention should be performed + if not do_standard_self_attention: + # set `num_buckets` on the fly, recommended way to do it + if self.num_buckets is None: + self._set_num_buckets(sequence_length) + + # use cached buckets for backprop only + if buckets is None: + # hash query key vectors into buckets + buckets = self._hash_vectors(query_key_vectors, num_hashes, attention_mask) + else: + # make sure buckets has correct shape for LSH attention + buckets = buckets.view(batch_size, self.num_attention_heads, num_hashes * sequence_length) + + assert ( + int(buckets.shape[-1]) == num_hashes * sequence_length + ), f"last dim of buckets is {buckets.shape[-1]}, but should be {num_hashes * sequence_length}" + + sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx( + sequence_length, buckets, num_hashes + ) + + # make sure bucket idx is not longer then sequence length + sorted_bucket_idx_per_hash = sorted_bucket_idx % sequence_length + + # cluster query key value vectors according to hashed buckets + query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx_per_hash, num_hashes) + value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx_per_hash, num_hashes) + query_key_vectors = self._split_seq_length_dim_to( + query_key_vectors, + -1, + self.chunk_length, + self.num_attention_heads, + self.attention_head_size, + ) + value_vectors = self._split_seq_length_dim_to( + value_vectors, + -1, + self.chunk_length, + self.num_attention_heads, + self.attention_head_size, + ) + + if self.chunk_length is None: + assert self.num_chunks_before == 0 and self.num_chunks_after == 0, ( + "If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and" + " `config.num_chunks_before` are set to 0." 
+ ) + elif do_cached_attention and past_buckets is not None: + # use max sequence length + sorted_bucket_idx_per_hash = sorted_bucket_idx + else: + # get sequence length indices + sorted_bucket_idx_per_hash = torch.arange(sequence_length, device=query_key_vectors.device).repeat( + batch_size, self.num_attention_heads, 1 + ) + + # scale key vectors + sqrt_num = np.sqrt(self.attention_head_size) + key_vectors = self._len_and_dim_norm(query_key_vectors, sqrt_num) + + # set query_vectors to query key vectors if LSH self attention + query_vectors = query_vectors if query_vectors is not None else query_key_vectors + + # free memory + del query_key_vectors + + # get attention probs + out_vectors, logits, attention_probs = self._attend( + query_vectors=query_vectors, + key_vectors=key_vectors, + value_vectors=value_vectors, + sorted_bucket_idx_per_hash=sorted_bucket_idx_per_hash, + attention_mask=attention_mask, + head_mask=head_mask, + do_standard_self_attention=do_standard_self_attention, + do_cached_attention=do_cached_attention, + ) + + # free memory + del key_vectors, value_vectors + + # re-order out_vectors and logits + if not do_standard_self_attention: + # sort clusters back to correct ordering + out_vectors, logits = ReverseSort.apply(out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx) + + if not do_standard_self_attention or (do_cached_attention and past_buckets is not None): + # sum up all hash rounds + if num_hashes > 1: + out_vectors = self._split_seq_length_dim_to( + out_vectors, + num_hashes, + sequence_length, + self.num_attention_heads, + self.attention_head_size, + ) + logits = self._split_seq_length_dim_to( + logits, + num_hashes, + sequence_length, + self.num_attention_heads, + self.attention_head_size, + ).unsqueeze(-1) + + probs_vectors = torch.exp(logits - torch.logsumexp(logits, dim=2, keepdim=True)) + out_vectors = torch.sum(out_vectors * probs_vectors, dim=2) + # free memory + del probs_vectors + + # free memory + del logits + + assert out_vectors.shape == ( + batch_size, + self.num_attention_heads, + sequence_length, + self.attention_head_size, + ), ( + "out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length," + " config.attention_head_size]`." 
+ ) + + out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size) + + if output_attentions is False: + attention_probs = () + + if buckets is not None: + buckets = buckets.view(batch_size, self.num_attention_heads, num_hashes, -1) + + return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets) + + def _query_per_attn_head(self, hidden_states): + per_head_query_key = self.query_key.weight.reshape( + self.num_attention_heads, self.attention_head_size, self.hidden_size + ).transpose(-2, -1) + # only relevant for inference and no bias => we can use einsum here + query_key_vectors = torch.einsum("balh,ahr->balr", hidden_states, per_head_query_key) + return query_key_vectors + + def _value_per_attn_head(self, hidden_states): + per_head_value = self.value.weight.reshape( + self.num_attention_heads, self.attention_head_size, self.hidden_size + ).transpose(-2, -1) + # only relevant for inference and no bias => we can use einsum here + value_vectors = torch.einsum("balh,ahr->balr", hidden_states, per_head_value) + return value_vectors + + def _hash_vectors(self, vectors, num_hashes, attention_mask, increase_num_buckets=False): + batch_size = vectors.shape[0] + + # See https://arxiv.org/pdf/1509.02897.pdf + # We sample a different random rotation for each round of hashing to + # decrease the probability of hash misses. + if isinstance(self.num_buckets, int): + assert ( + self.num_buckets % 2 == 0 + ), f"There should be an even number of buckets, but `self.num_buckets`: {self.num_buckets}" + rotation_size = self.num_buckets + num_buckets = self.num_buckets + else: + # Factorize the hash if self.num_buckets is a list or tuple + rotation_size, num_buckets = 0, 1 + for bucket_factor in self.num_buckets: + assert ( + bucket_factor % 2 == 0 + ), f"The number of buckets should be even, but `num_bucket`: {bucket_factor}" + rotation_size = rotation_size + bucket_factor + num_buckets = num_buckets * bucket_factor + + # remove gradient + vectors = vectors.detach() + + if self.hash_seed is not None: + # for determinism + torch.manual_seed(self.hash_seed) + + rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2) + # create a random self.attention_head_size x num_hashes x num_buckets/2 + random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype) + # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2 + rotated_vectors = torch.einsum("bmtd,mdhr->bmhtr", vectors, random_rotations) + + if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1: + rotated_vectors = torch.cat([rotated_vectors, -rotated_vectors], dim=-1) + buckets = torch.argmax(rotated_vectors, dim=-1) + else: + # Get the buckets for them and combine. 
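+                # Each factor hashes into `bucket_factor` buckets via an argmax over [rotation, -rotation];
+                # the per-factor results are combined positionally, with `cur_product` acting as the place
+                # value, which is equivalent to hashing into the product of all factors at once.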
+ buckets, cur_sum, cur_product = None, 0, 1 + for bucket_factor in self.num_buckets: + rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)] + cur_sum = cur_sum + bucket_factor // 2 + rotated_vectors_factor = torch.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1) + if buckets is None: + buckets = torch.argmax(rotated_vectors_factor, dim=-1) + else: + buckets = buckets + (cur_product * torch.argmax(rotated_vectors_factor, dim=-1)) + + cur_product = cur_product * bucket_factor + + if attention_mask is not None and (attention_mask.sum().item() < batch_size * attention_mask.shape[-1]): + # add an extra bucket for padding tokens only + num_buckets = num_buckets + 1 + # assign padding tokens extra bucket + buckets_mask = attention_mask.to(torch.bool)[:, None, None, :].expand(buckets.shape) + buckets = torch.where( + buckets_mask, buckets, torch.tensor(num_buckets - 1, dtype=torch.long, device=buckets.device) + ) + elif increase_num_buckets: + num_buckets = num_buckets + 1 + + # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len). + # Next we add offsets so that bucket numbers from different hashing rounds don't overlap. + offsets = torch.arange(num_hashes, device=vectors.device) + offsets = (offsets * num_buckets).view((1, 1, -1, 1)) + + # expand to batch size and num attention heads + offsets = offsets.expand((batch_size, self.num_attention_heads) + offsets.shape[-2:]) + offset_buckets = (buckets + offsets).flatten(start_dim=2, end_dim=3) + + return offset_buckets + + def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes): + # no gradients are needed + with torch.no_grad(): + # hash-based sort + sorted_bucket_idx = _stable_argsort(buckets, dim=-1) + + # create simple indices to scatter to, to have undo sort + indices = ( + torch.arange(sorted_bucket_idx.shape[-1], device=buckets.device) + .view(1, 1, -1) + .expand(sorted_bucket_idx.shape) + ) + + # get undo sort + undo_sorted_bucket_idx = sorted_bucket_idx.new(*sorted_bucket_idx.size()) + undo_sorted_bucket_idx.scatter_(-1, sorted_bucket_idx, indices) + + return sorted_bucket_idx, undo_sorted_bucket_idx + + def _set_num_buckets(self, sequence_length): + # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper + num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1 + # make sure buckets are power of 2 + num_buckets = 2**num_buckets_pow_2 + + # factorize `num_buckets` if `num_buckets` becomes too large + num_buckets_limit = 2 * max( + int((self.max_position_embeddings // self.chunk_length) ** (0.5)), + self.chunk_length, + ) + if num_buckets > num_buckets_limit: + num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)] + + logger.warning(f"config.num_buckets is not set. 
Setting config.num_buckets to {num_buckets}...") + + # set num buckets in config to be properly saved + self.config.num_buckets = num_buckets + self.num_buckets = num_buckets + + def _attend( + self, + query_vectors, + key_vectors, + value_vectors, + sorted_bucket_idx_per_hash, + attention_mask, + head_mask, + do_standard_self_attention, + do_cached_attention, + ): + # look at previous and following chunks if chunked attention + if not do_standard_self_attention: + key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after) + value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after) + + # get logits and dots + # (BS, NumAttn, NumHash x NumChunk, Chunk_L x Hidden),(BS, NumAttn, NumHash x NumChunk, Chunk_L * (Num_bef + Num_aft + 1) x Hidden) -> (BS, NumAttn, NumHash x NumChunk, Chunk_L, Chunk_L * (1 + Num_bef + Num_aft)) + query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2)) + + # free memory + del query_vectors, key_vectors + + # if chunked attention split bucket idxs to query and key + if not do_standard_self_attention: + query_bucket_idx = self._split_seq_length_dim_to( + sorted_bucket_idx_per_hash, -1, self.chunk_length, self.num_attention_heads + ) + key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after) + elif do_cached_attention and query_key_dots.ndim > 4: + key_value_bucket_idx = sorted_bucket_idx_per_hash + query_bucket_idx = ( + key_value_bucket_idx.new_ones(key_value_bucket_idx.shape[:-1] + (1,)) * key_value_bucket_idx.max() + ) + elif do_cached_attention and query_key_dots.ndim <= 4: + query_bucket_idx = (query_key_dots.shape[-1] - 1) * torch.ones_like(query_key_dots)[:, :, :, -1] + key_value_bucket_idx = torch.arange( + query_key_dots.shape[-1], dtype=torch.long, device=query_key_dots.device + )[None, None, :].expand(query_bucket_idx.shape[:2] + (-1,)) + else: + query_bucket_idx = key_value_bucket_idx = sorted_bucket_idx_per_hash + + # get correct mask values depending on precision + if query_key_dots.dtype == torch.float16: + self_mask_value = self.self_mask_value_float16.half() + mask_value = self.mask_value_float16.half() + else: + self_mask_value = self.self_mask_value_float32 + mask_value = self.mask_value_float32 + + if not do_cached_attention: + mask = self._compute_attn_mask( + query_bucket_idx, + key_value_bucket_idx, + attention_mask, + query_key_dots.shape, + do_standard_self_attention, + ) + + if mask is not None: + query_key_dots = torch.where(mask, query_key_dots, mask_value) + + # free memory + del mask + + # Self mask is ALWAYS applied. + # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf): + # " While attention to the future is not allowed, typical implementations of the + # Transformer do allow a position to attend to itself. + # Such behavior is undesirable in a shared-QK formulation because the dot-product + # of a query vector with itself will almost always be greater than the dot product of a + # query vector with a vector at another position. We therefore modify the masking + # to forbid a token from attending to itself, except in situations + # where a token has no other valid attention targets (e.g. 
the first token in a sequence) " + + self_mask = torch.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2)).to( + query_bucket_idx.device + ) + + # apply self_mask + query_key_dots = torch.where(self_mask, query_key_dots, self_mask_value) + + # free memory + del self_mask + + logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True) + # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]` + attention_probs = torch.exp(query_key_dots - logits) + + # free memory + del query_key_dots + + # dropout + attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + # attend values + out_vectors = torch.matmul(attention_probs, value_vectors) + + # free memory + del value_vectors + + # merge chunk length + if out_vectors.ndim > 4: + logits = logits.flatten(start_dim=2, end_dim=3).squeeze(-1) + out_vectors = out_vectors.flatten(start_dim=2, end_dim=3) + + return out_vectors, logits, attention_probs + + def _compute_attn_mask( + self, query_indices, key_indices, attention_mask, query_key_dot_shape, do_standard_self_attention + ): + # attention mask for LSH + if attention_mask is not None: + # if chunked attention, the attention mask has to correspond to LSH order + attention_mask = attention_mask.to(torch.bool)[:, None, :] + if not do_standard_self_attention: + # expand attn_mask to fit with key_value_bucket_idx shape + attention_mask = attention_mask[:, None, :] + attention_mask = attention_mask.expand(query_indices.shape[:-1] + (-1,)) + # extract attention mask from LSH sorted key_indices + attention_mask = torch.gather(attention_mask, -1, key_indices) + + attention_mask = attention_mask.unsqueeze(-2).expand(query_key_dot_shape) + + # Causal mask + if self.is_decoder is True: + causal_mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device) + + # add attention mask if not None + if attention_mask is not None: + attention_mask = causal_mask * attention_mask + else: + attention_mask = causal_mask + + return attention_mask + + def _get_relevant_hid_states_and_buckets( + self, query_vectors, attention_mask, num_hashes, hidden_states, past_states, past_buckets + ): + # concat hidden states + hidden_states = torch.cat([past_states, hidden_states], dim=1) + + # batch_size hidden + batch_size = hidden_states.shape[0] + sequence_length = hidden_states.shape[1] + + # check if cached buckets include pad bucket + max_bucket = self.num_buckets if isinstance(self.num_buckets, int) else reduce(mul, self.num_buckets) + + # if pad bucket was cached => need to increase num buckets for caching + increase_num_buckets = past_buckets.max() > num_hashes * max_bucket - 1 + + # retrieve query buckets + query_buckets = self._hash_vectors( + query_vectors, num_hashes, attention_mask, increase_num_buckets=increase_num_buckets + ) + + # concat buckets + concat_buckets = torch.cat([past_buckets, query_buckets.unsqueeze(-1)], dim=-1) + + # hash-based sort + bucket_idx = _stable_argsort(concat_buckets, dim=-1) + + # bucket_idx has shape: BatchSize x NumAttnHeads x NumHashes x SequenceLength + assert bucket_idx.shape == ( + batch_size, + self.num_attention_heads, + num_hashes, + sequence_length, + ), ( + f"bucket_idx should have shape {(batch_size, self.num_attention_heads, num_hashes, sequence_length)}, but" + f" has shape 
{bucket_idx.shape}." + ) + + # find indices of new bucket indices + relevant_bucket_idx = (bucket_idx == (bucket_idx.shape[-1] - 1)).nonzero() + + # expand relevant bucket indices to its chunks + relevant_bucket_idx_chunk = self._expand_to_indices_in_relevant_chunk(relevant_bucket_idx, sequence_length) + relevant_bucket_idx_chunk = bucket_idx[tuple(relevant_bucket_idx_chunk.transpose(0, 1))] + + # adapt bucket_idx for batch and hidden states for index select + offset = torch.arange(relevant_bucket_idx_chunk.shape[-1], device=hidden_states.device, dtype=torch.long) + bucket_idx_batch_offset = sequence_length * ( + batch_size * torch.div(offset, relevant_bucket_idx_chunk.shape[-1], rounding_mode="floor") + ) + + # add batch offset + relevant_bucket_idx_chunk_all_batch = relevant_bucket_idx_chunk + bucket_idx_batch_offset + hidden_states = hidden_states.reshape((-1, self.hidden_size)) + + # select all relevant hidden states + relevant_hidden_states = hidden_states.index_select(0, relevant_bucket_idx_chunk_all_batch) + + # reshape hidden states and bucket_idx to correct output + relevant_hidden_states = relevant_hidden_states.reshape( + batch_size, self.num_attention_heads, -1, self.hidden_size + ) + relevant_bucket_idx_chunk = relevant_bucket_idx_chunk.reshape( + batch_size, self.num_attention_heads, num_hashes, -1 + ) + + assert ( + relevant_hidden_states.shape[2] + == (self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length * num_hashes + ), ( + "There should be" + f" {(self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length * num_hashes} `hidden_states`," + f" there are {relevant_hidden_states.shape[2]} `hidden_states`." + ) + + assert ( + relevant_bucket_idx_chunk.shape[-1] + == (self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length + ), ( + "There should be" + f" {(self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length} `hidden_states`, there are" + f" {relevant_bucket_idx_chunk.shape[-1]} `bucket_idx`." 
+ ) + + return relevant_hidden_states, relevant_bucket_idx_chunk, query_buckets + + def _expand_to_indices_in_relevant_chunk(self, indices, sequence_length): + # get relevant indices of where chunk starts and its size + start_indices_chunk = ((indices[:, -1] // self.chunk_length) - self.num_chunks_before) * self.chunk_length + total_chunk_size = self.chunk_length * (1 + self.num_chunks_before + self.num_chunks_after) + + # expand start indices and add correct chunk offset via arange + expanded_start_indices = start_indices_chunk.unsqueeze(-1).expand(indices.shape[0], total_chunk_size) + chunk_sequence_indices = expanded_start_indices + torch.arange( + total_chunk_size, device=indices.device, dtype=torch.long + ).unsqueeze(0).expand(indices.shape[0], total_chunk_size) + + # make sure that circular logic holds via % seq len + chunk_sequence_indices = chunk_sequence_indices.flatten() % sequence_length + + # expand indices and set indices correctly + indices = indices.unsqueeze(1).expand((indices.shape[0], total_chunk_size, -1)).flatten(0, 1).clone() + indices[:, -1] = chunk_sequence_indices + + return indices + + def _len_and_dim_norm(self, vectors, sqrt_num): + """ + length and attention head size dim normalization + """ + vectors = self._len_norm(vectors) + vectors = vectors / sqrt_num + return vectors + + def _len_norm(self, x, epsilon=1e-6): + """ + length normalization + """ + variance = torch.mean(x**2, -1, keepdim=True) + norm_x = x * torch.rsqrt(variance + epsilon) + return norm_x + + def _gather_by_expansion(self, vectors, idxs, num_hashes): + """ + expand dims of idxs and vectors for all hashes and gather + """ + b, size, dim = idxs.shape + dim1_idx = torch.arange(b * size, device=vectors.device) * dim + flat_idxs = idxs.reshape(-1, dim) + flat_idxs = (flat_idxs + dim1_idx[:, None]).flatten() + vectors = vectors.reshape(-1, vectors.shape[-1]) + vectors = vectors.index_select(0, flat_idxs) + return vectors.reshape(b, size, dim, -1) + + +class ReverseSort(Function): + """ + After chunked attention is applied which sorted clusters, original ordering has to be restored. Since customized + backward function is used for Reformer, the gradients of the output vectors have to be explicitly sorted here. 
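+    The forward pass gathers with `undo_sorted_bucket_idx` to restore the original token order, and the backward
+    pass gathers the incoming gradients with `sorted_bucket_idx` to map them back to the sorted (chunked) order.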
+ """ + + @staticmethod + def forward(ctx, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx): + # save sorted_bucket_idx for backprop + with torch.no_grad(): + ctx.sorted_bucket_idx = sorted_bucket_idx + + # undo sort to have correct order for next layer + expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).expand(out_vectors.shape) + out_vectors = torch.gather(out_vectors, 2, expanded_undo_sort_indices) + logits = torch.gather(logits, 2, undo_sorted_bucket_idx) + return out_vectors, logits + + @staticmethod + def backward(ctx, grad_out_vectors, grad_logits): + # get parameters saved in ctx + sorted_bucket_idx = ctx.sorted_bucket_idx + + expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).expand(grad_out_vectors.shape) + # reverse sort of forward + grad_out_vectors = torch.gather(grad_out_vectors, 2, expanded_sort_indices) + grad_logits = torch.gather(grad_logits, 2, sorted_bucket_idx) + + # return grad and `None` fillers for last 2 forward args + return grad_out_vectors, grad_logits, None, None + + +class LocalSelfAttention(nn.Module, EfficientAttentionMixin): + def __init__(self, config): + super().__init__() + + self.num_attention_heads = config.num_attention_heads + self.chunk_length = config.local_attn_chunk_length + self.num_chunks_before = config.local_num_chunks_before + self.num_chunks_after = config.local_num_chunks_after + self.is_decoder = config.is_decoder + self.pad_token_id = config.pad_token_id + + self.attention_head_size = config.attention_head_size + self.all_head_size = self.num_attention_heads * self.attention_head_size + self.hidden_size = config.hidden_size + + # projection matrices + self.query = nn.Linear(self.hidden_size, self.all_head_size, bias=False) + self.key = nn.Linear(self.hidden_size, self.all_head_size, bias=False) + self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False) + + self.dropout = config.local_attention_probs_dropout_prob + + # save mask value here + self.register_buffer("mask_value_float16", torch.tensor(-1e4)) + self.register_buffer("mask_value_float32", torch.tensor(-1e9)) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + past_buckets_states=None, + use_cache=False, + output_attentions=False, + **kwargs, + ): + sequence_length = hidden_states.shape[1] + batch_size = hidden_states.shape[0] + + # check if cache shall be used and that hidden states are already cached + if use_cache and past_buckets_states[1] is not None: + assert past_buckets_states[0] is None, ( + "LocalSelfAttention should not make use of `buckets`. There seems to be an error when caching" + " hidden_states_and_buckets." 
+ ) + key_value_hidden_states = self._retrieve_relevant_hidden_states( + past_buckets_states[1], self.chunk_length, self.num_chunks_before + ) + key_value_hidden_states = torch.cat([key_value_hidden_states, hidden_states], dim=1) + + # only query vector for last token + query_vectors = self.query(hidden_states) + # compute key and value for relevant chunk + key_vectors = self.key(key_value_hidden_states) + value_vectors = self.value(key_value_hidden_states) + + # free memory + del key_value_hidden_states + else: + # project hidden_states to query, key and value + query_vectors = self.query(hidden_states) + key_vectors = self.key(hidden_states) + value_vectors = self.value(hidden_states) + + # split last dim into `config.num_attention_heads` and `config.attention_head_size` + query_vectors = self._split_hidden_size_dim(query_vectors, self.num_attention_heads, self.attention_head_size) + key_vectors = self._split_hidden_size_dim(key_vectors, self.num_attention_heads, self.attention_head_size) + value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size) + + assert ( + query_vectors.shape[-1] == self.attention_head_size + ), f"last dim of query_key_vectors is {query_vectors.shape[-1]} but should be {self.attention_head_size}." + assert ( + key_vectors.shape[-1] == self.attention_head_size + ), f"last dim of query_key_vectors is {key_vectors.shape[-1]} but should be {self.attention_head_size}." + assert ( + value_vectors.shape[-1] == self.attention_head_size + ), f"last dim of query_key_vectors is {value_vectors.shape[-1]} but should be {self.attention_head_size}." + + if self.chunk_length is None: + assert self.num_chunks_before == 0 and self.num_chunks_after == 0, ( + "If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and" + " `config.num_chunks_before` are set to 0." 
+ ) + + # normalize key vectors + key_vectors = key_vectors / np.sqrt(self.attention_head_size) + + # get sequence length indices + indices = torch.arange(sequence_length, device=query_vectors.device).repeat( + batch_size, self.num_attention_heads, 1 + ) + + # if one should do normal n^2 self-attention + do_standard_self_attention = sequence_length <= self.chunk_length + + # if input should be chunked + if not do_standard_self_attention: + # chunk vectors + # B x Num_Attn_Head x Seq_Len // chunk_len x chunk_len x attn_head_size + query_vectors = self._split_seq_length_dim_to( + query_vectors, + -1, + self.chunk_length, + self.num_attention_heads, + self.attention_head_size, + ) + key_vectors = self._split_seq_length_dim_to( + key_vectors, + -1, + self.chunk_length, + self.num_attention_heads, + self.attention_head_size, + ) + value_vectors = self._split_seq_length_dim_to( + value_vectors, + -1, + self.chunk_length, + self.num_attention_heads, + self.attention_head_size, + ) + + # chunk indices + query_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads) + key_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads) + + # append chunks before and after + key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after) + value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after) + key_indices = self._look_adjacent(key_indices, self.num_chunks_before, self.num_chunks_after) + else: + query_indices = key_indices = indices + + # query-key matmul: QK^T + query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2)) + + # free memory + del query_vectors, key_vectors + + mask = self._compute_attn_mask( + query_indices, key_indices, attention_mask, query_key_dots.shape, do_standard_self_attention + ) + + if mask is not None: + # get mask tensor depending on half precision or not + if query_key_dots.dtype == torch.float16: + mask_value = self.mask_value_float16.half() + else: + mask_value = self.mask_value_float32 + + query_key_dots = torch.where(mask, query_key_dots, mask_value) + + # free memory + del mask + + # softmax + logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True) + attention_probs = torch.exp(query_key_dots - logits) + + # free memory + del logits + + # dropout + attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + # attend values + out_vectors = torch.matmul(attention_probs, value_vectors) + + # free memory + del value_vectors + + # merge chunk length + if not do_standard_self_attention: + out_vectors = out_vectors.flatten(start_dim=2, end_dim=3) + + assert out_vectors.shape == ( + batch_size, + self.num_attention_heads, + sequence_length, + self.attention_head_size, + ) + + out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size) + + if output_attentions is False: + attention_probs = () + + return LocalSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs) + + def _compute_attn_mask( + self, query_indices, key_indices, attention_mask, query_key_dots_shape, do_standard_self_attention + ): + # chunk attention mask and look before and after + if attention_mask is not None: + attention_mask = attention_mask.to(torch.bool)[:, None, :] + + if not do_standard_self_attention: + 
attention_mask = self._split_seq_length_dim_to(attention_mask, -1, self.chunk_length, 1) + attention_mask = self._look_adjacent(attention_mask, self.num_chunks_before, self.num_chunks_after) + # create attn_mask + attention_mask = attention_mask.unsqueeze(-2).expand(query_key_dots_shape) + + # Causal mask + if self.is_decoder is True: + causal_mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device) + + # add attention mask if not None + if attention_mask is not None: + attention_mask = causal_mask * attention_mask + else: + attention_mask = causal_mask + + return attention_mask + + @staticmethod + def _retrieve_relevant_hidden_states(previous_hidden_states, chunk_length, num_chunks_before): + start_position = ((previous_hidden_states.shape[1] // chunk_length) - num_chunks_before) * chunk_length + return previous_hidden_states[:, start_position:] + + +class ReformerSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + all_head_size = config.num_attention_heads * config.attention_head_size + self.dropout = config.hidden_dropout_prob + + self.dense = nn.Linear(all_head_size, config.hidden_size, bias=False) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + return hidden_states + + +class ReformerAttention(nn.Module): + def __init__(self, config, layer_id=0): + super().__init__() + self.layer_id = layer_id + self.attn_layers = config.attn_layers + + self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + if len(set(self.attn_layers)) == 1 and self.attn_layers[0] == "lsh": + self.self_attention = LSHSelfAttention(config) + elif len(set(self.attn_layers)) == 1 and self.attn_layers[0] == "local": + self.self_attention = LocalSelfAttention(config) + elif len(set(self.attn_layers)) == 2 and set(self.attn_layers) == {"lsh", "local"}: + # get correct attn layers + if self.attn_layers[self.layer_id] == "lsh": + self.self_attention = LSHSelfAttention(config) + else: + self.self_attention = LocalSelfAttention(config) + else: + raise NotImplementedError( + f"Only attn layer types 'lsh' and 'local' exist, but got `config.attn_layers`: {self.attn_layers}. " + "Select attn layer types from ['lsh', 'local'] only." 
+ ) + self.output = ReformerSelfOutput(config) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + num_hashes=None, + past_buckets_states=None, + use_cache=False, + orig_sequence_length=None, + output_attentions=False, + buckets=None, + ): + hidden_states = self.layer_norm(hidden_states) + + # make sure cached hidden states is set to None for backward pass + if past_buckets_states is not None: + past_buckets_states_layer = past_buckets_states[self.layer_id] + else: + past_buckets_states_layer = None + + # use cached buckets for backprob if buckets not None for LSHSelfAttention + self_attention_outputs = self.self_attention( + hidden_states=hidden_states, + head_mask=head_mask, + attention_mask=attention_mask, + num_hashes=num_hashes, + past_buckets_states=past_buckets_states_layer, + use_cache=use_cache, + output_attentions=output_attentions, + buckets=buckets, + ) + + # add buckets if necessary + if hasattr(self_attention_outputs, "buckets"): + buckets = self_attention_outputs.buckets + else: + buckets = None + + # cache hidden states for future use + if use_cache: + if past_buckets_states[self.layer_id][0] is None: + # padded input should not be cached + past_buckets = ( + buckets[:, :, :, :orig_sequence_length] + if (buckets is not None and orig_sequence_length > 1) + else buckets + ) + else: + past_buckets = torch.cat([past_buckets_states[self.layer_id][0], buckets], dim=-1) + + if past_buckets_states[self.layer_id][1] is None: + # padded input should not be cached + past_states = hidden_states[:, :orig_sequence_length] + else: + past_states = torch.cat([past_buckets_states[self.layer_id][1], hidden_states], dim=1) + + past_buckets_states[self.layer_id] = (past_buckets, past_states) + # compute attention feed forward output + attention_output = self.output(self_attention_outputs.hidden_states) + + return AttentionOutput( + hidden_states=attention_output, + attention_probs=self_attention_outputs.attention_probs, + buckets=buckets, + ) + + +class ReformerFeedForwardDense(nn.Module): + def __init__(self, config): + super().__init__() + self.dropout = config.hidden_dropout_prob + + if isinstance(config.hidden_act, str): + self.act_fn = ACT2FN[config.hidden_act] + else: + self.act_fn = config.hidden_act + + self.dense = nn.Linear(config.hidden_size, config.feed_forward_size) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = self.act_fn(hidden_states) + return hidden_states + + +class ReformerFeedForwardOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dropout = config.hidden_dropout_prob + + self.dense = nn.Linear(config.feed_forward_size, config.hidden_size) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + return hidden_states + + +class ChunkReformerFeedForward(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + + self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dense = ReformerFeedForwardDense(config) + self.output = ReformerFeedForwardOutput(config) + + def forward(self, attention_output): + return apply_chunking_to_forward( + self.forward_chunk, + self.chunk_size_feed_forward, + self.seq_len_dim, + attention_output, + ) + + 
def forward_chunk(self, hidden_states): + hidden_states = self.layer_norm(hidden_states) + hidden_states = self.dense(hidden_states) + return self.output(hidden_states) + + +class ReformerLayer(nn.Module): + def __init__(self, config, layer_id=0): + super().__init__() + self.attention = ReformerAttention(config, layer_id) + # dropout requires to have the same + # seed for forward and backward pass + self.attention_seed = None + self.feed_forward_seed = None + + self.feed_forward = ChunkReformerFeedForward(config) + + def _init_attention_seed(self): + """ + This function sets a new seed for the attention layer to make dropout deterministic for both forward calls: 1 + normal forward call and 1 forward call in backward to recalculate activations. + """ + + # randomize seeds + # use cuda generator if available + if hasattr(torch.cuda, "default_generators") and len(torch.cuda.default_generators) > 0: + # GPU + device_idx = torch.cuda.current_device() + self.attention_seed = torch.cuda.default_generators[device_idx].seed() + else: + # CPU + self.attention_seed = int(torch.seed() % sys.maxsize) + + torch.manual_seed(self.attention_seed) + + def _init_feed_forward_seed(self): + """ + This function sets a new seed for the feed forward layer to make dropout deterministic for both forward calls: + 1 normal forward call and 1 forward call in backward to recalculate activations. + """ + # randomize seeds + # use cuda generator if available + if hasattr(torch.cuda, "default_generators") and len(torch.cuda.default_generators) > 0: + # GPU + device_idx = torch.cuda.current_device() + self.feed_forward_seed = torch.cuda.default_generators[device_idx].seed() + else: + # CPU + self.feed_forward_seed = int(torch.seed() % sys.maxsize) + + torch.manual_seed(self.feed_forward_seed) + + def forward( + self, + prev_attn_output, + hidden_states, + attention_mask=None, + head_mask=None, + num_hashes=None, + past_buckets_states=None, + use_cache=False, + orig_sequence_length=None, + output_attentions=False, + ): + with torch.no_grad(): + # every forward pass we sample a different seed + # for dropout and save for forward fn in backward pass + # to have correct dropout + if self.training: + self._init_attention_seed() + + attn_outputs = self.attention( + hidden_states=hidden_states, + head_mask=head_mask, + attention_mask=attention_mask, + num_hashes=num_hashes, + past_buckets_states=past_buckets_states, + use_cache=use_cache, + orig_sequence_length=orig_sequence_length, + output_attentions=output_attentions, + ) + attn_output = attn_outputs.hidden_states + + # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0) + # Y_1 = X_1 + f(X_2) + attn_output = prev_attn_output + attn_output + + # free memory + del prev_attn_output + + # every forward pass we sample a different seed + # for dropout and save seed for forward fn in backward + # to have correct dropout + if self.training: + self._init_feed_forward_seed() + # Y_2 = X_2 + g(Y_1) + hidden_states = hidden_states + self.feed_forward(attn_output) + + return ReformerOutput( + attn_output=attn_output, + hidden_states=hidden_states, + attention_probs=attn_outputs.attention_probs, + buckets=attn_outputs.buckets, + ) + + def backward_pass( + self, + next_attn_output, + hidden_states, + grad_attn_output, + grad_hidden_states, + attention_mask=None, + head_mask=None, + buckets=None, + ): + # Implements the backward pass for reversible ResNets. 
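+        # Activations are recomputed here rather than stored in the forward pass:
+        # X_2 = Y_2 - g(Y_1) and X_1 = Y_1 - f(X_2), replaying the dropout seeds saved during the
+        # forward pass so that f (attention) and g (feed forward) are reproduced exactly.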
+ # A good blog post on how this works can be found here: + # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0) + # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py + + assert self.training, ( + "If you want to train `ReformerModel` and its variations, make sure to use `model.train()` to put the" + " model into training mode." + ) + + with torch.enable_grad(): + next_attn_output.requires_grad = True + + # set seed to have correct dropout + torch.manual_seed(self.feed_forward_seed) + # g(Y_1) + res_hidden_states = self.feed_forward(next_attn_output) + res_hidden_states.backward(grad_hidden_states, retain_graph=True) + + with torch.no_grad(): + # X_2 = Y_2 - g(Y_1) + hidden_states = hidden_states - res_hidden_states + del res_hidden_states + + grad_attn_output = grad_attn_output + next_attn_output.grad + next_attn_output.grad = None + + with torch.enable_grad(): + hidden_states.requires_grad = True + + # set seed to have correct dropout + torch.manual_seed(self.attention_seed) + # f(X_2) + # use cached buckets for backprob if buckets not None for LSHSelfAttention + output = self.attention( + hidden_states=hidden_states, + head_mask=head_mask, + attention_mask=attention_mask, + buckets=buckets, + ).hidden_states + output.backward(grad_attn_output, retain_graph=True) + + with torch.no_grad(): + # X_1 = Y_1 - f(X_2) + attn_output = next_attn_output - output + del output, next_attn_output + + grad_hidden_states = grad_hidden_states + hidden_states.grad + hidden_states.grad = None + hidden_states = hidden_states.detach() + + return ReformerBackwardOutput( + attn_output=attn_output, + hidden_states=hidden_states, + grad_attn_output=grad_attn_output, + grad_hidden_states=grad_hidden_states, + ) + + +class _ReversibleFunction(Function): + """ + To prevent PyTorch from performing the usual backpropagation, a customized backward function is implemented here. + This way it is made sure that no memory expensive activations are saved during the forward pass. 
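+    Gradients are propagated in `backward` by walking the layers in reverse and letting each layer's
+    `backward_pass` reconstruct its inputs from its outputs.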
This function is + heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py + """ + + @staticmethod + def forward( + ctx, + hidden_states, + layers, + attention_mask, + head_mask, + num_hashes, + all_hidden_states, + all_attentions, + past_buckets_states, + use_cache, + orig_sequence_length, + output_hidden_states, + output_attentions, + ): + all_buckets = () + + # split duplicated tensor + hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1) + + for layer_id, (layer, layer_head_mask) in enumerate(zip(layers, head_mask)): + if output_hidden_states is True: + all_hidden_states.append(hidden_states) + + layer_outputs = layer( + prev_attn_output=attn_output, + hidden_states=hidden_states, + attention_mask=attention_mask, + head_mask=layer_head_mask, + num_hashes=num_hashes, + past_buckets_states=past_buckets_states, + use_cache=use_cache, + orig_sequence_length=orig_sequence_length, + output_attentions=output_attentions, + ) + + attn_output = layer_outputs.attn_output + hidden_states = layer_outputs.hidden_states + all_buckets = all_buckets + (layer_outputs.buckets,) + + if output_attentions: + all_attentions.append(layer_outputs.attention_probs) + + # Add last layer + if output_hidden_states is True: + all_hidden_states.append(hidden_states) + + # attach params to ctx for backward + ctx.save_for_backward(attn_output.detach(), hidden_states.detach()) + ctx.layers = layers + ctx.all_buckets = all_buckets + ctx.head_mask = head_mask + ctx.attention_mask = attention_mask + + # Concatenate 2 RevNet outputs + return torch.cat([attn_output, hidden_states], dim=-1) + + @staticmethod + def backward(ctx, grad_hidden_states): + grad_attn_output, grad_hidden_states = torch.chunk(grad_hidden_states, 2, dim=-1) + + # retrieve params from ctx for backward + attn_output, hidden_states = ctx.saved_tensors + + # create tuple + output = ReformerBackwardOutput( + attn_output=attn_output, + hidden_states=hidden_states, + grad_attn_output=grad_attn_output, + grad_hidden_states=grad_hidden_states, + ) + + # free memory + del grad_attn_output, grad_hidden_states, attn_output, hidden_states + + layers = ctx.layers + all_buckets = ctx.all_buckets + head_mask = ctx.head_mask + attention_mask = ctx.attention_mask + + for idx, layer in enumerate(layers[::-1]): + # pop last buckets from stack + buckets = all_buckets[-1] + all_buckets = all_buckets[:-1] + + # backprop + output = layer.backward_pass( + next_attn_output=output.attn_output, + hidden_states=output.hidden_states, + grad_attn_output=output.grad_attn_output, + grad_hidden_states=output.grad_hidden_states, + head_mask=head_mask[len(layers) - idx - 1], + attention_mask=attention_mask, + buckets=buckets, + ) + + assert all_buckets == (), "buckets have to be empty after backpropagation" + grad_hidden_states = torch.cat([output.grad_attn_output, output.grad_hidden_states], dim=-1) + + # num of return vars has to match num of forward() args + # return gradient for hidden_states arg and None for other args + return grad_hidden_states, None, None, None, None, None, None, None, None, None, None, None + + +class ReformerEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.dropout = config.hidden_dropout_prob + + self.layers = nn.ModuleList([ReformerLayer(config, i) for i in range(config.num_hidden_layers)]) + # Reformer is using Rev Nets, thus last layer outputs are concatenated and + # Layer Norm is done over 2 * hidden_size + self.layer_norm = nn.LayerNorm(2 * config.hidden_size, 
eps=config.layer_norm_eps) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + num_hashes=None, + past_buckets_states=None, + use_cache=False, + orig_sequence_length=None, + output_hidden_states=False, + output_attentions=False, + ): + # hidden_states and attention lists to be filled if wished + all_hidden_states = [] + all_attentions = [] + + # init cached hidden states if necessary + if past_buckets_states is None: + past_buckets_states = [((None), (None)) for i in range(len(self.layers))] + + # concat same tensor for reversible ResNet + hidden_states = torch.cat([hidden_states, hidden_states], dim=-1) + hidden_states = _ReversibleFunction.apply( + hidden_states, + self.layers, + attention_mask, + head_mask, + num_hashes, + all_hidden_states, + all_attentions, + past_buckets_states, + use_cache, + orig_sequence_length, + output_hidden_states, + output_attentions, + ) + + # Apply layer norm to concatenated hidden states + hidden_states = self.layer_norm(hidden_states) + + # Apply dropout + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + + return ReformerEncoderOutput( + hidden_states=hidden_states, + all_hidden_states=all_hidden_states, + all_attentions=all_attentions, + past_buckets_states=past_buckets_states, + ) + + +class ReformerOnlyLMHead(nn.Module): + def __init__(self, config): + super().__init__() + # Reformer is using Rev Nets, thus last layer outputs are concatenated and + # Layer Norm is done over 2 * hidden_size + self.seq_len_dim = 1 + self.chunk_size_lm_head = config.chunk_size_lm_head + self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False) + self.bias = nn.Parameter(torch.zeros(config.vocab_size)) + self.decoder.bias = self.bias + + def forward(self, hidden_states): + return apply_chunking_to_forward(self.forward_chunk, self.chunk_size_lm_head, self.seq_len_dim, hidden_states) + + def forward_chunk(self, hidden_states): + hidden_states = self.decoder(hidden_states) + return hidden_states + + def _tie_weights(self): + # To tie those two weights if they get disconnected (on TPU or when the bias is resized) + self.bias = self.decoder.bias + + +class ReformerPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = ReformerConfig + base_model_prefix = "reformer" + + @property + def dummy_inputs(self): + input_ids = torch.tensor(DUMMY_INPUTS) + input_mask = torch.tensor(DUMMY_MASK) + dummy_inputs = { + "input_ids": input_ids, + "attention_mask": input_mask, + } + return dummy_inputs + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, AxialPositionEmbeddings): + for weight in module.weights: + nn.init.normal_(weight, std=self.config.axial_norm_std) + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +@dataclass +class ReformerModelOutput(ModelOutput): + """ + Output type of [`ReformerModel`]. + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_predict, hidden_size)`): + Sequence of hidden-states at the last layer of the model. + + `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` + corresponds to `sequence_length`. + past_buckets_states (`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + List of `Tuple(torch.LongTensor, torch.FloatTensor` of length `config.n_layers`, with the first element + being the previous *buckets* of shape `(batch_size, num_heads, num_hashes, sequence_length)`) and the + second being the previous *hidden_states* of shape `(batch_size, sequence_length, hidden_size)`). + + Contains precomputed buckets and hidden-states that can be used (see `past_buckets_states` input) to speed + up sequential decoding. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings and one for the output of each layer) of + shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + last_hidden_state: torch.FloatTensor + past_buckets_states: Optional[List[Tuple[torch.LongTensor, torch.FloatTensor]]] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class ReformerModelWithLMHeadOutput(ModelOutput): + """ + Output type of [`ReformerModelWithLMHead`]. + + Args: + loss (`torch.FloatTensor` of shape *(1,)*, *optional*, returned when `labels` is provided) + Language modeling loss (for next-token prediction). 
+ logits (`torch.FloatTensor` of shape `(batch_size, num_predict, config.vocab_size)`): + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + + `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` + corresponds to `sequence_length`. + past_buckets_states (`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + List of `Tuple(torch.LongTensor, torch.FloatTensor` of length `config.n_layers`, with the first element + being the previous *buckets* of shape `(batch_size, num_heads, num_hashes, sequence_length)`) and the + second being the previous *hidden_states* of shape `(batch_size, sequence_length, hidden_size)`). + + Contains precomputed buckets and hidden-states that can be used (see `past_buckets_states` input) to speed + up sequential decoding. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + TTuple of `torch.FloatTensor` (one for the output of the embeddings and one for the output of each layer) + of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + past_buckets_states: Optional[List[Tuple[torch.LongTensor, torch.FloatTensor]]] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +REFORMER_START_DOCSTRING = r""" + Reformer was proposed in [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, + Łukasz Kaiser, Anselm Levskaya. + + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`ReformerConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +REFORMER_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. During training the input_ids sequence_length has to be + a multiple of the relevant model's chunk lengths (lsh's, local's or both). During evaluation, the indices + are automatically padded to be a multiple of the chunk length. + + Indices can be obtained using [`AutoTokenizer`]. 
See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. + num_hashes (`int`, *optional*): + The number of hashing rounds that should be performed during bucketing. Setting this argument overwrites + the default defined in `config.num_hashes`. + + For more information, see `num_hashes` in [`ReformerConfig`]. + past_buckets_states (`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, *optional*): + List of `Tuple(torch.LongTensor, torch.FloatTensor` of length `config.n_layers`, with the first element + being the previous *buckets* of shape `(batch_size, num_heads, num_hashes, sequence_length)`) and the + second being the previous *hidden_states* of shape `(batch_size, sequence_length, hidden_size)`). + + Contains precomputed hidden-states and buckets (only relevant for LSH Self-Attention). Can be used to speed + up sequential decoding. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Reformer Model transformer outputting raw hidden-stateswithout any specific head on top.", + REFORMER_START_DOCSTRING, +) +class ReformerModel(ReformerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.config = config + assert ( + self.config.num_hidden_layers > 0 + ), "`config.attn_layers` is empty. 
Select at least one attn layer form ['lsh', 'local']" + + self.embeddings = ReformerEmbeddings(config) + self.encoder = ReformerEncoder(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(REFORMER_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=ReformerModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + num_hashes: Optional[int] = None, + past_buckets_states: Optional[List[Tuple[torch.Tensor]]] = None, + use_cache: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ReformerModelOutput]: + use_cache = use_cache if use_cache is not None else self.config.use_cache + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() # noqa: F841 + device = input_ids.device + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] # noqa: F841 + device = inputs_embeds.device + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + assert ( + len(input_shape) == 2 + ), f"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {input_shape}" + + if past_buckets_states is not None: + assert not self.training, "`past_buckets_states` can only be used for inference, not for training`." + + # prepare head mask + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True) + + # original sequence length for padding + orig_sequence_length = input_shape[-1] + + # if needs padding + least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config) + min_chunk_length = _get_min_chunk_len(self.config) + + must_pad_to_match_chunk_length = ( + input_shape[-1] % least_common_mult_chunk_length != 0 + and input_shape[-1] > min_chunk_length + and past_buckets_states is None + ) + + if must_pad_to_match_chunk_length: + padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length + + if self.training is True: + raise ValueError( + f"If training, sequence length {input_shape[-1]} has to be a multiple of least common multiple " + f"chunk_length {least_common_mult_chunk_length}. Please consider padding the input to a length " + f"of {input_shape[-1] + padding_length}." 
+ ) + + # pad input + input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length( + input_ids, + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + position_ids=position_ids, + input_shape=input_shape, + padding_length=padding_length, + padded_seq_length=least_common_mult_chunk_length, + device=device, + ) + + # start index for position encoding depends on incremental decoding + if past_buckets_states is not None: + start_idx_pos_encodings = past_buckets_states[0][1].shape[1] + else: + start_idx_pos_encodings = 0 + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + start_idx_pos_encodings=start_idx_pos_encodings, + ) + + encoder_outputs = self.encoder( + hidden_states=embedding_output, + head_mask=head_mask, + attention_mask=attention_mask, + num_hashes=num_hashes, + past_buckets_states=past_buckets_states, + use_cache=use_cache, + orig_sequence_length=orig_sequence_length, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + ) + sequence_output = encoder_outputs.hidden_states + + # if padding was applied + if must_pad_to_match_chunk_length: + sequence_output = sequence_output[:, :orig_sequence_length] + + past_buckets_states = encoder_outputs.past_buckets_states if use_cache else None + hidden_states = encoder_outputs.all_hidden_states if output_hidden_states else None + attentions = encoder_outputs.all_attentions if output_attentions else None + + if not return_dict: + return tuple(v for v in [sequence_output, past_buckets_states, hidden_states, attentions] if v is not None) + return ReformerModelOutput( + last_hidden_state=sequence_output, + past_buckets_states=past_buckets_states, + hidden_states=hidden_states, + attentions=attentions, + ) + + def _pad_to_mult_of_chunk_length( + self, + input_ids, + inputs_embeds=None, + attention_mask=None, + position_ids=None, + input_shape=None, + padding_length=None, + padded_seq_length=None, + device=None, + ): + logger.info( + f"Input ids are automatically padded from {input_shape[-1]} to {input_shape[-1] + padding_length} to be a " + f"multiple of `config.chunk_length`: {padded_seq_length}" + ) + + padded_input_ids = torch.full( + (input_shape[0], padding_length), + self.config.pad_token_id, + device=device, + dtype=torch.long, + ) + + # Extend `attention_mask` + if attention_mask is not None: + pad_attention_mask = torch.zeros(input_shape[0], padding_length, device=device, dtype=attention_mask.dtype) + + attention_mask = torch.cat([attention_mask, pad_attention_mask], dim=-1) + else: + attention_mask = torch.cat( + [ + torch.ones(input_shape, device=device, dtype=torch.bool), + torch.zeros((input_shape[0], padding_length), device=device, dtype=torch.bool), + ], + dim=-1, + ) + + # Extend `input_ids` with padding to match least common multiple chunk_length + if input_ids is not None: + input_ids = torch.cat([input_ids, padded_input_ids], dim=-1) + input_shape = input_ids.size() + + # Pad position ids if given + if position_ids is not None: + padded_position_ids = torch.arange(input_shape[-1], padded_seq_length, dtype=torch.long, device=device) + padded_position_ids = position_ids.unsqueeze(0).expand(input_shape[0], padding_length) + position_ids = torch.cat([position_ids, padded_position_ids], dim=-1) + + # Extend `inputs_embeds` with padding to match least common multiple chunk_length + if inputs_embeds is not None: + padded_inputs_embeds = self.embeddings(padded_input_ids, 
position_ids) + inputs_embeds = torch.cat([inputs_embeds, padded_inputs_embeds], dim=-2) + input_shape = inputs_embeds.size() + return input_ids, inputs_embeds, attention_mask, position_ids, input_shape + + +@add_start_docstrings("""Reformer Model with a `language modeling` head on top.""", REFORMER_START_DOCSTRING) +class ReformerModelWithLMHead(ReformerPreTrainedModel): + _keys_to_ignore_on_load_missing = ["lm_head.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + assert config.is_decoder, "If you want to use `ReformerModelWithLMHead` make sure that `is_decoder=True`." + assert "local" not in self.config.attn_layers or config.local_num_chunks_after == 0, ( + "If causal mask is enabled, make sure that `config.local_num_chunks_after` is set to 0 and not" + f" {config.local_num_chunks_after}." + ) + assert "lsh" not in self.config.attn_layers or config.lsh_num_chunks_after == 0, ( + "If causal mask is enabled, make sure that `config.lsh_num_chunks_after` is set to 1 and not" + f" {config.lsh_num_chunks_after}." + ) + + self.reformer = ReformerModel(config) + self.lm_head = ReformerOnlyLMHead(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head.decoder + + def set_output_embeddings(self, new_embeddings): + self.lm_head.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(REFORMER_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=CausalLMOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + num_hashes: Optional[int] = None, + past_buckets_states: Optional[List[Tuple[torch.Tensor]]] = None, + use_cache: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.Tensor] = None, + ) -> Union[Tuple, CausalLMOutput]: + r""" + labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for computing the sequence classification/regression loss. Indices should be in `[-100, 0, ..., + config.vocab_size - 1]`. 
All labels set to `-100` are ignored (masked), the loss is only computed for + labels in `[0, ..., config.vocab_size]` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + reformer_outputs = self.reformer( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + num_hashes=num_hashes, + past_buckets_states=past_buckets_states, + use_cache=use_cache, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + return_dict=return_dict, + ) + + sequence_output = reformer_outputs[0] + logits = self.lm_head(sequence_output) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1)) + + if not return_dict: + output = (logits,) + reformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return ReformerModelWithLMHeadOutput( + loss=loss, + logits=logits, + past_buckets_states=reformer_outputs.past_buckets_states, + hidden_states=reformer_outputs.hidden_states, + attentions=reformer_outputs.attentions, + ) + + def prepare_inputs_for_generation( + self, input_ids, past_key_values=None, use_cache=None, num_hashes=None, **kwargs + ): + # only last token for inputs_ids if past is defined in kwargs + if past_key_values is not None: + input_ids = input_ids[:, -1:] + + inputs_dict = { + "input_ids": input_ids, + "past_buckets_states": past_key_values, + "use_cache": use_cache, + "num_hashes": num_hashes, + } + + return inputs_dict + + def _reorder_cache(self, past_key_values, beam_idx): + reord_past_buckets_states = [] + for layer_past in past_key_values: + # buckets + if layer_past[0] is not None: + reord_buckets = layer_past[0].index_select(0, beam_idx) + else: + reord_buckets = None + + # hidden states + reord_hidden_states = layer_past[1].index_select(0, beam_idx) + reord_past_buckets_states.append((reord_buckets, reord_hidden_states)) + return reord_past_buckets_states + + +@add_start_docstrings("""Reformer Model with a `language modeling` head on top.""", REFORMER_START_DOCSTRING) +class ReformerForMaskedLM(ReformerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + assert not config.is_decoder, ( + "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional" + " self-attention." 
+ ) + self.reformer = ReformerModel(config) + self.lm_head = ReformerOnlyLMHead(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head.decoder + + def set_output_embeddings(self, new_embeddings): + self.lm_head.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(REFORMER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + num_hashes: Optional[int] = None, + labels: Optional[torch.Tensor] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, MaskedLMOutput]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ..., + config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), + the loss is only computed for the tokens with labels + + Returns: + + + + This example uses a false checkpoint since we don't have any available pretrained model for the masked language + modeling task with the Reformer architecture. + + + + Example: + + ```python + >>> import torch + >>> from transformers import AutoTokenizer, ReformerForMaskedLM + + >>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-reformer") + >>> model = ReformerForMaskedLM.from_pretrained("hf-internal-testing/tiny-random-reformer") + + >>> # add mask_token + >>> tokenizer.add_special_tokens({"mask_token": "[MASK]"}) # doctest: +IGNORE_RESULT + >>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt") + + >>> # resize model's embedding matrix + >>> model.resize_token_embeddings(new_num_tokens=model.config.vocab_size + 1) # doctest: +IGNORE_RESULT + + >>> with torch.no_grad(): + ... logits = model(**inputs).logits + + >>> # retrieve index of [MASK] + >>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0] + + >>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1) + >>> predicted_token = tokenizer.decode(predicted_token_id) + ``` + + ```python + >>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"] + >>> # mask labels of non-[MASK] tokens + >>> labels = torch.where( + ... inputs.input_ids == tokenizer.mask_token_id, labels[:, : inputs["input_ids"].shape[-1]], -100 + ... 
) + + >>> outputs = model(**inputs, labels=labels) + >>> loss = round(outputs.loss.item(), 2) + ``` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + reformer_outputs = self.reformer( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + num_hashes=num_hashes, + use_cache=False, # no causal mask + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + return_dict=return_dict, + ) + + sequence_output = reformer_outputs[0] + logits = self.lm_head(sequence_output) + + masked_lm_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() # -100 index = padding token + masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1)) + + if not return_dict: + output = (logits,) + reformer_outputs[1:] + return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output + + return MaskedLMOutput( + loss=masked_lm_loss, + logits=logits, + hidden_states=reformer_outputs.hidden_states, + attentions=reformer_outputs.attentions, + ) + + +@add_start_docstrings( + """ + Reformer Model transformer with a sequence classification/regression head on top (a linear layer on top of the + pooled output) e.g. for GLUE tasks. + """, + REFORMER_START_DOCSTRING, +) +class ReformerForSequenceClassification(ReformerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + self.config = config + + self.reformer = ReformerModel(config) + self.classifier = ReformerClassificationHead(config) + if config.is_decoder is True: + logger.warning("You might want to disable causal masking for sequence classification") + + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(REFORMER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + num_hashes: Optional[int] = None, + labels: Optional[torch.Tensor] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, SequenceClassifierOutput]: + r""" + labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., + config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If + `config.num_labels > 1` a classification loss is computed (Cross-Entropy). + + Returns: + + Example of single-label classification: + + ```python + >>> import torch + >>> from transformers import AutoTokenizer, ReformerForSequenceClassification + + >>> tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment") + >>> model = ReformerForSequenceClassification.from_pretrained("google/reformer-crime-and-punishment") + + >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") + + >>> with torch.no_grad(): + ... 
logits = model(**inputs).logits + + >>> predicted_class_id = logits.argmax().item() + >>> label = model.config.id2label[predicted_class_id] + ``` + + ```python + >>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)` + >>> num_labels = len(model.config.id2label) + >>> model = ReformerForSequenceClassification.from_pretrained( + ... "google/reformer-crime-and-punishment", num_labels=num_labels + ... ) + + >>> labels = torch.tensor(1) + >>> loss = model(**inputs, labels=labels).loss + ``` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.reformer( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + num_hashes=num_hashes, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + if self.config.problem_type is None: + if self.num_labels == 1: + self.config.problem_type = "regression" + elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): + self.config.problem_type = "single_label_classification" + else: + self.config.problem_type = "multi_label_classification" + + if self.config.problem_type == "regression": + loss_fct = MSELoss() + if self.num_labels == 1: + loss = loss_fct(logits.squeeze(), labels.squeeze()) + else: + loss = loss_fct(logits, labels) + elif self.config.problem_type == "single_label_classification": + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + elif self.config.problem_type == "multi_label_classification": + loss_fct = BCEWithLogitsLoss() + loss = loss_fct(logits, labels) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class ReformerClassificationHead(nn.Module): + """Head for sentence-level classification tasks.""" + + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(2 * config.hidden_size, config.hidden_size) + classifier_dropout = ( + config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.out_proj = nn.Linear(config.hidden_size, config.num_labels) + + def forward(self, hidden_states, **kwargs): + hidden_states = hidden_states[:, 0, :] # take token (equiv. to [CLS]) + hidden_states = self.dropout(hidden_states) + hidden_states = self.dense(hidden_states) + hidden_states = torch.tanh(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.out_proj(hidden_states) + return hidden_states + + +@add_start_docstrings( + """ + Reformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA + ( a linear layer on top of hidden-states output to compute `span start logits` and `span end logits`. 
+ """, + REFORMER_START_DOCSTRING, +) +class ReformerForQuestionAnswering(ReformerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.reformer = ReformerModel(config) + # 2 * config.hidden_size because we use reversible residual layers + self.qa_outputs = nn.Linear(2 * config.hidden_size, config.num_labels) + + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(REFORMER_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=QuestionAnsweringModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + num_hashes: Optional[int] = None, + start_positions: Optional[torch.Tensor] = None, + end_positions: Optional[torch.Tensor] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, QuestionAnsweringModelOutput]: + r""" + start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence + are not taken into account for computing the loss. + end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence + are not taken into account for computing the loss. 
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + reformer_outputs = self.reformer( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + num_hashes=num_hashes, + use_cache=False, # no causal mask + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + return_dict=return_dict, + ) + + sequence_output = reformer_outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1).contiguous() + end_logits = end_logits.squeeze(-1).contiguous() + + total_loss = None + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions = start_positions.clamp(0, ignored_index) + end_positions = end_positions.clamp(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if not return_dict: + output = (start_logits, end_logits) + reformer_outputs[1:] + return ((total_loss,) + output) if total_loss is not None else output + + return QuestionAnsweringModelOutput( + loss=total_loss, + start_logits=start_logits, + end_logits=end_logits, + hidden_states=reformer_outputs.hidden_states, + attentions=reformer_outputs.attentions, + ) diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer.py b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer.py new file mode 100644 index 0000000000..6e7d5ecc8f --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer.py @@ -0,0 +1,3892 @@ +# coding=utf-8 +# Copyright 2020-present the HuggingFace Inc. team. +# Copyright 2023 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +The Trainer class, to easily train a 🤗 Transformers from scratch or finetune it on a new task. 
+""" + +import contextlib +import functools +import glob +import inspect +import math +import os +import random +import re +import shutil +import sys +import time +import warnings +from collections.abc import Mapping +from pathlib import Path +from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union + +from tqdm.auto import tqdm + + +# Integrations must be imported before ML frameworks: +# isort: off +from .integrations import ( + default_hp_search_backend, + get_reporting_integration_callbacks, + hp_params, + is_fairscale_available, + is_optuna_available, + is_ray_tune_available, + is_sigopt_available, + is_wandb_available, + run_hp_search_optuna, + run_hp_search_ray, + run_hp_search_sigopt, + run_hp_search_wandb, +) + +# isort: on + +import numpy as np +import torch +import torch.distributed as dist +from huggingface_hub import Repository, create_repo +from packaging import version +from torch import nn +from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler +from torch.utils.data.distributed import DistributedSampler + +from . import __version__ +from .configuration_utils import PretrainedConfig +from .data.data_collator import DataCollator, DataCollatorWithPadding, default_data_collator +from .debug_utils import DebugOption, DebugUnderflowOverflow +from .deepspeed import deepspeed_init, is_deepspeed_zero3_enabled +from .dependency_versions_check import dep_version_check +from .modelcard import TrainingSummary +from .modeling_utils import PreTrainedModel, load_sharded_checkpoint, unwrap_model +from .models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, MODEL_MAPPING_NAMES +from .optimization import Adafactor, get_scheduler +from .pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_10, is_torch_less_than_1_11 +from .tokenization_utils_base import PreTrainedTokenizerBase +from .trainer_callback import ( + CallbackHandler, + DefaultFlowCallback, + PrinterCallback, + ProgressCallback, + TrainerCallback, + TrainerControl, + TrainerState, +) +from .trainer_pt_utils import ( + DistributedLengthGroupedSampler, + DistributedSamplerWithLoop, + DistributedTensorGatherer, + IterableDatasetShard, + LabelSmoother, + LengthGroupedSampler, + SequentialDistributedSampler, + ShardSampler, + distributed_broadcast_scalars, + distributed_concat, + find_batch_size, + get_model_param_count, + get_module_class_from_name, + get_parameter_names, + nested_concat, + nested_detach, + nested_numpify, + nested_truncate, + nested_xla_mesh_reduce, + reissue_pt_warnings, +) +from .trainer_utils import ( + PREFIX_CHECKPOINT_DIR, + BestRun, + EvalLoopOutput, + EvalPrediction, + FSDPOption, + HPSearchBackend, + HubStrategy, + IntervalStrategy, + PredictionOutput, + RemoveColumnsCollator, + ShardedDDPOption, + TrainerMemoryTracker, + TrainOutput, + default_compute_objective, + default_hp_space, + denumpify_detensorize, + enable_full_determinism, + find_executable_batch_size, + get_last_checkpoint, + has_length, + number_of_arguments, + seed_worker, + set_seed, + speed_metrics, +) +from .training_args import OptimizerNames, ParallelMode, TrainingArguments +from .utils import ( + CONFIG_NAME, + SAFE_WEIGHTS_INDEX_NAME, + SAFE_WEIGHTS_NAME, + WEIGHTS_INDEX_NAME, + WEIGHTS_NAME, + can_return_loss, + find_labels, + get_full_repo_name, + is_accelerate_available, + is_apex_available, + is_datasets_available, + is_in_notebook, + is_ipex_available, + is_safetensors_available, + is_sagemaker_dp_enabled, + is_sagemaker_mp_enabled, + 
is_torch_compile_available, + is_torch_neuroncore_available, + is_torch_tpu_available, + logging, + strtobool, +) +from .utils.generic import ContextManagers + + +_is_native_cpu_amp_available = is_torch_greater_or_equal_than_1_10 + +DEFAULT_CALLBACKS = [DefaultFlowCallback] +DEFAULT_PROGRESS_CALLBACK = ProgressCallback + +if is_in_notebook(): + from .utils.notebook import NotebookProgressCallback + + DEFAULT_PROGRESS_CALLBACK = NotebookProgressCallback + +if is_apex_available(): + from apex import amp + +if is_datasets_available(): + import datasets + +if is_torch_tpu_available(check_device=False): + import torch_xla.core.xla_model as xm + import torch_xla.debug.metrics as met + import torch_xla.distributed.parallel_loader as pl + +if is_fairscale_available(): + dep_version_check("fairscale") + import fairscale + from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP + from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP + from fairscale.nn.wrap import auto_wrap + from fairscale.optim import OSS + from fairscale.optim.grad_scaler import ShardedGradScaler + + +if is_sagemaker_mp_enabled(): + import smdistributed.modelparallel.torch as smp + from smdistributed.modelparallel import __version__ as SMP_VERSION + + IS_SAGEMAKER_MP_POST_1_10 = version.parse(SMP_VERSION) >= version.parse("1.10") + + from .trainer_pt_utils import smp_forward_backward, smp_forward_only, smp_gather, smp_nested_concat +else: + IS_SAGEMAKER_MP_POST_1_10 = False + + +if is_safetensors_available(): + import safetensors.torch + + +skip_first_batches = None +if is_accelerate_available(): + from accelerate import __version__ as accelerate_version + + if version.parse(accelerate_version) >= version.parse("0.16"): + from accelerate import skip_first_batches + + +if TYPE_CHECKING: + import optuna + +logger = logging.get_logger(__name__) + + +# Name of the files used for checkpointing +TRAINING_ARGS_NAME = "training_args.bin" +TRAINER_STATE_NAME = "trainer_state.json" +OPTIMIZER_NAME = "optimizer.pt" +SCHEDULER_NAME = "scheduler.pt" +SCALER_NAME = "scaler.pt" + + +class Trainer: + """ + Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers. + + Args: + model ([`PreTrainedModel`] or `torch.nn.Module`, *optional*): + The model to train, evaluate or use for predictions. If not provided, a `model_init` must be passed. + + + + [`Trainer`] is optimized to work with the [`PreTrainedModel`] provided by the library. You can still use + your own models defined as `torch.nn.Module` as long as they work the same way as the 🤗 Transformers + models. + + + + args ([`TrainingArguments`], *optional*): + The arguments to tweak for training. Will default to a basic instance of [`TrainingArguments`] with the + `output_dir` set to a directory named *tmp_trainer* in the current directory if not provided. + data_collator (`DataCollator`, *optional*): + The function to use to form a batch from a list of elements of `train_dataset` or `eval_dataset`. Will + default to [`default_data_collator`] if no `tokenizer` is provided, an instance of + [`DataCollatorWithPadding`] otherwise. + train_dataset (`torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`, *optional*): + The dataset to use for training. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. 
+ + Note that if it's a `torch.utils.data.IterableDataset` with some randomization and you are training in a + distributed fashion, your iterable dataset should either use a internal attribute `generator` that is a + `torch.Generator` for the randomization that must be identical on all processes (and the Trainer will + manually set the seed of this `generator` at each epoch) or have a `set_epoch()` method that internally + sets the seed of the RNGs used. + eval_dataset (Union[`torch.utils.data.Dataset`, Dict[str, `torch.utils.data.Dataset`]), *optional*): + The dataset to use for evaluation. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. If it is a dictionary, it will evaluate on each + dataset prepending the dictionary key to the metric name. + tokenizer ([`PreTrainedTokenizerBase`], *optional*): + The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the + maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an + interrupted training or reuse the fine-tuned model. + model_init (`Callable[[], PreTrainedModel]`, *optional*): + A function that instantiates the model to be used. If provided, each call to [`~Trainer.train`] will start + from a new instance of the model as given by this function. + + The function may have zero argument, or a single one containing the optuna/Ray Tune/SigOpt trial object, to + be able to choose different architectures according to hyper parameters (such as layer count, sizes of + inner layers, dropout probabilities etc). + compute_metrics (`Callable[[EvalPrediction], Dict]`, *optional*): + The function that will be used to compute metrics at evaluation. Must take a [`EvalPrediction`] and return + a dictionary string to metric values. + callbacks (List of [`TrainerCallback`], *optional*): + A list of callbacks to customize the training loop. Will add those to the list of default callbacks + detailed in [here](callback). + + If you want to remove one of the default callbacks used, use the [`Trainer.remove_callback`] method. + optimizers (`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`, *optional*): A tuple + containing the optimizer and the scheduler to use. Will default to an instance of [`AdamW`] on your model + and a scheduler given by [`get_linear_schedule_with_warmup`] controlled by `args`. + preprocess_logits_for_metrics (`Callable[[torch.Tensor, torch.Tensor], torch.Tensor]`, *optional*): + A function that preprocess the logits right before caching them at each evaluation step. Must take two + tensors, the logits and the labels, and return the logits once processed as desired. The modifications made + by this function will be reflected in the predictions received by `compute_metrics`. + + Note that the labels (second parameter) will be `None` if the dataset does not have them. + + Important attributes: + + - **model** -- Always points to the core model. If using a transformers model, it will be a [`PreTrainedModel`] + subclass. + - **model_wrapped** -- Always points to the most external model in case one or more other modules wrap the + original model. This is the model that should be used for the forward pass. For example, under `DeepSpeed`, + the inner model is wrapped in `DeepSpeed` and then again in `torch.nn.DistributedDataParallel`. If the inner + model hasn't been wrapped, then `self.model_wrapped` is the same as `self.model`. 
+ - **is_model_parallel** -- Whether or not a model has been switched to a model parallel mode (different from + data parallelism, this means some of the model layers are split on different GPUs). + - **place_model_on_device** -- Whether or not to automatically place the model on the device - it will be set + to `False` if model parallel or deepspeed is used, or if the default + `TrainingArguments.place_model_on_device` is overridden to return `False` . + - **is_in_train** -- Whether or not a model is currently running `train` (e.g. when `evaluate` is called while + in `train`) + + """ + + from .trainer_pt_utils import _get_learning_rate, log_metrics, metrics_format, save_metrics, save_state + + def __init__( + self, + model: Union[PreTrainedModel, nn.Module] = None, + args: TrainingArguments = None, + data_collator: Optional[DataCollator] = None, + train_dataset: Optional[Dataset] = None, + eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + tokenizer: Optional[PreTrainedTokenizerBase] = None, + model_init: Optional[Callable[[], PreTrainedModel]] = None, + compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, + callbacks: Optional[List[TrainerCallback]] = None, + optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), + preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, + ): + if args is None: + output_dir = "tmp_trainer" + logger.info(f"No `TrainingArguments` passed, using `output_dir={output_dir}`.") + args = TrainingArguments(output_dir=output_dir) + self.args = args + # Seed must be set before instantiating the model when using model + enable_full_determinism(self.args.seed) if self.args.full_determinism else set_seed(self.args.seed) + self.hp_name = None + self.deepspeed = None + self.is_in_train = False + + # memory metrics - must set up as early as possible + self._memory_tracker = TrainerMemoryTracker(self.args.skip_memory_metrics) + self._memory_tracker.start() + + # set the correct log level depending on the node + log_level = args.get_process_log_level() + logging.set_verbosity(log_level) + + # force device and distributed setup init explicitly + args._setup_devices + + if model is None: + if model_init is not None: + self.model_init = model_init + model = self.call_model_init() + else: + raise RuntimeError("`Trainer` requires either a `model` or `model_init` argument") + else: + if model_init is not None: + warnings.warn( + "`Trainer` requires either a `model` or `model_init` argument, but not both. `model_init` will" + " overwrite your model when calling the `train` method. This will become a fatal error in the next" + " release.", + FutureWarning, + ) + self.model_init = model_init + + if model.__class__.__name__ in MODEL_MAPPING_NAMES: + raise ValueError( + f"The model you have picked ({model.__class__.__name__}) cannot be used as is for training: it only " + "computes hidden states and does not accept any labels. You should choose a model with a head " + "suitable for your task like any of the `AutoModelForXxx` listed at " + "https://huggingface.co/docs/transformers/model_doc/auto." 
+ ) + + if hasattr(model, "is_parallelizable") and model.is_parallelizable and model.model_parallel: + self.is_model_parallel = True + else: + self.is_model_parallel = False + + if ( + getattr(model, "hf_device_map", None) is not None + and len([device for device in set(model.hf_device_map.values()) if device not in ["cpu", "disk"]]) > 1 + and not self.is_model_parallel + ): + self.is_model_parallel = True + + # warn users + logger.info( + "You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set" + " to `True` to avoid any unexpected behavior such as device placement mismatching." + ) + + # At this stage the model is already loaded + if getattr(model, "is_loaded_in_8bit", False): + if getattr(model, "_is_int8_training_enabled", False): + logger.info( + "The model is loaded in 8-bit precision. To train this model you need to add additional modules" + " inside the model such as adapters using `peft` library and freeze the model weights. Please" + " check " + " the examples in https://github.com/huggingface/peft for more details." + ) + else: + raise ValueError( + "The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit" + " model, please make sure that you have installed `bitsandbytes>=0.37.0`. " + ) + + # Setup Sharded DDP training + self.sharded_ddp = None + if len(args.sharded_ddp) > 0: + if args.deepspeed: + raise ValueError( + "Using --sharded_ddp xxx together with --deepspeed is not possible, deactivate one of those flags." + ) + if len(args.fsdp) > 0: + raise ValueError( + "Using --sharded_ddp xxx together with --fsdp is not possible, deactivate one of those flags." + ) + + if args.local_rank == -1: + raise ValueError("Using sharded DDP only works in distributed training.") + elif not is_fairscale_available(): + raise ImportError("Sharded DDP training requires fairscale: `pip install fairscale`.") + elif ShardedDDPOption.SIMPLE not in args.sharded_ddp and FullyShardedDDP is None: + raise ImportError( + "Sharded DDP in a mode other than simple training requires fairscale version >= 0.3, found " + f"{fairscale.__version__}. Upgrade your fairscale library: `pip install --upgrade fairscale`." + ) + elif ShardedDDPOption.SIMPLE in args.sharded_ddp: + self.sharded_ddp = ShardedDDPOption.SIMPLE + elif ShardedDDPOption.ZERO_DP_2 in args.sharded_ddp: + self.sharded_ddp = ShardedDDPOption.ZERO_DP_2 + elif ShardedDDPOption.ZERO_DP_3 in args.sharded_ddp: + self.sharded_ddp = ShardedDDPOption.ZERO_DP_3 + + self.fsdp = None + if len(args.fsdp) > 0: + if args.deepspeed: + raise ValueError( + "Using --fsdp xxx together with --deepspeed is not possible, deactivate one of those flags." + ) + if not args.fsdp_config["xla"] and args.local_rank == -1: + raise ValueError("Using fsdp only works in distributed training.") + + # dep_version_check("torch>=1.12.0") + # Would have to update setup.py with torch>=1.12.0 + # which isn't ideally given that it will force people not using FSDP to also use torch>=1.12.0 + # below is the current alternative. 
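+            # The `base_version` comparison strips pre-release/local tags (e.g. "1.13.0a0+gitabc"
+            # is treated as "1.13.0") so nightly or custom torch builds pass the 1.12.0 check.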
+ if version.parse(version.parse(torch.__version__).base_version) < version.parse("1.12.0"): + raise ValueError("FSDP requires PyTorch >= 1.12.0") + + from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch, ShardingStrategy + + if FSDPOption.FULL_SHARD in args.fsdp: + self.fsdp = ShardingStrategy.FULL_SHARD + elif FSDPOption.SHARD_GRAD_OP in args.fsdp: + self.fsdp = ShardingStrategy.SHARD_GRAD_OP + elif FSDPOption.NO_SHARD in args.fsdp: + self.fsdp = ShardingStrategy.NO_SHARD + + self.backward_prefetch = BackwardPrefetch.BACKWARD_PRE + if "backward_prefetch" in self.args.fsdp_config and "backward_pos" not in self.backward_prefetch: + self.backward_prefetch = BackwardPrefetch.BACKWARD_POST + + self.forward_prefetch = False + if self.args.fsdp_config.get("forward_prefect", False): + self.forward_prefetch = True + + self.limit_all_gathers = False + if self.args.fsdp_config.get("limit_all_gathers", False): + self.limit_all_gathers = True + + # one place to sort out whether to place the model on device or not + # postpone switching model to cuda when: + # 1. MP - since we are trying to fit a much bigger than 1 gpu model + # 2. fp16-enabled DeepSpeed loads the model in half the size and it doesn't need .to() anyway, + # and we only use deepspeed for training at the moment + # 3. full bf16 or fp16 eval - since the model needs to be cast to the right dtype first + # 4. Sharded DDP - same as MP + # 5. FSDP - same as MP + self.place_model_on_device = args.place_model_on_device + if ( + self.is_model_parallel + or args.deepspeed + or ((args.fp16_full_eval or args.bf16_full_eval) and not args.do_train) + or (self.sharded_ddp in [ShardedDDPOption.ZERO_DP_2, ShardedDDPOption.ZERO_DP_3]) + or (self.fsdp is not None) + ): + self.place_model_on_device = False + + default_collator = default_data_collator if tokenizer is None else DataCollatorWithPadding(tokenizer) + self.data_collator = data_collator if data_collator is not None else default_collator + self.train_dataset = train_dataset + self.eval_dataset = eval_dataset + self.tokenizer = tokenizer + + if self.place_model_on_device and not getattr(model, "is_loaded_in_8bit", False): + self._move_model_to_device(model, args.device) + + # Force n_gpu to 1 to avoid DataParallel as MP will manage the GPUs + if self.is_model_parallel: + self.args._n_gpu = 1 + + # later use `self.model is self.model_wrapped` to check if it's wrapped or not + self.model_wrapped = model + self.model = model + + self.compute_metrics = compute_metrics + self.preprocess_logits_for_metrics = preprocess_logits_for_metrics + self.optimizer, self.lr_scheduler = optimizers + if model_init is not None and (self.optimizer is not None or self.lr_scheduler is not None): + raise RuntimeError( + "Passing a `model_init` is incompatible with providing the `optimizers` argument. " + "You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method." + ) + if is_torch_tpu_available() and self.optimizer is not None: + for param in self.model.parameters(): + model_device = param.device + break + for param_group in self.optimizer.param_groups: + if len(param_group["params"]) > 0: + optimizer_device = param_group["params"][0].device + break + if model_device != optimizer_device: + raise ValueError( + "The model and the optimizer parameters are not on the same device, which probably means you" + " created an optimizer around your model **before** putting on the device and passing it to the" + " `Trainer`. 
Make sure the lines `import torch_xla.core.xla_model as xm` and" + " `model.to(xm.xla_device())` is performed before the optimizer creation in your script." + ) + if ((self.sharded_ddp is not None) or args.deepspeed or (self.fsdp is not None)) and ( + self.optimizer is not None or self.lr_scheduler is not None + ): + raise RuntimeError( + "Passing `optimizers` is not allowed if Fairscale, Deepspeed or PyTorch FSDP is enabled." + "You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method." + ) + default_callbacks = DEFAULT_CALLBACKS + get_reporting_integration_callbacks(self.args.report_to) + callbacks = default_callbacks if callbacks is None else default_callbacks + callbacks + self.callback_handler = CallbackHandler( + callbacks, self.model, self.tokenizer, self.optimizer, self.lr_scheduler + ) + self.add_callback(PrinterCallback if self.args.disable_tqdm else DEFAULT_PROGRESS_CALLBACK) + + # Will be set to True by `self._setup_loggers()` on first call to `self.log()`. + self._loggers_initialized = False + + # Create clone of distant repo and output directory if needed + if self.args.push_to_hub: + self.init_git_repo(at_init=True) + # In case of pull, we need to make sure every process has the latest. + if is_torch_tpu_available(): + xm.rendezvous("init git repo") + elif args.local_rank != -1: + dist.barrier() + + if self.args.should_save: + os.makedirs(self.args.output_dir, exist_ok=True) + + if not callable(self.data_collator) and callable(getattr(self.data_collator, "collate_batch", None)): + raise ValueError("The `data_collator` should be a simple callable (function, class with `__call__`).") + + if args.max_steps > 0: + logger.info("max_steps is given, it will override any value given in num_train_epochs") + + if train_dataset is not None and not has_length(train_dataset) and args.max_steps <= 0: + raise ValueError( + "The train_dataset does not implement __len__, max_steps has to be specified. " + "The number of steps needs to be known in advance for the learning rate scheduler." + ) + + if ( + train_dataset is not None + and isinstance(train_dataset, torch.utils.data.IterableDataset) + and args.group_by_length + ): + raise ValueError("the `--group_by_length` option is only available for `Dataset`, not `IterableDataset") + + self._signature_columns = None + + # Mixed precision setup + self.use_apex = False + self.use_cuda_amp = False + self.use_cpu_amp = False + + # Mixed precision setup for SageMaker Model Parallel + if is_sagemaker_mp_enabled(): + # BF16 + model parallelism in SageMaker: currently not supported, raise an error + if args.bf16: + raise ValueError("SageMaker Model Parallelism does not support BF16 yet. Please use FP16 instead ") + + if IS_SAGEMAKER_MP_POST_1_10: + # When there's mismatch between SMP config and trainer argument, use SMP config as truth + if args.fp16 != smp.state.cfg.fp16: + logger.warning( + f"FP16 provided in SM_HP_MP_PARAMETERS is {smp.state.cfg.fp16}," + f"but FP16 provided in trainer argument is {args.fp16}," + f"setting to {smp.state.cfg.fp16}" + ) + args.fp16 = smp.state.cfg.fp16 + else: + # smp < 1.10 does not support fp16 in trainer. + if hasattr(smp.state.cfg, "fp16"): + logger.warning( + f"FP16 provided in SM_HP_MP_PARAMETERS is {smp.state.cfg.fp16}, " + "but SageMaker Model Parallelism < 1.10 does not support FP16 in trainer." 
+ ) + + if args.fp16 or args.bf16: + if args.half_precision_backend == "auto": + if args.device == torch.device("cpu"): + if args.fp16: + raise ValueError("Tried to use `fp16` but it is not supported on cpu") + elif _is_native_cpu_amp_available: + args.half_precision_backend = "cpu_amp" + else: + raise ValueError("Tried to use cpu amp but native cpu amp is not available") + else: + args.half_precision_backend = "cuda_amp" + + logger.info(f"Using {args.half_precision_backend} half precision backend") + + self.do_grad_scaling = False + if (args.fp16 or args.bf16) and not (args.deepspeed or is_sagemaker_mp_enabled()): + # deepspeed and SageMaker Model Parallel manage their own half precision + if args.half_precision_backend == "cuda_amp": + self.use_cuda_amp = True + self.amp_dtype = torch.float16 if args.fp16 else torch.bfloat16 + # bf16 does not need grad scaling + self.do_grad_scaling = self.amp_dtype == torch.float16 + if self.do_grad_scaling: + if self.sharded_ddp is not None: + self.scaler = ShardedGradScaler() + elif self.fsdp is not None: + from torch.distributed.fsdp.sharded_grad_scaler import ( + ShardedGradScaler as FSDPShardedGradScaler, + ) + + self.scaler = FSDPShardedGradScaler() + elif is_torch_tpu_available(): + from torch_xla.amp import GradScaler + + self.scaler = GradScaler() + else: + self.scaler = torch.cuda.amp.GradScaler() + elif args.half_precision_backend == "cpu_amp": + self.use_cpu_amp = True + self.amp_dtype = torch.bfloat16 + else: + if not is_apex_available(): + raise ImportError( + "Using FP16 with APEX but APEX is not installed, please refer to" + " https://www.github.com/nvidia/apex." + ) + self.use_apex = True + + # FP16 + model parallelism in SageMaker: gradient clipping does not work for now so we raise a helpful error. + if ( + is_sagemaker_mp_enabled() + and self.use_cuda_amp + and args.max_grad_norm is not None + and args.max_grad_norm > 0 + ): + raise ValueError( + "SageMaker Model Parallelism in mixed precision mode does not support gradient clipping yet. Pass " + "along 'max_grad_norm': 0 in your hyperparameters." + ) + + # Label smoothing + if self.args.label_smoothing_factor != 0: + self.label_smoother = LabelSmoother(epsilon=self.args.label_smoothing_factor) + else: + self.label_smoother = None + + self.state = TrainerState( + is_local_process_zero=self.is_local_process_zero(), + is_world_process_zero=self.is_world_process_zero(), + ) + + self.control = TrainerControl() + # Internal variable to count flos in each process, will be accumulated in `self.state.total_flos` then + # returned to 0 every time flos need to be logged + self.current_flos = 0 + self.hp_search_backend = None + self.use_tune_checkpoints = False + default_label_names = find_labels(self.model.__class__) + self.label_names = default_label_names if self.args.label_names is None else self.args.label_names + self.can_return_loss = can_return_loss(self.model.__class__) + self.control = self.callback_handler.on_init_end(self.args, self.state, self.control) + + # Internal variables to keep track of the original batch size + self._train_batch_size = args.train_batch_size + + # very last + self._memory_tracker.stop_and_update_metrics() + + # torch.compile + if args.torch_compile and not is_torch_compile_available(): + raise RuntimeError("Using torch.compile requires PyTorch 2.0 or higher.") + + def add_callback(self, callback): + """ + Add a callback to the current list of [`~transformer.TrainerCallback`]. 
+ + Args: + callback (`type` or [`~transformer.TrainerCallback`]): + A [`~transformer.TrainerCallback`] class or an instance of a [`~transformer.TrainerCallback`]. In the + first case, will instantiate a member of that class. + """ + self.callback_handler.add_callback(callback) + + def pop_callback(self, callback): + """ + Remove a callback from the current list of [`~transformer.TrainerCallback`] and returns it. + + If the callback is not found, returns `None` (and no error is raised). + + Args: + callback (`type` or [`~transformer.TrainerCallback`]): + A [`~transformer.TrainerCallback`] class or an instance of a [`~transformer.TrainerCallback`]. In the + first case, will pop the first member of that class found in the list of callbacks. + + Returns: + [`~transformer.TrainerCallback`]: The callback removed, if found. + """ + return self.callback_handler.pop_callback(callback) + + def remove_callback(self, callback): + """ + Remove a callback from the current list of [`~transformer.TrainerCallback`]. + + Args: + callback (`type` or [`~transformer.TrainerCallback`]): + A [`~transformer.TrainerCallback`] class or an instance of a [`~transformer.TrainerCallback`]. In the + first case, will remove the first member of that class found in the list of callbacks. + """ + self.callback_handler.remove_callback(callback) + + def _move_model_to_device(self, model, device): + model = model.to(device) + # Moving a model to an XLA device disconnects the tied weights, so we have to retie them. + if self.args.parallel_mode == ParallelMode.TPU and hasattr(model, "tie_weights"): + model.tie_weights() + + def _set_signature_columns_if_needed(self): + if self._signature_columns is None: + # Inspect model forward signature to keep only the arguments it accepts. + signature = inspect.signature(self.model.forward) + self._signature_columns = list(signature.parameters.keys()) + # Labels may be named label or label_ids, the default data collator handles that. + self._signature_columns += list(set(["label", "label_ids"] + self.label_names)) + + def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None): + if not self.args.remove_unused_columns: + return dataset + self._set_signature_columns_if_needed() + signature_columns = self._signature_columns + + ignored_columns = list(set(dataset.column_names) - set(signature_columns)) + if len(ignored_columns) > 0: + dset_description = "" if description is None else f"in the {description} set" + logger.info( + f"The following columns {dset_description} don't have a corresponding argument in " + f"`{self.model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored_columns)}." + f" If {', '.join(ignored_columns)} are not expected by `{self.model.__class__.__name__}.forward`, " + " you can safely ignore this message." 
+ ) + + columns = [k for k in signature_columns if k in dataset.column_names] + + if version.parse(datasets.__version__) < version.parse("1.4.0"): + dataset.set_format( + type=dataset.format["type"], columns=columns, format_kwargs=dataset.format["format_kwargs"] + ) + return dataset + else: + return dataset.remove_columns(ignored_columns) + + def _get_collator_with_removed_columns( + self, data_collator: Callable, description: Optional[str] = None + ) -> Callable: + """Wrap the data collator in a callable removing unused columns.""" + if not self.args.remove_unused_columns: + return data_collator + self._set_signature_columns_if_needed() + signature_columns = self._signature_columns + + remove_columns_collator = RemoveColumnsCollator( + data_collator=data_collator, + signature_columns=signature_columns, + logger=logger, + description=description, + model_name=self.model.__class__.__name__, + ) + return remove_columns_collator + + def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]: + if self.train_dataset is None or not has_length(self.train_dataset): + return None + + generator = None + if self.args.world_size <= 1: + generator = torch.Generator() + # for backwards compatibility, we generate a seed here (which is sampled from a generator seeded with + # `args.seed`) if data_seed isn't provided. + # Further on in this method, we default to `args.seed` instead. + if self.args.data_seed is None: + seed = int(torch.empty((), dtype=torch.int64).random_().item()) + else: + seed = self.args.data_seed + generator.manual_seed(seed) + + seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed + + # Build the sampler. + if self.args.group_by_length: + if is_datasets_available() and isinstance(self.train_dataset, datasets.Dataset): + lengths = ( + self.train_dataset[self.args.length_column_name] + if self.args.length_column_name in self.train_dataset.column_names + else None + ) + else: + lengths = None + model_input_name = self.tokenizer.model_input_names[0] if self.tokenizer is not None else None + if self.args.world_size <= 1: + return LengthGroupedSampler( + self.args.train_batch_size * self.args.gradient_accumulation_steps, + dataset=self.train_dataset, + lengths=lengths, + model_input_name=model_input_name, + generator=generator, + ) + else: + return DistributedLengthGroupedSampler( + self.args.train_batch_size * self.args.gradient_accumulation_steps, + dataset=self.train_dataset, + num_replicas=self.args.world_size, + rank=self.args.process_index, + lengths=lengths, + model_input_name=model_input_name, + seed=seed, + ) + + else: + if self.args.world_size <= 1: + return RandomSampler(self.train_dataset, generator=generator) + elif ( + self.args.parallel_mode in [ParallelMode.TPU, ParallelMode.SAGEMAKER_MODEL_PARALLEL] + and not self.args.dataloader_drop_last + ): + # Use a loop for TPUs when drop_last is False to have all batches have the same size. + return DistributedSamplerWithLoop( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + num_replicas=self.args.world_size, + rank=self.args.process_index, + seed=seed, + ) + else: + return DistributedSampler( + self.train_dataset, + num_replicas=self.args.world_size, + rank=self.args.process_index, + seed=seed, + ) + + def get_train_dataloader(self) -> DataLoader: + """ + Returns the training [`~torch.utils.data.DataLoader`]. + + Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed + training if necessary) otherwise. 
+ + Subclass and override this method if you want to inject some custom behavior. + """ + if self.train_dataset is None: + raise ValueError("Trainer: training requires a train_dataset.") + + train_dataset = self.train_dataset + data_collator = self.data_collator + if is_datasets_available() and isinstance(train_dataset, datasets.Dataset): + train_dataset = self._remove_unused_columns(train_dataset, description="training") + else: + data_collator = self._get_collator_with_removed_columns(data_collator, description="training") + + if isinstance(train_dataset, torch.utils.data.IterableDataset): + if self.args.world_size > 1: + train_dataset = IterableDatasetShard( + train_dataset, + batch_size=self._train_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + return DataLoader( + train_dataset, + batch_size=self._train_batch_size, + collate_fn=data_collator, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + ) + + train_sampler = self._get_train_sampler() + + return DataLoader( + train_dataset, + batch_size=self._train_batch_size, + sampler=train_sampler, + collate_fn=data_collator, + drop_last=self.args.dataloader_drop_last, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + worker_init_fn=seed_worker, + ) + + def _get_eval_sampler(self, eval_dataset: Dataset) -> Optional[torch.utils.data.Sampler]: + # Deprecated code + if self.args.use_legacy_prediction_loop: + if is_torch_tpu_available(): + return SequentialDistributedSampler( + eval_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal() + ) + elif is_sagemaker_mp_enabled(): + return SequentialDistributedSampler( + eval_dataset, + num_replicas=smp.dp_size(), + rank=smp.dp_rank(), + batch_size=self.args.per_device_eval_batch_size, + ) + elif self.args.local_rank != -1: + return SequentialDistributedSampler(eval_dataset) + else: + return SequentialSampler(eval_dataset) + + if self.args.world_size <= 1: + return SequentialSampler(eval_dataset) + else: + return ShardSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader: + """ + Returns the evaluation [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + eval_dataset (`torch.utils.data.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted + by the `model.forward()` method are automatically removed. It must implement `__len__`. 
+ """ + if eval_dataset is None and self.eval_dataset is None: + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + data_collator = self.data_collator + + if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset): + eval_dataset = self._remove_unused_columns(eval_dataset, description="evaluation") + else: + data_collator = self._get_collator_with_removed_columns(data_collator, description="evaluation") + + if isinstance(eval_dataset, torch.utils.data.IterableDataset): + if self.args.world_size > 1: + eval_dataset = IterableDatasetShard( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + return DataLoader( + eval_dataset, + batch_size=self.args.eval_batch_size, + collate_fn=data_collator, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + ) + + eval_sampler = self._get_eval_sampler(eval_dataset) + + return DataLoader( + eval_dataset, + sampler=eval_sampler, + batch_size=self.args.eval_batch_size, + collate_fn=data_collator, + drop_last=self.args.dataloader_drop_last, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + ) + + def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader: + """ + Returns the test [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + test_dataset (`torch.utils.data.Dataset`, *optional*): + The test dataset to use. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. It must implement `__len__`. + """ + data_collator = self.data_collator + + if is_datasets_available() and isinstance(test_dataset, datasets.Dataset): + test_dataset = self._remove_unused_columns(test_dataset, description="test") + else: + data_collator = self._get_collator_with_removed_columns(data_collator, description="test") + + if isinstance(test_dataset, torch.utils.data.IterableDataset): + if self.args.world_size > 1: + test_dataset = IterableDatasetShard( + test_dataset, + batch_size=self.args.eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + return DataLoader( + test_dataset, + batch_size=self.args.eval_batch_size, + collate_fn=data_collator, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + ) + + test_sampler = self._get_eval_sampler(test_dataset) + + # We use the same batch_size as for eval. + return DataLoader( + test_dataset, + sampler=test_sampler, + batch_size=self.args.eval_batch_size, + collate_fn=data_collator, + drop_last=self.args.dataloader_drop_last, + num_workers=self.args.dataloader_num_workers, + pin_memory=self.args.dataloader_pin_memory, + ) + + def create_optimizer_and_scheduler(self, num_training_steps: int): + """ + Setup the optimizer and the learning rate scheduler. + + We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the + Trainer's init through `optimizers`, or subclass and override this method (or `create_optimizer` and/or + `create_scheduler`) in a subclass. 
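+
+        Example of the `optimizers` route (a minimal sketch; `model`, `training_args` and `train_ds` are assumed to
+        be defined elsewhere, and the AdamW/LambdaLR settings below are placeholders, not recommended values):
+
+        ```python
+        from torch.optim import AdamW
+        from torch.optim.lr_scheduler import LambdaLR
+
+        optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
+        lr_scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1.0)  # constant schedule
+        trainer = Trainer(
+            model=model,
+            args=training_args,
+            train_dataset=train_ds,
+            optimizers=(optimizer, lr_scheduler),
+        )
+        ```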
+ """ + self.create_optimizer() + if IS_SAGEMAKER_MP_POST_1_10 and smp.state.cfg.fp16: + # If smp >= 1.10 and fp16 is enabled, we unwrap the optimizer + optimizer = self.optimizer.optimizer + else: + optimizer = self.optimizer + self.create_scheduler(num_training_steps=num_training_steps, optimizer=optimizer) + + def create_optimizer(self): + """ + Setup the optimizer. + + We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the + Trainer's init through `optimizers`, or subclass and override this method in a subclass. + """ + opt_model = self.model_wrapped if is_sagemaker_mp_enabled() else self.model + + if self.optimizer is None: + decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS) + decay_parameters = [name for name in decay_parameters if "bias" not in name] + optimizer_grouped_parameters = [ + { + "params": [ + p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad) + ], + "weight_decay": self.args.weight_decay, + }, + { + "params": [ + p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad) + ], + "weight_decay": 0.0, + }, + ] + + optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args) + + if self.sharded_ddp == ShardedDDPOption.SIMPLE: + self.optimizer = OSS( + params=optimizer_grouped_parameters, + optim=optimizer_cls, + **optimizer_kwargs, + ) + else: + self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs) + if optimizer_cls.__name__ == "Adam8bit": + import bitsandbytes + + manager = bitsandbytes.optim.GlobalOptimManager.get_instance() + + skipped = 0 + for module in opt_model.modules(): + if isinstance(module, nn.Embedding): + skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values()) + print(f"skipped {module}: {skipped/2**20}M params") + manager.register_module_override(module, "weight", {"optim_bits": 32}) + logger.debug(f"bitsandbytes: will optimize {module} in fp32") + print(f"skipped: {skipped/2**20}M params") + + if is_sagemaker_mp_enabled(): + self.optimizer = smp.DistributedOptimizer(self.optimizer) + + return self.optimizer + + @staticmethod + def get_optimizer_cls_and_kwargs(args: TrainingArguments) -> Tuple[Any, Any]: + """ + Returns the optimizer class and optimizer parameters based on the training arguments. + + Args: + args (`transformers.training_args.TrainingArguments`): + The training arguments for the training session. 
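+
+        Note (NPU adaptation): this method also handles `OptimizerNames.ADAMW_TORCH_NPU_FUSED`, returning
+        `NpuFusedAdamW` from `torch_npu.optim` (see the corresponding branch below); a `ValueError` is raised if
+        `torch_npu` cannot be imported.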
+ + """ + + # parse args.optim_args + optim_args = {} + if args.optim_args: + for mapping in args.optim_args.replace(" ", "").split(","): + key, value = mapping.split("=") + optim_args[key] = value + + optimizer_kwargs = {"lr": args.learning_rate} + + adam_kwargs = { + "betas": (args.adam_beta1, args.adam_beta2), + "eps": args.adam_epsilon, + } + if args.optim == OptimizerNames.ADAFACTOR: + optimizer_cls = Adafactor + optimizer_kwargs.update({"scale_parameter": False, "relative_step": False}) + elif args.optim == OptimizerNames.ADAMW_HF: + from .optimization import AdamW + + optimizer_cls = AdamW + optimizer_kwargs.update(adam_kwargs) + elif args.optim in [OptimizerNames.ADAMW_TORCH, OptimizerNames.ADAMW_TORCH_FUSED]: + from torch.optim import AdamW + + optimizer_cls = AdamW + optimizer_kwargs.update(adam_kwargs) + if args.optim == OptimizerNames.ADAMW_TORCH_FUSED: + optimizer_kwargs.update({"fused": True}) + elif args.optim == OptimizerNames.ADAMW_TORCH_XLA: + try: + from torch_xla.amp.syncfree import AdamW + + optimizer_cls = AdamW + optimizer_kwargs.update(adam_kwargs) + except ImportError: + raise ValueError("Trainer failed to import syncfree AdamW from torch_xla.") + elif args.optim == OptimizerNames.ADAMW_APEX_FUSED: + try: + from apex.optimizers import FusedAdam + + optimizer_cls = FusedAdam + optimizer_kwargs.update(adam_kwargs) + except ImportError: + raise ValueError("Trainer tried to instantiate apex FusedAdam but apex is not installed!") + elif args.optim == OptimizerNames.ADAMW_BNB: + try: + from bitsandbytes.optim import Adam8bit + + optimizer_cls = Adam8bit + optimizer_kwargs.update(adam_kwargs) + except ImportError: + raise ValueError("Trainer tried to instantiate bnb Adam8bit but bnb is not installed!") + elif args.optim == OptimizerNames.ADAMW_ANYPRECISION: + try: + from torchdistx.optimizers import AnyPrecisionAdamW + + optimizer_cls = AnyPrecisionAdamW + optimizer_kwargs.update(adam_kwargs) + + # TODO Change dtypes back to M=FP32, Var = BF16, Kahan = False once they can be cast together in torchdistx. + optimizer_kwargs.update( + { + "use_kahan_summation": strtobool(optim_args.get("use_kahan_summation", "False")), + "momentum_dtype": getattr(torch, optim_args.get("momentum_dtype", "float32")), + "variance_dtype": getattr(torch, optim_args.get("variance_dtype", "float32")), + "compensation_buffer_dtype": getattr( + torch, optim_args.get("compensation_buffer_dtype", "bfloat16") + ), + } + ) + except ImportError: + raise ValueError("Please install https://github.com/pytorch/torchdistx") + elif args.optim == OptimizerNames.SGD: + optimizer_cls = torch.optim.SGD + elif args.optim == OptimizerNames.ADAGRAD: + optimizer_cls = torch.optim.Adagrad + elif args.optim == OptimizerNames.ADAMW_TORCH_NPU_FUSED: + try: + from torch_npu.optim import NpuFusedAdamW + + optimizer_cls = NpuFusedAdamW + optimizer_kwargs.update(adam_kwargs) + except ImportError: + raise ValueError("Trainer failed to import NpuFusedAdamW from torch_npu.optim.") + else: + raise ValueError(f"Trainer cannot instantiate unsupported optimizer: {args.optim}") + return optimizer_cls, optimizer_kwargs + + def create_scheduler(self, num_training_steps: int, optimizer: torch.optim.Optimizer = None): + """ + Setup the scheduler. The optimizer of the trainer must have been set up either before this method is called or + passed as an argument. + + Args: + num_training_steps (int): The number of training steps to do. 
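+            optimizer (`torch.optim.Optimizer`, *optional*): The optimizer whose learning rate will be scheduled;
+                defaults to `self.optimizer` when not provided.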
+ """ + if self.lr_scheduler is None: + self.lr_scheduler = get_scheduler( + self.args.lr_scheduler_type, + optimizer=self.optimizer if optimizer is None else optimizer, + num_warmup_steps=self.args.get_warmup_steps(num_training_steps), + num_training_steps=num_training_steps, + ) + return self.lr_scheduler + + def num_examples(self, dataloader: DataLoader) -> int: + """ + Helper to get number of samples in a [`~torch.utils.data.DataLoader`] by accessing its dataset. When + dataloader.dataset does not exist or has no length, estimates as best it can + """ + try: + dataset = dataloader.dataset + # Special case for IterableDatasetShard, we need to dig deeper + if isinstance(dataset, IterableDatasetShard): + return len(dataloader.dataset.dataset) + return len(dataloader.dataset) + except (NameError, AttributeError, TypeError): # no dataset or length, estimate by length of dataloader + return len(dataloader) * self.args.per_device_train_batch_size + + def _hp_search_setup(self, trial: Union["optuna.Trial", Dict[str, Any]]): + """HP search setup code""" + self._trial = trial + + if self.hp_search_backend is None or trial is None: + return + if self.hp_search_backend == HPSearchBackend.OPTUNA: + params = self.hp_space(trial) + elif self.hp_search_backend == HPSearchBackend.RAY: + params = trial + params.pop("wandb", None) + elif self.hp_search_backend == HPSearchBackend.SIGOPT: + params = {k: int(v) if isinstance(v, str) else v for k, v in trial.assignments.items()} + elif self.hp_search_backend == HPSearchBackend.WANDB: + params = trial + + for key, value in params.items(): + if not hasattr(self.args, key): + logger.warning( + f"Trying to set {key} in the hyperparameter search but there is no corresponding field in" + " `TrainingArguments`." + ) + continue + old_attr = getattr(self.args, key, None) + # Casting value to the proper type + if old_attr is not None: + value = type(old_attr)(value) + setattr(self.args, key, value) + if self.hp_search_backend == HPSearchBackend.OPTUNA: + logger.info(f"Trial: {trial.params}") + if self.hp_search_backend == HPSearchBackend.SIGOPT: + logger.info(f"SigOpt Assignments: {trial.assignments}") + if self.hp_search_backend == HPSearchBackend.WANDB: + logger.info(f"W&B Sweep parameters: {trial}") + if self.args.deepspeed: + # Rebuild the deepspeed config to reflect the updated training parameters + from transformers.deepspeed import HfTrainerDeepSpeedConfig + + self.args.hf_deepspeed_config = HfTrainerDeepSpeedConfig(self.args.deepspeed) + self.args.hf_deepspeed_config.trainer_config_process(self.args) + + def _report_to_hp_search(self, trial: Union["optuna.Trial", Dict[str, Any]], step: int, metrics: Dict[str, float]): + if self.hp_search_backend is None or trial is None: + return + self.objective = self.compute_objective(metrics.copy()) + if self.hp_search_backend == HPSearchBackend.OPTUNA: + import optuna + + trial.report(self.objective, step) + if trial.should_prune(): + self.callback_handler.on_train_end(self.args, self.state, self.control) + raise optuna.TrialPruned() + elif self.hp_search_backend == HPSearchBackend.RAY: + from ray import tune + + if self.control.should_save: + self._tune_save_checkpoint() + tune.report(objective=self.objective, **metrics) + + def _tune_save_checkpoint(self): + from ray import tune + + if not self.use_tune_checkpoints: + return + with tune.checkpoint_dir(step=self.state.global_step) as checkpoint_dir: + output_dir = os.path.join(checkpoint_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}") + 
self.save_model(output_dir, _internal_call=True) + if self.args.should_save: + self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) + torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME)) + torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME)) + + def call_model_init(self, trial=None): + model_init_argcount = number_of_arguments(self.model_init) + if model_init_argcount == 0: + model = self.model_init() + elif model_init_argcount == 1: + model = self.model_init(trial) + else: + raise RuntimeError("model_init should have 0 or 1 argument.") + + if model is None: + raise RuntimeError("model_init should not return None.") + + return model + + def torch_jit_model_eval(self, model, dataloader, training=False): + if not training: + if dataloader is None: + logger.warning("failed to use PyTorch jit mode due to current dataloader is none.") + return model + example_batch = next(iter(dataloader)) + example_batch = self._prepare_inputs(example_batch) + try: + jit_model = model.eval() + with ContextManagers([self.autocast_smart_context_manager(cache_enabled=False), torch.no_grad()]): + if version.parse(version.parse(torch.__version__).base_version) >= version.parse("1.14.0"): + if isinstance(example_batch, dict): + jit_model = torch.jit.trace(jit_model, example_kwarg_inputs=example_batch, strict=False) + else: + jit_model = torch.jit.trace( + jit_model, + example_kwarg_inputs={key: example_batch[key] for key in example_batch}, + strict=False, + ) + else: + jit_inputs = [] + for key in example_batch: + example_tensor = torch.ones_like(example_batch[key]) + jit_inputs.append(example_tensor) + jit_inputs = tuple(jit_inputs) + jit_model = torch.jit.trace(jit_model, jit_inputs, strict=False) + jit_model = torch.jit.freeze(jit_model) + with torch.no_grad(): + jit_model(**example_batch) + jit_model(**example_batch) + model = jit_model + self.use_cpu_amp = False + self.use_cuda_amp = False + except (RuntimeError, TypeError, ValueError, NameError, IndexError) as e: + logger.warning(f"failed to use PyTorch jit mode due to: {e}.") + + return model + + def ipex_optimize_model(self, model, training=False, dtype=torch.float32): + if not is_ipex_available(): + raise ImportError( + "Using IPEX but IPEX is not installed or IPEX's version does not match current PyTorch, please refer" + " to https://github.com/intel/intel-extension-for-pytorch." + ) + + import intel_extension_for_pytorch as ipex + + if not training: + model.eval() + dtype = torch.bfloat16 if not self.is_in_train and self.args.bf16_full_eval else dtype + # conv_bn_folding is disabled as it fails in symbolic tracing, resulting in ipex warnings + model = ipex.optimize(model, dtype=dtype, level="O1", conv_bn_folding=False, inplace=not self.is_in_train) + else: + if not model.training: + model.train() + model, self.optimizer = ipex.optimize( + model, dtype=dtype, optimizer=self.optimizer, inplace=True, level="O1" + ) + + return model + + def _wrap_model(self, model, training=True, dataloader=None): + if self.args.use_ipex: + dtype = torch.bfloat16 if self.use_cpu_amp else torch.float32 + model = self.ipex_optimize_model(model, training, dtype=dtype) + + if is_sagemaker_mp_enabled(): + # Wrapping the base model twice in a DistributedModel will raise an error. 
+ if isinstance(self.model_wrapped, smp.model.DistributedModel): + return self.model_wrapped + return smp.DistributedModel(model, backward_passes_per_step=self.args.gradient_accumulation_steps) + + # already initialized its own DDP and AMP + if self.deepspeed: + return self.deepspeed + + # train/eval could be run multiple-times - if already wrapped, don't re-wrap it again + if unwrap_model(model) is not model: + return model + + # Mixed precision training with apex (torch < 1.6) + if self.use_apex and training: + model, self.optimizer = amp.initialize(model, self.optimizer, opt_level=self.args.fp16_opt_level) + + # Multi-gpu training (should be after apex fp16 initialization) / 8bit models does not support DDP + if self.args.n_gpu > 1 and not getattr(model, "is_loaded_in_8bit", False): + model = nn.DataParallel(model) + + if self.args.jit_mode_eval: + start_time = time.time() + model = self.torch_jit_model_eval(model, dataloader, training) + self.jit_compilation_time = round(time.time() - start_time, 4) + + # Note: in torch.distributed mode, there's no point in wrapping the model + # inside a DistributedDataParallel as we'll be under `no_grad` anyways. + if not training: + return model + + # Distributed training (should be after apex fp16 initialization) + if self.sharded_ddp is not None: + # Sharded DDP! + if self.sharded_ddp == ShardedDDPOption.SIMPLE: + model = ShardedDDP(model, self.optimizer) + else: + mixed_precision = self.args.fp16 or self.args.bf16 + cpu_offload = ShardedDDPOption.OFFLOAD in self.args.sharded_ddp + zero_3 = self.sharded_ddp == ShardedDDPOption.ZERO_DP_3 + # XXX: Breaking the self.model convention but I see no way around it for now. + if ShardedDDPOption.AUTO_WRAP in self.args.sharded_ddp: + model = auto_wrap(model) + self.model = model = FullyShardedDDP( + model, + mixed_precision=mixed_precision, + reshard_after_forward=zero_3, + cpu_offload=cpu_offload, + ).to(self.args.device) + # Distributed training using PyTorch FSDP + elif self.fsdp is not None: + if not self.args.fsdp_config["xla"]: + # PyTorch FSDP! 
+ from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload, MixedPrecision + from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP + from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy + + if FSDPOption.OFFLOAD in self.args.fsdp: + cpu_offload = CPUOffload(offload_params=True) + else: + cpu_offload = CPUOffload(offload_params=False) + + auto_wrap_policy = None + + if FSDPOption.AUTO_WRAP in self.args.fsdp: + if self.args.fsdp_config["fsdp_min_num_params"] > 0: + auto_wrap_policy = functools.partial( + size_based_auto_wrap_policy, min_num_params=self.args.fsdp_config["fsdp_min_num_params"] + ) + elif self.args.fsdp_config.get("fsdp_transformer_layer_cls_to_wrap", None) is not None: + transformer_cls_to_wrap = set() + for layer_class in self.args.fsdp_config["fsdp_transformer_layer_cls_to_wrap"]: + transformer_cls = get_module_class_from_name(model, layer_class) + if transformer_cls is None: + raise Exception("Could not find the transformer layer class to wrap in the model.") + else: + transformer_cls_to_wrap.add(transformer_cls) + auto_wrap_policy = functools.partial( + transformer_auto_wrap_policy, + # Transformer layer class to wrap + transformer_layer_cls=transformer_cls_to_wrap, + ) + mixed_precision_policy = None + dtype = None + if self.args.fp16: + dtype = torch.float16 + elif self.args.bf16: + dtype = torch.bfloat16 + if dtype is not None: + mixed_precision_policy = MixedPrecision(param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype) + if type(model) != FSDP: + # XXX: Breaking the self.model convention but I see no way around it for now. + signature = inspect.signature(FSDP.__init__).parameters.keys() + kwargs = {} + for arg in ["limit_all_gathers", "forward_prefetch", "backward_prefetch"]: + if arg in signature: + kwargs[arg] = getattr(self, arg) + self.model = model = FSDP( + model, + sharding_strategy=self.fsdp, + cpu_offload=cpu_offload, + auto_wrap_policy=auto_wrap_policy, + mixed_precision=mixed_precision_policy, + device_id=self.args.device, + **kwargs, + ) + else: + try: + from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP + from torch_xla.distributed.fsdp import checkpoint_module + from torch_xla.distributed.fsdp.wrap import ( + size_based_auto_wrap_policy, + transformer_auto_wrap_policy, + ) + except ImportError: + raise ImportError("Missing XLA FSDP related module; please make sure to use torch-xla >= 2.0.") + auto_wrap_policy = None + auto_wrapper_callable = None + if self.args.fsdp_config["fsdp_min_num_params"] > 0: + auto_wrap_policy = functools.partial( + size_based_auto_wrap_policy, min_num_params=self.args.fsdp_config["fsdp_min_num_params"] + ) + elif self.args.fsdp_config.get("fsdp_transformer_layer_cls_to_wrap", None) is not None: + transformer_cls_to_wrap = set() + for layer_class in self.args.fsdp_config["fsdp_transformer_layer_cls_to_wrap"]: + transformer_cls = get_module_class_from_name(model, layer_class) + if transformer_cls is None: + raise Exception("Could not find the transformer layer class to wrap in the model.") + else: + transformer_cls_to_wrap.add(transformer_cls) + auto_wrap_policy = functools.partial( + transformer_auto_wrap_policy, + # Transformer layer class to wrap + transformer_layer_cls=transformer_cls_to_wrap, + ) + fsdp_kwargs = self.args.xla_fsdp_config + if self.args.fsdp_config["xla_fsdp_grad_ckpt"]: + # Apply gradient checkpointing to auto-wrapped sub-modules if specified + def auto_wrapper_callable(m, 
*args, **kwargs): + return FSDP(checkpoint_module(m), *args, **kwargs) + + # Wrap the base model with an outer FSDP wrapper + self.model = model = FSDP( + model, + auto_wrap_policy=auto_wrap_policy, + auto_wrapper_callable=auto_wrapper_callable, + **fsdp_kwargs, + ) + + # Patch `xm.optimizer_step` should not reduce gradients in this case, + # as FSDP does not need gradient reduction over sharded parameters. + def patched_optimizer_step(optimizer, barrier=False, optimizer_args={}): + loss = optimizer.step(**optimizer_args) + if barrier: + xm.mark_step() + return loss + + xm.optimizer_step = patched_optimizer_step + elif is_sagemaker_dp_enabled(): + model = nn.parallel.DistributedDataParallel( + model, device_ids=[int(os.getenv("SMDATAPARALLEL_LOCAL_RANK"))] + ) + elif self.args.local_rank != -1: + kwargs = {} + if self.args.ddp_find_unused_parameters is not None: + kwargs["find_unused_parameters"] = self.args.ddp_find_unused_parameters + elif isinstance(model, PreTrainedModel): + # find_unused_parameters breaks checkpointing as per + # https://github.com/huggingface/transformers/pull/4659#issuecomment-643356021 + kwargs["find_unused_parameters"] = not model.is_gradient_checkpointing + else: + kwargs["find_unused_parameters"] = True + + if self.args.ddp_bucket_cap_mb is not None: + kwargs["bucket_cap_mb"] = self.args.ddp_bucket_cap_mb + if is_torch_neuroncore_available(): + return model + model = nn.parallel.DistributedDataParallel( + model, + device_ids=[self.args.local_rank] if self.args._n_gpu != 0 else None, + output_device=self.args.local_rank if self.args._n_gpu != 0 else None, + **kwargs, + ) + + # torch.compile() needs to be called after wrapping the model with FSDP or DDP + # to ensure that it accounts for the graph breaks required by those wrappers + if self.args.torch_compile: + model = torch.compile(model, backend=self.args.torch_compile_backend, mode=self.args.torch_compile_mode) + + return model + + def train( + self, + resume_from_checkpoint: Optional[Union[str, bool]] = None, + trial: Union["optuna.Trial", Dict[str, Any]] = None, + ignore_keys_for_eval: Optional[List[str]] = None, + **kwargs, + ): + """ + Main training entry point. + + Args: + resume_from_checkpoint (`str` or `bool`, *optional*): + If a `str`, local path to a saved checkpoint as saved by a previous instance of [`Trainer`]. If a + `bool` and equals `True`, load the last checkpoint in *args.output_dir* as saved by a previous instance + of [`Trainer`]. If present, training will resume from the model/optimizer/scheduler states loaded here. + trial (`optuna.Trial` or `Dict[str, Any]`, *optional*): + The trial run or the hyperparameter dictionary for hyperparameter search. + ignore_keys_for_eval (`List[str]`, *optional*) + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions for evaluation during the training. 
+ kwargs: + Additional keyword arguments used to hide deprecated arguments + """ + if resume_from_checkpoint is False: + resume_from_checkpoint = None + + # memory metrics - must set up as early as possible + self._memory_tracker.start() + + args = self.args + + self.is_in_train = True + + # do_train is not a reliable argument, as it might not be set and .train() still called, so + # the following is a workaround: + if (args.fp16_full_eval or args.bf16_full_eval) and not args.do_train: + self._move_model_to_device(self.model, args.device) + + if "model_path" in kwargs: + resume_from_checkpoint = kwargs.pop("model_path") + warnings.warn( + "`model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` " + "instead.", + FutureWarning, + ) + if len(kwargs) > 0: + raise TypeError(f"train() received got unexpected keyword arguments: {', '.join(list(kwargs.keys()))}.") + # This might change the seed so needs to run first. + self._hp_search_setup(trial) + self._train_batch_size = self.args.train_batch_size + + # Model re-init + model_reloaded = False + if self.model_init is not None: + # Seed must be set before instantiating the model when using model_init. + enable_full_determinism(self.args.seed) if self.args.full_determinism else set_seed(self.args.seed) + self.model = self.call_model_init(trial) + model_reloaded = True + # Reinitializes optimizer and scheduler + self.optimizer, self.lr_scheduler = None, None + + # Load potential model checkpoint + if isinstance(resume_from_checkpoint, bool) and resume_from_checkpoint: + resume_from_checkpoint = get_last_checkpoint(args.output_dir) + if resume_from_checkpoint is None: + raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})") + + if resume_from_checkpoint is not None and not is_sagemaker_mp_enabled() and args.deepspeed is None: + self._load_from_checkpoint(resume_from_checkpoint) + + # If model was re-initialized, put it on the right device and update self.model_wrapped + if model_reloaded: + if self.place_model_on_device: + self._move_model_to_device(self.model, args.device) + self.model_wrapped = self.model + + inner_training_loop = find_executable_batch_size( + self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size + ) + return inner_training_loop( + args=args, + resume_from_checkpoint=resume_from_checkpoint, + trial=trial, + ignore_keys_for_eval=ignore_keys_for_eval, + ) + + def _inner_training_loop( + self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None + ): + self._train_batch_size = batch_size + # Data loader and number of training steps + train_dataloader = self.get_train_dataloader() + + # Setting up training control variables: + # number of training epochs: num_train_epochs + # number of training steps per epoch: num_update_steps_per_epoch + # total number of training steps to execute: max_steps + total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size + + len_dataloader = None + if has_length(train_dataloader): + len_dataloader = len(train_dataloader) + num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps + num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1) + num_examples = self.num_examples(train_dataloader) + if args.max_steps > 0: + max_steps = args.max_steps + num_train_epochs = args.max_steps // num_update_steps_per_epoch + int( + args.max_steps % num_update_steps_per_epoch > 0 + ) + # May be slightly incorrect 
if the last batch in the training dataloader has a smaller size but it's + # the best we can do. + num_train_samples = args.max_steps * total_train_batch_size + else: + max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch) + num_train_epochs = math.ceil(args.num_train_epochs) + num_train_samples = self.num_examples(train_dataloader) * args.num_train_epochs + elif args.max_steps > 0: # Rely on max_steps when dataloader does not have a working size + max_steps = args.max_steps + # Setting a very large number of epochs so we go as many times as necessary over the iterator. + num_train_epochs = sys.maxsize + num_update_steps_per_epoch = max_steps + num_examples = total_train_batch_size * args.max_steps + num_train_samples = args.max_steps * total_train_batch_size + else: + raise ValueError( + "args.max_steps must be set to a positive value if dataloader does not have a length, was" + f" {args.max_steps}" + ) + + if DebugOption.UNDERFLOW_OVERFLOW in self.args.debug: + if self.args.n_gpu > 1: + # nn.DataParallel(model) replicates the model, creating new variables and module + # references registered here no longer work on other gpus, breaking the module + raise ValueError( + "Currently --debug underflow_overflow is not supported under DP. Please use DDP" + " (torch.distributed.launch)." + ) + else: + debug_overflow = DebugUnderflowOverflow(self.model) # noqa + + delay_optimizer_creation = ( + self.sharded_ddp is not None + and self.sharded_ddp != ShardedDDPOption.SIMPLE + or is_sagemaker_mp_enabled() + or self.fsdp is not None + ) + if args.deepspeed: + deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( + self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint + ) + self.model = deepspeed_engine.module + self.model_wrapped = deepspeed_engine + self.deepspeed = deepspeed_engine + self.optimizer = optimizer + self.lr_scheduler = lr_scheduler + elif not delay_optimizer_creation: + self.create_optimizer_and_scheduler(num_training_steps=max_steps) + + self.state = TrainerState() + self.state.is_hyper_param_search = trial is not None + + # Activate gradient checkpointing if needed + if args.gradient_checkpointing: + self.model.gradient_checkpointing_enable() + + model = self._wrap_model(self.model_wrapped) + + if is_sagemaker_mp_enabled() and resume_from_checkpoint is not None: + self._load_from_checkpoint(resume_from_checkpoint, model) + + # for the rest of this function `model` is the outside model, whether it was wrapped or not + if model is not self.model: + self.model_wrapped = model + + if delay_optimizer_creation: + self.create_optimizer_and_scheduler(num_training_steps=max_steps) + + # Check if saved optimizer or scheduler states exist + self._load_optimizer_and_scheduler(resume_from_checkpoint) + + # important: at this point: + # self.model is the Transformers Model + # self.model_wrapped is DDP(Transformers Model), Deepspeed(Transformers Model), etc. + + # Train! + logger.info("***** Running training *****") + logger.info(f" Num examples = {num_examples:,}") + logger.info(f" Num Epochs = {num_train_epochs:,}") + logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size:,}") + logger.info(f" Total train batch size (w. 
parallel, distributed & accumulation) = {total_train_batch_size:,}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {max_steps:,}") + logger.info(f" Number of trainable parameters = {get_model_param_count(model, trainable_only=True):,}") + + self.state.epoch = 0 + start_time = time.time() + epochs_trained = 0 + steps_trained_in_current_epoch = 0 + steps_trained_progress_bar = None + + # Check if continuing training from a checkpoint + if resume_from_checkpoint is not None and os.path.isfile( + os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME) + ): + self.state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME)) + epochs_trained = self.state.global_step // num_update_steps_per_epoch + if not args.ignore_data_skip: + steps_trained_in_current_epoch = self.state.global_step % (num_update_steps_per_epoch) + steps_trained_in_current_epoch *= args.gradient_accumulation_steps + else: + steps_trained_in_current_epoch = 0 + + logger.info(" Continuing training from checkpoint, will skip to saved global_step") + logger.info(f" Continuing training from epoch {epochs_trained}") + logger.info(f" Continuing training from global step {self.state.global_step}") + if not args.ignore_data_skip: + if skip_first_batches is None: + logger.info( + f" Will skip the first {epochs_trained} epochs then the first" + f" {steps_trained_in_current_epoch} batches in the first epoch. If this takes a lot of time," + " you can install the latest version of Accelerate with `pip install -U accelerate`.You can" + " also add the `--ignore_data_skip` flag to your launch command, but you will resume the" + " training on data already seen by your model." + ) + else: + logger.info( + f" Will skip the first {epochs_trained} epochs then the first" + f" {steps_trained_in_current_epoch} batches in the first epoch." + ) + if self.is_local_process_zero() and not args.disable_tqdm and skip_first_batches is None: + steps_trained_progress_bar = tqdm(total=steps_trained_in_current_epoch) + steps_trained_progress_bar.set_description("Skipping the first batches") + + # Update the references + self.callback_handler.model = self.model + self.callback_handler.optimizer = self.optimizer + self.callback_handler.lr_scheduler = self.lr_scheduler + self.callback_handler.train_dataloader = train_dataloader + if self.hp_name is not None and self._trial is not None: + # use self._trial because the SigOpt/Optuna hpo only call `_hp_search_setup(trial)` instead of passing trial + # parameter to Train when using DDP. + self.state.trial_name = self.hp_name(self._trial) + if trial is not None: + assignments = trial.assignments if self.hp_search_backend == HPSearchBackend.SIGOPT else trial + self.state.trial_params = hp_params(assignments) + else: + self.state.trial_params = None + # This should be the same if the state has been saved but in case the training arguments changed, it's safer + # to set this after the load. 
+ self.state.max_steps = max_steps + self.state.num_train_epochs = num_train_epochs + self.state.is_local_process_zero = self.is_local_process_zero() + self.state.is_world_process_zero = self.is_world_process_zero() + + # tr_loss is a tensor to avoid synchronization of TPUs through .item() + tr_loss = torch.tensor(0.0).to(args.device) + # _total_loss_scalar is updated everytime .item() has to be called on tr_loss and stores the sum of all losses + self._total_loss_scalar = 0.0 + self._globalstep_last_logged = self.state.global_step + model.zero_grad() + + self.control = self.callback_handler.on_train_begin(args, self.state, self.control) + + # Skip the first epochs_trained epochs to get the random state of the dataloader at the right point. + if not args.ignore_data_skip: + for epoch in range(epochs_trained): + is_random_sampler = hasattr(train_dataloader, "sampler") and isinstance( + train_dataloader.sampler, RandomSampler + ) + if is_torch_less_than_1_11 or not is_random_sampler: + # We just need to begin an iteration to create the randomization of the sampler. + # That was before PyTorch 1.11 however... + for _ in train_dataloader: + break + else: + # Otherwise we need to call the whooooole sampler cause there is some random operation added + # AT THE VERY END! + _ = list(train_dataloader.sampler) + + total_batched_samples = 0 + for epoch in range(epochs_trained, num_train_epochs): + if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler): + train_dataloader.sampler.set_epoch(epoch) + elif hasattr(train_dataloader, "dataset") and isinstance(train_dataloader.dataset, IterableDatasetShard): + train_dataloader.dataset.set_epoch(epoch) + + if is_torch_tpu_available(): + parallel_loader = pl.ParallelLoader(train_dataloader, [args.device]).per_device_loader(args.device) + epoch_iterator = parallel_loader + else: + epoch_iterator = train_dataloader + + # Reset the past mems state at the beginning of each epoch if necessary. 
+ if args.past_index >= 0: + self._past = None + + steps_in_epoch = ( + len(epoch_iterator) + if len_dataloader is not None + else args.max_steps * args.gradient_accumulation_steps + ) + self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control) + + if epoch == epochs_trained and resume_from_checkpoint is not None and steps_trained_in_current_epoch == 0: + self._load_rng_state(resume_from_checkpoint) + + rng_to_sync = False + steps_skipped = 0 + if skip_first_batches is not None and steps_trained_in_current_epoch > 0: + epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch) + steps_skipped = steps_trained_in_current_epoch + steps_trained_in_current_epoch = 0 + rng_to_sync = True + + step = -1 + step_start_time = time.time() + for step, inputs in enumerate(epoch_iterator): + total_batched_samples += 1 + if rng_to_sync: + self._load_rng_state(resume_from_checkpoint) + rng_to_sync = False + + # Skip past any already trained steps if resuming training + if steps_trained_in_current_epoch > 0: + steps_trained_in_current_epoch -= 1 + if steps_trained_progress_bar is not None: + steps_trained_progress_bar.update(1) + if steps_trained_in_current_epoch == 0: + self._load_rng_state(resume_from_checkpoint) + continue + elif steps_trained_progress_bar is not None: + steps_trained_progress_bar.close() + steps_trained_progress_bar = None + + if step % args.gradient_accumulation_steps == 0: + self.control = self.callback_handler.on_step_begin(args, self.state, self.control) + + if ( + (total_batched_samples % args.gradient_accumulation_steps != 0) + and args.local_rank != -1 + and args._no_sync_in_gradient_accumulation + ): + # Avoid unnecessary DDP synchronization since there will be no backward pass on this example. 
+ with model.no_sync(): + tr_loss_step = self.training_step(model, inputs) + else: + tr_loss_step = self.training_step(model, inputs) + + if ( + args.logging_nan_inf_filter + and not is_torch_tpu_available() + and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step)) + ): + # if loss is nan or inf simply add the average of previous logged losses + tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged) + else: + tr_loss += tr_loss_step + + self.current_flos += float(self.floating_point_ops(inputs)) + + # Optimizer step for deepspeed must be called on every step regardless of the value of gradient_accumulation_steps + if self.deepspeed: + self.deepspeed.step() + + if total_batched_samples % args.gradient_accumulation_steps == 0 or ( + # last step in epoch but step is always smaller than gradient_accumulation_steps + steps_in_epoch <= args.gradient_accumulation_steps + and (step + 1) == steps_in_epoch + ): + # Gradient clipping + if args.max_grad_norm is not None and args.max_grad_norm > 0 and not self.deepspeed: + # deepspeed does its own clipping + + if self.do_grad_scaling: + # Reduce gradients first for XLA + if is_torch_tpu_available(): + gradients = xm._fetch_gradients(self.optimizer) + xm.all_reduce("sum", gradients, scale=1.0 / xm.xrt_world_size()) + # AMP: gradients need unscaling + self.scaler.unscale_(self.optimizer) + + if is_sagemaker_mp_enabled() and args.fp16: + self.optimizer.clip_master_grads(args.max_grad_norm) + elif hasattr(self.optimizer, "clip_grad_norm"): + # Some optimizers (like the sharded optimizer) have a specific way to do gradient clipping + self.optimizer.clip_grad_norm(args.max_grad_norm) + elif hasattr(model, "clip_grad_norm_"): + # Some models (like FullyShardedDDP) have a specific way to do gradient clipping + model.clip_grad_norm_(args.max_grad_norm) + else: + # Revert to normal clipping otherwise, handling Apex or full precision + nn.utils.clip_grad_norm_( + amp.master_params(self.optimizer) if self.use_apex else model.parameters(), + args.max_grad_norm, + ) + + # Optimizer step + optimizer_was_run = True + if self.deepspeed: + pass # called outside the loop + elif is_torch_tpu_available(): + if self.do_grad_scaling: + self.scaler.step(self.optimizer) + self.scaler.update() + else: + xm.optimizer_step(self.optimizer) + elif self.do_grad_scaling: + scale_before = self.scaler.get_scale() + self.scaler.step(self.optimizer) + self.scaler.update() + scale_after = self.scaler.get_scale() + optimizer_was_run = scale_before <= scale_after + else: + self.optimizer.step() + + if optimizer_was_run and not self.deepspeed: + self.lr_scheduler.step() + + model.zero_grad() + self.state.global_step += 1 + self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch + self.control = self.callback_handler.on_step_end(args, self.state, self.control) + + self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval) + else: + self.control = self.callback_handler.on_substep_end(args, self.state, self.control) + + if self.control.should_epoch_stop or self.control.should_training_stop: + break + if torch.distributed.get_rank() == 0: + print("time cost: {}".format(time.time() - step_start_time)) + step_start_time = time.time() + if step < 0: + logger.warning( + "There seems to be not a single sample in your epoch_iterator, stopping training at step" + f" {self.state.global_step}! This is expected if you're using an IterableDataset and set" + f" num_steps ({max_steps}) higher than the number of available samples." 
+ ) + self.control.should_training_stop = True + + self.control = self.callback_handler.on_epoch_end(args, self.state, self.control) + self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval) + + if DebugOption.TPU_METRICS_DEBUG in self.args.debug: + if is_torch_tpu_available(): + # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.) + xm.master_print(met.metrics_report()) + else: + logger.warning( + "You enabled PyTorch/XLA debug metrics but you don't have a TPU " + "configured. Check your training configuration if this is unexpected." + ) + if self.control.should_training_stop: + break + + if args.past_index and hasattr(self, "_past"): + # Clean the state at the end of training + delattr(self, "_past") + + logger.info("\n\nTraining completed. Do not forget to share your model on huggingface.co/models =)\n\n") + if args.load_best_model_at_end and self.state.best_model_checkpoint is not None: + # Wait for everyone to get here so we are sur the model has been saved by process 0. + if is_torch_tpu_available(): + xm.rendezvous("load_best_model_at_end") + elif args.local_rank != -1: + dist.barrier() + elif is_sagemaker_mp_enabled(): + smp.barrier() + + self._load_best_model() + + # add remaining tr_loss + self._total_loss_scalar += tr_loss.item() + train_loss = self._total_loss_scalar / self.state.global_step + + metrics = speed_metrics("train", start_time, num_samples=num_train_samples, num_steps=self.state.max_steps) + self.store_flos() + metrics["total_flos"] = self.state.total_flos + metrics["train_loss"] = train_loss + + self.is_in_train = False + + self._memory_tracker.stop_and_update_metrics(metrics) + + self.log(metrics) + + run_dir = self._get_output_dir(trial) + checkpoints_sorted = self._sorted_checkpoints(use_mtime=False, output_dir=run_dir) + + # Delete the last checkpoint when save_total_limit=1 if it's different from the best checkpoint and process allowed to save. 
+ if self.args.should_save and self.state.best_model_checkpoint is not None and self.args.save_total_limit == 1: + for checkpoint in checkpoints_sorted: + if checkpoint != self.state.best_model_checkpoint: + logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit") + shutil.rmtree(checkpoint) + + self.control = self.callback_handler.on_train_end(args, self.state, self.control) + + return TrainOutput(self.state.global_step, train_loss, metrics) + + def _get_output_dir(self, trial): + if self.hp_search_backend is not None and trial is not None: + if self.hp_search_backend == HPSearchBackend.OPTUNA: + run_id = trial.number + elif self.hp_search_backend == HPSearchBackend.RAY: + from ray import tune + + run_id = tune.get_trial_id() + elif self.hp_search_backend == HPSearchBackend.SIGOPT: + run_id = trial.id + elif self.hp_search_backend == HPSearchBackend.WANDB: + import wandb + + run_id = wandb.run.id + run_name = self.hp_name(trial) if self.hp_name is not None else f"run-{run_id}" + run_dir = os.path.join(self.args.output_dir, run_name) + else: + run_dir = self.args.output_dir + return run_dir + + def _load_from_checkpoint(self, resume_from_checkpoint, model=None): + if model is None: + model = self.model + + config_file = os.path.join(resume_from_checkpoint, CONFIG_NAME) + + weights_file = os.path.join(resume_from_checkpoint, WEIGHTS_NAME) + weights_index_file = os.path.join(resume_from_checkpoint, WEIGHTS_INDEX_NAME) + safe_weights_file = os.path.join(resume_from_checkpoint, SAFE_WEIGHTS_NAME) + safe_weights_index_file = os.path.join(resume_from_checkpoint, SAFE_WEIGHTS_INDEX_NAME) + + if not any( + [os.path.isfile(f) for f in [weights_file, safe_weights_file, weights_index_file, safe_weights_index_file]] + ): + raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}") + + logger.info(f"Loading model from {resume_from_checkpoint}.") + + if os.path.isfile(config_file): + config = PretrainedConfig.from_json_file(config_file) + checkpoint_version = config.transformers_version + if checkpoint_version is not None and checkpoint_version != __version__: + logger.warning( + f"You are resuming training from a checkpoint trained with {checkpoint_version} of " + f"Transformers but your current version is {__version__}. This is not recommended and could " + "yield to errors or unwanted behaviors." + ) + + if os.path.isfile(weights_file) or os.path.isfile(safe_weights_file): + # If the model is on the GPU, it still works! + if is_sagemaker_mp_enabled(): + if os.path.isfile(os.path.join(resume_from_checkpoint, "user_content.pt")): + # If the 'user_content.pt' file exists, load with the new smp api. + # Checkpoint must have been saved with the new smp api. + smp.resume_from_checkpoint( + path=resume_from_checkpoint, tag=WEIGHTS_NAME, partial=False, load_optimizer=False + ) + else: + # If the 'user_content.pt' file does NOT exist, load with the old smp api. + # Checkpoint must have been saved with the old smp api. + if hasattr(self.args, "fp16") and self.args.fp16 is True: + logger.warning( + "Enabling FP16 and loading from smp < 1.10 checkpoint together is not suppported." + ) + state_dict = torch.load(weights_file, map_location="cpu") + # Required for smp to not auto-translate state_dict from hf to smp (is already smp). + state_dict["_smp_is_partial"] = False + load_result = model.load_state_dict(state_dict, strict=True) + # release memory + del state_dict + else: + # We load the model state dict on the CPU to avoid an OOM error. 
+ if self.args.save_safetensors and os.path.isfile(safe_weights_file): + state_dict = safetensors.torch.load_file(safe_weights_file, device="cpu") + else: + state_dict = torch.load(weights_file, map_location="cpu") + + # workaround for FSDP bug https://github.com/pytorch/pytorch/issues/82963 + # which takes *args instead of **kwargs + load_result = model.load_state_dict(state_dict, False) + # release memory + del state_dict + self._issue_warnings_after_load(load_result) + else: + # We load the sharded checkpoint + load_result = load_sharded_checkpoint( + model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled(), prefer_safe=self.args.save_safetensors + ) + if not is_sagemaker_mp_enabled(): + self._issue_warnings_after_load(load_result) + + def _load_best_model(self): + logger.info(f"Loading best model from {self.state.best_model_checkpoint} (score: {self.state.best_metric}).") + best_model_path = os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME) + best_safe_model_path = os.path.join(self.state.best_model_checkpoint, SAFE_WEIGHTS_NAME) + model = self.model_wrapped if is_sagemaker_mp_enabled() else self.model + if os.path.exists(best_model_path) or os.path.exists(best_safe_model_path): + if self.deepspeed: + if self.model_wrapped is not None: + # this removes the pre-hooks from the previous engine + self.model_wrapped.destroy() + self.model_wrapped = None + + # temp hack until Deepspeed fixes the problem with resume from an existing engine that did some stepping + deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( + self, + num_training_steps=self.args.max_steps, + resume_from_checkpoint=self.state.best_model_checkpoint, + ) + self.model = deepspeed_engine.module + self.model_wrapped = deepspeed_engine + self.deepspeed = deepspeed_engine + self.optimizer = optimizer + self.lr_scheduler = lr_scheduler + else: + if is_sagemaker_mp_enabled(): + if os.path.isfile(os.path.join(self.state.best_model_checkpoint, "user_content.pt")): + # If the 'user_content.pt' file exists, load with the new smp api. + # Checkpoint must have been saved with the new smp api. + smp.resume_from_checkpoint( + path=self.state.best_model_checkpoint, + tag=WEIGHTS_NAME, + partial=False, + load_optimizer=False, + ) + else: + # If the 'user_content.pt' file does NOT exist, load with the old smp api. + # Checkpoint must have been saved with the old smp api. + if self.args.save_safetensors and os.path.isfile(best_safe_model_path): + state_dict = safetensors.torch.load_file(best_safe_model_path, device="cpu") + else: + state_dict = torch.load(best_model_path, map_location="cpu") + + state_dict["_smp_is_partial"] = False + load_result = model.load_state_dict(state_dict, strict=True) + else: + # We load the model state dict on the CPU to avoid an OOM error. + if self.args.save_safetensors and os.path.isfile(best_safe_model_path): + state_dict = safetensors.torch.load_file(best_safe_model_path, device="cpu") + else: + state_dict = torch.load(best_model_path, map_location="cpu") + + # If the model is on the GPU, it still works! 
+ # workaround for FSDP bug https://github.com/pytorch/pytorch/issues/82963 + # which takes *args instead of **kwargs + load_result = model.load_state_dict(state_dict, False) + if not is_sagemaker_mp_enabled(): + self._issue_warnings_after_load(load_result) + elif os.path.exists(os.path.join(self.state.best_model_checkpoint, WEIGHTS_INDEX_NAME)): + load_result = load_sharded_checkpoint( + model, self.state.best_model_checkpoint, strict=is_sagemaker_mp_enabled() + ) + if not is_sagemaker_mp_enabled(): + self._issue_warnings_after_load(load_result) + else: + logger.warning( + f"Could not locate the best model at {best_model_path}, if you are running a distributed training " + "on multiple nodes, you should activate `--save_on_each_node`." + ) + + def _issue_warnings_after_load(self, load_result): + if len(load_result.missing_keys) != 0: + if self.model._keys_to_ignore_on_save is not None and set(load_result.missing_keys) == set( + self.model._keys_to_ignore_on_save + ): + self.model.tie_weights() + else: + logger.warning(f"There were missing keys in the checkpoint model loaded: {load_result.missing_keys}.") + if len(load_result.unexpected_keys) != 0: + logger.warning( + f"There were unexpected keys in the checkpoint model loaded: {load_result.unexpected_keys}." + ) + + def _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval): + if self.control.should_log: + if is_torch_tpu_available(): + xm.mark_step() + + logs: Dict[str, float] = {} + + # all_gather + mean() to get average loss over all processes + tr_loss_scalar = self._nested_gather(tr_loss).mean().item() + + # reset tr_loss to zero + tr_loss -= tr_loss + + logs["loss"] = round(tr_loss_scalar / (self.state.global_step - self._globalstep_last_logged), 4) + logs["learning_rate"] = self._get_learning_rate() + + self._total_loss_scalar += tr_loss_scalar + self._globalstep_last_logged = self.state.global_step + self.store_flos() + + self.log(logs) + + metrics = None + if self.control.should_evaluate: + if isinstance(self.eval_dataset, dict): + metrics = {} + for eval_dataset_name, eval_dataset in self.eval_dataset.items(): + dataset_metrics = self.evaluate( + eval_dataset=eval_dataset, + ignore_keys=ignore_keys_for_eval, + metric_key_prefix=f"eval_{eval_dataset_name}", + ) + metrics.update(dataset_metrics) + else: + metrics = self.evaluate(ignore_keys=ignore_keys_for_eval) + self._report_to_hp_search(trial, self.state.global_step, metrics) + + if self.control.should_save: + self._save_checkpoint(model, trial, metrics=metrics) + self.control = self.callback_handler.on_save(self.args, self.state, self.control) + + def _load_rng_state(self, checkpoint): + # Load RNG states from `checkpoint` + if checkpoint is None: + return + + if self.args.world_size > 1: + process_index = self.args.process_index + rng_file = os.path.join(checkpoint, f"rng_state_{process_index}.pth") + if not os.path.isfile(rng_file): + logger.info( + f"Didn't find an RNG file for process {process_index}, if you are resuming a training that " + "wasn't launched in a distributed fashion, reproducibility is not guaranteed." + ) + return + else: + rng_file = os.path.join(checkpoint, "rng_state.pth") + if not os.path.isfile(rng_file): + logger.info( + "Didn't find an RNG file, if you are resuming a training that was launched in a distributed " + "fashion, reproducibility is not guaranteed." 
+ ) + return + + checkpoint_rng_state = torch.load(rng_file) + random.setstate(checkpoint_rng_state["python"]) + np.random.set_state(checkpoint_rng_state["numpy"]) + torch.random.set_rng_state(checkpoint_rng_state["cpu"]) + if torch.cuda.is_available(): + if self.args.local_rank != -1: + torch.cuda.random.set_rng_state(checkpoint_rng_state["cuda"]) + else: + try: + torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"]) + except Exception as e: + logger.info( + f"Didn't manage to set back the RNG states of the GPU because of the following error:\n {e}" + "\nThis won't yield the same results as if the training had not been interrupted." + ) + if is_torch_tpu_available(): + xm.set_rng_state(checkpoint_rng_state["xla"]) + + def _save_checkpoint(self, model, trial, metrics=None): + # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we + # want to save except FullyShardedDDP. + # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model" + + # Save model checkpoint + checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}" + + if self.hp_search_backend is None and trial is None: + self.store_flos() + + run_dir = self._get_output_dir(trial=trial) + output_dir = os.path.join(run_dir, checkpoint_folder) + self.save_model(output_dir, _internal_call=True) + if self.deepspeed: + # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed + # config `stage3_gather_16bit_weights_on_model_save` is True + self.deepspeed.save_checkpoint(output_dir) + + # Save optimizer and scheduler + if self.sharded_ddp == ShardedDDPOption.SIMPLE: + self.optimizer.consolidate_state_dict() + + if is_torch_tpu_available(): + xm.rendezvous("saving_optimizer_states") + xm.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME)) + with warnings.catch_warnings(record=True) as caught_warnings: + xm.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME)) + reissue_pt_warnings(caught_warnings) + elif is_sagemaker_mp_enabled(): + opt_state_dict = self.optimizer.local_state_dict(gather_if_shard=False) + smp.barrier() + if smp.rdp_rank() == 0 or smp.state.cfg.shard_optimizer_state: + smp.save( + opt_state_dict, + os.path.join(output_dir, OPTIMIZER_NAME), + partial=True, + v3=smp.state.cfg.shard_optimizer_state, + ) + if self.args.should_save: + with warnings.catch_warnings(record=True) as caught_warnings: + torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME)) + reissue_pt_warnings(caught_warnings) + if self.do_grad_scaling: + torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME)) + elif self.args.should_save and not self.deepspeed: + # deepspeed.save_checkpoint above saves model/optim/sched + torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME)) + with warnings.catch_warnings(record=True) as caught_warnings: + torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME)) + reissue_pt_warnings(caught_warnings) + if self.do_grad_scaling: + torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME)) + + # Determine the new best metric / best model checkpoint + if metrics is not None and self.args.metric_for_best_model is not None: + metric_to_check = self.args.metric_for_best_model + if not metric_to_check.startswith("eval_"): + metric_to_check = f"eval_{metric_to_check}" + metric_value = metrics[metric_to_check] + + operator = np.greater if 
self.args.greater_is_better else np.less + if ( + self.state.best_metric is None + or self.state.best_model_checkpoint is None + or operator(metric_value, self.state.best_metric) + ): + self.state.best_metric = metric_value + self.state.best_model_checkpoint = output_dir + + # Save the Trainer state + if self.args.should_save: + self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) + + # Save RNG state in non-distributed training + rng_states = { + "python": random.getstate(), + "numpy": np.random.get_state(), + "cpu": torch.random.get_rng_state(), + } + if torch.cuda.is_available(): + if self.args.local_rank == -1: + # In non distributed, we save the global CUDA RNG state (will take care of DataParallel) + rng_states["cuda"] = torch.cuda.random.get_rng_state_all() + else: + rng_states["cuda"] = torch.cuda.random.get_rng_state() + + if is_torch_tpu_available(): + rng_states["xla"] = xm.get_rng_state() + + # A process can arrive here before the process 0 has a chance to save the model, in which case output_dir may + # not yet exist. + os.makedirs(output_dir, exist_ok=True) + + if self.args.world_size <= 1: + torch.save(rng_states, os.path.join(output_dir, "rng_state.pth")) + else: + torch.save(rng_states, os.path.join(output_dir, f"rng_state_{self.args.process_index}.pth")) + + if self.args.push_to_hub: + self._push_from_checkpoint(output_dir) + + # Maybe delete some older checkpoints. + if self.args.should_save: + self._rotate_checkpoints(use_mtime=True, output_dir=run_dir) + + def _load_optimizer_and_scheduler(self, checkpoint): + """If optimizer and scheduler states exist, load them.""" + if checkpoint is None: + return + + if self.deepspeed: + # deepspeed loads optimizer/lr_scheduler together with the model in deepspeed_init + return + + checkpoint_file_exists = ( + glob.glob(os.path.join(checkpoint, OPTIMIZER_NAME) + "_*") + if is_sagemaker_mp_enabled() + else os.path.isfile(os.path.join(checkpoint, OPTIMIZER_NAME)) + ) + if checkpoint_file_exists and os.path.isfile(os.path.join(checkpoint, SCHEDULER_NAME)): + # Load in optimizer and scheduler states + if is_torch_tpu_available(): + # On TPU we have to take some extra precautions to properly load the states on the right device. + optimizer_state = torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location="cpu") + with warnings.catch_warnings(record=True) as caught_warnings: + lr_scheduler_state = torch.load(os.path.join(checkpoint, SCHEDULER_NAME), map_location="cpu") + reissue_pt_warnings(caught_warnings) + + xm.send_cpu_data_to_device(optimizer_state, self.args.device) + xm.send_cpu_data_to_device(lr_scheduler_state, self.args.device) + + self.optimizer.load_state_dict(optimizer_state) + self.lr_scheduler.load_state_dict(lr_scheduler_state) + else: + if is_sagemaker_mp_enabled(): + if os.path.isfile(os.path.join(checkpoint, "user_content.pt")): + # Optimizer checkpoint was saved with smp >= 1.10 + def opt_load_hook(mod, opt): + opt.load_state_dict(smp.load(os.path.join(checkpoint, OPTIMIZER_NAME), partial=True)) + + else: + # Optimizer checkpoint was saved with smp < 1.10 + def opt_load_hook(mod, opt): + if IS_SAGEMAKER_MP_POST_1_10: + opt.load_state_dict( + smp.load(os.path.join(checkpoint, OPTIMIZER_NAME), partial=True, back_compat=True) + ) + else: + opt.load_state_dict(smp.load(os.path.join(checkpoint, OPTIMIZER_NAME), partial=True)) + + self.model_wrapped.register_post_step_hook(opt_load_hook) + else: + # We use the CPU when training on one GPU to avoid OOM for GPU RAM when training big models. 
+ # In distributed training however, we load directly on each GPU and risk the GPU OOM as it's more + # likely to get OOM on CPU (since we load num_gpu times the optimizer state + map_location = self.args.device if self.args.world_size > 1 else "cpu" + self.optimizer.load_state_dict( + torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location) + ) + with warnings.catch_warnings(record=True) as caught_warnings: + self.lr_scheduler.load_state_dict(torch.load(os.path.join(checkpoint, SCHEDULER_NAME))) + reissue_pt_warnings(caught_warnings) + if self.do_grad_scaling and os.path.isfile(os.path.join(checkpoint, SCALER_NAME)): + self.scaler.load_state_dict(torch.load(os.path.join(checkpoint, SCALER_NAME))) + + def hyperparameter_search( + self, + hp_space: Optional[Callable[["optuna.Trial"], Dict[str, float]]] = None, + compute_objective: Optional[Callable[[Dict[str, float]], float]] = None, + n_trials: int = 20, + direction: str = "minimize", + backend: Optional[Union["str", HPSearchBackend]] = None, + hp_name: Optional[Callable[["optuna.Trial"], str]] = None, + **kwargs, + ) -> BestRun: + """ + Launch an hyperparameter search using `optuna` or `Ray Tune` or `SigOpt`. The optimized quantity is determined + by `compute_objective`, which defaults to a function returning the evaluation loss when no metric is provided, + the sum of all metrics otherwise. + + + + To use this method, you need to have provided a `model_init` when initializing your [`Trainer`]: we need to + reinitialize the model at each new run. This is incompatible with the `optimizers` argument, so you need to + subclass [`Trainer`] and override the method [`~Trainer.create_optimizer_and_scheduler`] for custom + optimizer/scheduler. + + + + Args: + hp_space (`Callable[["optuna.Trial"], Dict[str, float]]`, *optional*): + A function that defines the hyperparameter search space. Will default to + [`~trainer_utils.default_hp_space_optuna`] or [`~trainer_utils.default_hp_space_ray`] or + [`~trainer_utils.default_hp_space_sigopt`] depending on your backend. + compute_objective (`Callable[[Dict[str, float]], float]`, *optional*): + A function computing the objective to minimize or maximize from the metrics returned by the `evaluate` + method. Will default to [`~trainer_utils.default_compute_objective`]. + n_trials (`int`, *optional*, defaults to 100): + The number of trial runs to test. + direction (`str`, *optional*, defaults to `"minimize"`): + Whether to optimize greater or lower objects. Can be `"minimize"` or `"maximize"`, you should pick + `"minimize"` when optimizing the validation loss, `"maximize"` when optimizing one or several metrics. + backend (`str` or [`~training_utils.HPSearchBackend`], *optional*): + The backend to use for hyperparameter search. Will default to optuna or Ray Tune or SigOpt, depending + on which one is installed. If all are installed, will default to optuna. + hp_name (`Callable[["optuna.Trial"], str]]`, *optional*): + A function that defines the trial/run name. Will default to None. + kwargs (`Dict[str, Any]`, *optional*): + Additional keyword arguments passed along to `optuna.create_study` or `ray.tune.run`. 
For more + information see: + + - the documentation of + [optuna.create_study](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.create_study.html) + - the documentation of [tune.run](https://docs.ray.io/en/latest/tune/api_docs/execution.html#tune-run) + - the documentation of [sigopt](https://app.sigopt.com/docs/endpoints/experiments/create) + + Returns: + [`trainer_utils.BestRun`]: All the information about the best run. Experiment summary can be found in + `run_summary` attribute for Ray backend. + """ + if backend is None: + backend = default_hp_search_backend() + if backend is None: + raise RuntimeError( + "At least one of optuna or ray should be installed. " + "To install optuna run `pip install optuna`. " + "To install ray run `pip install ray[tune]`. " + "To install sigopt run `pip install sigopt`." + ) + backend = HPSearchBackend(backend) + if backend == HPSearchBackend.OPTUNA and not is_optuna_available(): + raise RuntimeError("You picked the optuna backend, but it is not installed. Use `pip install optuna`.") + if backend == HPSearchBackend.RAY and not is_ray_tune_available(): + raise RuntimeError( + "You picked the Ray Tune backend, but it is not installed. Use `pip install 'ray[tune]'`." + ) + if backend == HPSearchBackend.SIGOPT and not is_sigopt_available(): + raise RuntimeError("You picked the sigopt backend, but it is not installed. Use `pip install sigopt`.") + if backend == HPSearchBackend.WANDB and not is_wandb_available(): + raise RuntimeError("You picked the wandb backend, but it is not installed. Use `pip install wandb`.") + self.hp_search_backend = backend + if self.model_init is None: + raise RuntimeError( + "To use hyperparameter search, you need to pass your model through a model_init function." + ) + + self.hp_space = default_hp_space[backend] if hp_space is None else hp_space + self.hp_name = hp_name + self.compute_objective = default_compute_objective if compute_objective is None else compute_objective + + backend_dict = { + HPSearchBackend.OPTUNA: run_hp_search_optuna, + HPSearchBackend.RAY: run_hp_search_ray, + HPSearchBackend.SIGOPT: run_hp_search_sigopt, + HPSearchBackend.WANDB: run_hp_search_wandb, + } + best_run = backend_dict[backend](self, n_trials, direction, **kwargs) + + self.hp_search_backend = None + return best_run + + def log(self, logs: Dict[str, float]) -> None: + """ + Log `logs` on the various objects watching training. + + Subclass and override this method to inject custom behavior. + + Args: + logs (`Dict[str, float]`): + The values to log. + """ + if self.state.epoch is not None: + logs["epoch"] = round(self.state.epoch, 2) + + output = {**logs, **{"step": self.state.global_step}} + self.state.log_history.append(output) + self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs) + + def _prepare_input(self, data: Union[torch.Tensor, Any]) -> Union[torch.Tensor, Any]: + """ + Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors. + """ + if isinstance(data, Mapping): + return type(data)({k: self._prepare_input(v) for k, v in data.items()}) + elif isinstance(data, (tuple, list)): + return type(data)(self._prepare_input(v) for v in data) + elif isinstance(data, torch.Tensor): + kwargs = {"device": self.args.device} + if self.deepspeed and (torch.is_floating_point(data) or torch.is_complex(data)): + # NLP models inputs are int/uint and those get adjusted to the right dtype of the + # embedding. 
Other models such as wav2vec2's inputs are already float and thus + # may need special handling to match the dtypes of the model + kwargs.update({"dtype": self.args.hf_deepspeed_config.dtype()}) + return data.to(**kwargs) + return data + + def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]: + """ + Prepare `inputs` before feeding them to the model, converting them to tensors if they are not already and + handling potential state. + """ + inputs = self._prepare_input(inputs) + if len(inputs) == 0: + raise ValueError( + "The batch received was empty, your model won't be able to train on it. Double-check that your " + f"training dataset contains keys expected by the model: {','.join(self._signature_columns)}." + ) + if self.args.past_index >= 0 and self._past is not None: + inputs["mems"] = self._past + + return inputs + + def compute_loss_context_manager(self): + """ + A helper wrapper to group together context managers. + """ + return self.autocast_smart_context_manager() + + def autocast_smart_context_manager(self, cache_enabled: Optional[bool] = True): + """ + A helper wrapper that creates an appropriate context manager for `autocast` while feeding it the desired + arguments, depending on the situation. + """ + if self.use_cuda_amp or self.use_cpu_amp: + if is_torch_greater_or_equal_than_1_10: + ctx_manager = ( + torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype) + if self.use_cpu_amp + else torch.cuda.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype) + ) + else: + ctx_manager = torch.cuda.amp.autocast() + else: + ctx_manager = contextlib.nullcontext() if sys.version_info >= (3, 7) else contextlib.suppress() + + return ctx_manager + + def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor: + """ + Perform a training step on a batch of inputs. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Module`): + The model to train. + inputs (`Dict[str, Union[torch.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + + Return: + `torch.Tensor`: The tensor with training loss on this batch. + """ + model.train() + inputs = self._prepare_inputs(inputs) + + if is_sagemaker_mp_enabled(): + loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps) + return loss_mb.reduce_mean().detach().to(self.args.device) + + with self.compute_loss_context_manager(): + loss = self.compute_loss(model, inputs) + + if self.args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + + if self.args.gradient_accumulation_steps > 1 and not self.deepspeed: + # deepspeed handles loss scaling by gradient_accumulation_steps in its `backward` + loss = loss / self.args.gradient_accumulation_steps + + if self.do_grad_scaling: + self.scaler.scale(loss).backward() + elif self.use_apex: + with amp.scale_loss(loss, self.optimizer) as scaled_loss: + scaled_loss.backward() + elif self.deepspeed: + # loss gets scaled under gradient_accumulation_steps in deepspeed + loss = self.deepspeed.backward(loss) + else: + loss.backward() + + return loss.detach() + + def compute_loss(self, model, inputs, return_outputs=False): + """ + How the loss is computed by Trainer. By default, all models return the loss in the first element. 
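+        If a label smoother has been configured (`label_smoothing_factor` in the training arguments), the labels are
+        popped from the inputs and the loss is recomputed by the label smoother on the model outputs.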
+ + Subclass and override for custom behavior. + """ + if self.label_smoother is not None and "labels" in inputs: + labels = inputs.pop("labels") + else: + labels = None + outputs = model(**inputs) + # Save past state if it exists + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index] + + if labels is not None: + if unwrap_model(model)._get_name() in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values(): + loss = self.label_smoother(outputs, labels, shift_labels=True) + else: + loss = self.label_smoother(outputs, labels) + else: + if isinstance(outputs, dict) and "loss" not in outputs: + raise ValueError( + "The model did not return a loss from the inputs, only the following keys: " + f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}." + ) + # We don't use .loss here since the model may return tuples instead of ModelOutput. + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0] + + return (loss, outputs) if return_outputs else loss + + def is_local_process_zero(self) -> bool: + """ + Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several + machines) main process. + """ + return self.args.local_process_index == 0 + + def is_world_process_zero(self) -> bool: + """ + Whether or not this process is the global main process (when training in a distributed fashion on several + machines, this is only going to be `True` for one process). + """ + # Special case for SageMaker ModelParallel since there process_index is dp_process_index, not the global + # process index. + if is_sagemaker_mp_enabled(): + return smp.rank() == 0 + else: + return self.args.process_index == 0 + + def save_model(self, output_dir: Optional[str] = None, _internal_call: bool = False): + """ + Will save the model, so you can reload it using `from_pretrained()`. + + Will only save from the main process. + """ + + if output_dir is None: + output_dir = self.args.output_dir + + if is_torch_tpu_available(): + self._save_tpu(output_dir) + elif is_sagemaker_mp_enabled(): + # Calling the state_dict needs to be done on the wrapped model and on all processes. + os.makedirs(output_dir, exist_ok=True) + state_dict = self.model_wrapped.state_dict() + if self.args.should_save: + self._save(output_dir, state_dict=state_dict) + if IS_SAGEMAKER_MP_POST_1_10: + # 'user_content.pt' indicates model state_dict saved with smp >= 1.10 + Path(os.path.join(output_dir, "user_content.pt")).touch() + elif ( + ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp + or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp + or self.fsdp is not None + ): + state_dict = self.model.state_dict() + + if self.args.should_save: + self._save(output_dir, state_dict=state_dict) + elif self.deepspeed: + # this takes care of everything as long as we aren't under zero3 + if self.args.should_save: + self._save(output_dir) + + if is_deepspeed_zero3_enabled(): + # It's too complicated to try to override different places where the weights dump gets + # saved, so since under zero3 the file is bogus, simply delete it. The user should + # either user deepspeed checkpoint to resume or to recover full weights use + # zero_to_fp32.py stored in the checkpoint. 
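+                # Note (added for clarity): under ZeRO stage 3 each rank only holds a shard of the parameters,
+                # so the weights file written by the generic save path above does not contain usable weights on
+                # its own, which is why it is removed below.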
+ if self.args.should_save: + file = os.path.join(output_dir, WEIGHTS_NAME) + if os.path.isfile(file): + # logger.info(f"deepspeed zero3: removing {file}, see zero_to_fp32.py to recover weights") + os.remove(file) + + # now save the real model if stage3_gather_16bit_weights_on_model_save=True + # if false it will not be saved. + # This must be called on all ranks + if not self.deepspeed.save_16bit_model(output_dir, WEIGHTS_NAME): + logger.warning( + "deepspeed.save_16bit_model didn't save the model, since" + " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use" + " zero_to_fp32.py to recover weights" + ) + self.deepspeed.save_checkpoint(output_dir) + + elif self.args.should_save: + self._save(output_dir) + + # Push to the Hub when `save_model` is called by the user. + if self.args.push_to_hub and not _internal_call: + self.push_to_hub(commit_message="Model save") + + def _save_tpu(self, output_dir: Optional[str] = None): + output_dir = output_dir if output_dir is not None else self.args.output_dir + logger.info(f"Saving model checkpoint to {output_dir}") + + if xm.is_master_ordinal(): + os.makedirs(output_dir, exist_ok=True) + torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) + + # Save a trained model and configuration using `save_pretrained()`. + # They can then be reloaded using `from_pretrained()` + xm.rendezvous("saving_checkpoint") + if not isinstance(self.model, PreTrainedModel): + if isinstance(unwrap_model(self.model), PreTrainedModel): + unwrap_model(self.model).save_pretrained( + output_dir, + is_main_process=self.args.should_save, + state_dict=self.model.state_dict(), + save_function=xm.save, + ) + else: + logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.") + state_dict = self.model.state_dict() + xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + else: + self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save) + if self.tokenizer is not None and self.args.should_save: + self.tokenizer.save_pretrained(output_dir) + + def _save(self, output_dir: Optional[str] = None, state_dict=None): + # If we are executing this function, we are the process zero, so we don't check for that. + output_dir = output_dir if output_dir is not None else self.args.output_dir + os.makedirs(output_dir, exist_ok=True) + logger.info(f"Saving model checkpoint to {output_dir}") + # Save a trained model and configuration using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + if not isinstance(self.model, PreTrainedModel): + if state_dict is None: + state_dict = self.model.state_dict() + + if isinstance(unwrap_model(self.model), PreTrainedModel): + unwrap_model(self.model).save_pretrained( + output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors + ) + else: + logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.") + if self.args.save_safetensors: + safetensors.torch.save_file(state_dict, os.path.join(output_dir, SAFE_WEIGHTS_NAME)) + else: + torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + else: + self.model.save_pretrained( + output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors + ) + + if self.tokenizer is not None: + self.tokenizer.save_pretrained(output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) + + def store_flos(self): + # Storing the number of floating-point operations that went into the model + if self.args.local_rank != -1: + self.state.total_flos += ( + distributed_broadcast_scalars([self.current_flos], device=self.args.device).sum().item() + ) + self.current_flos = 0 + else: + self.state.total_flos += self.current_flos + self.current_flos = 0 + + def _sorted_checkpoints( + self, output_dir=None, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False + ) -> List[str]: + ordering_and_checkpoint_path = [] + + glob_checkpoints = [str(x) for x in Path(output_dir).glob(f"{checkpoint_prefix}-*") if os.path.isdir(x)] + + for path in glob_checkpoints: + if use_mtime: + ordering_and_checkpoint_path.append((os.path.getmtime(path), path)) + else: + regex_match = re.match(f".*{checkpoint_prefix}-([0-9]+)", path) + if regex_match is not None and regex_match.groups() is not None: + ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path)) + + checkpoints_sorted = sorted(ordering_and_checkpoint_path) + checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted] + # Make sure we don't delete the best model. + if self.state.best_model_checkpoint is not None: + best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint))) + for i in range(best_model_index, len(checkpoints_sorted) - 2): + checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i] + return checkpoints_sorted + + def _rotate_checkpoints(self, use_mtime=False, output_dir=None) -> None: + if self.args.save_total_limit is None or self.args.save_total_limit <= 0: + return + + # Check if we should delete older checkpoint(s) + checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime, output_dir=output_dir) + if len(checkpoints_sorted) <= self.args.save_total_limit: + return + + # If save_total_limit=1 with load_best_model_at_end=True, we could end up deleting the last checkpoint, which + # we don't do to allow resuming. 
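+        # Illustrative example: with save_total_limit=2 and checkpoints
+        # [checkpoint-500, checkpoint-1000, checkpoint-1500] where checkpoint-1000 is the best model,
+        # only checkpoint-500 is deleted below; the best and the most recent checkpoints are both kept.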
+ save_total_limit = self.args.save_total_limit + if ( + self.state.best_model_checkpoint is not None + and self.args.save_total_limit == 1 + and checkpoints_sorted[-1] != self.state.best_model_checkpoint + ): + save_total_limit = 2 + + number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - save_total_limit) + checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete] + for checkpoint in checkpoints_to_be_deleted: + logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit") + shutil.rmtree(checkpoint, ignore_errors=True) + + def evaluate( + self, + eval_dataset: Optional[Dataset] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> Dict[str, float]: + """ + Run evaluation and returns metrics. + + The calling script will be responsible for providing a method to compute metrics, as they are task-dependent + (pass it to the init `compute_metrics` argument). + + You can also subclass and override this method to inject custom behavior. + + Args: + eval_dataset (`Dataset`, *optional*): + Pass a dataset if you wish to override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns + not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__` + method. + ignore_keys (`List[str]`, *optional*): + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions. + metric_key_prefix (`str`, *optional*, defaults to `"eval"`): + An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named + "eval_bleu" if the prefix is "eval" (default) + + Returns: + A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The + dictionary also contains the epoch number which comes from the training state. + """ + # memory metrics - must set up as early as possible + self._memory_tracker.start() + + eval_dataloader = self.get_eval_dataloader(eval_dataset) + start_time = time.time() + + eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + if f"{metric_key_prefix}_jit_compilation_time" in output.metrics: + start_time += output.metrics[f"{metric_key_prefix}_jit_compilation_time"] + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + if DebugOption.TPU_METRICS_DEBUG in self.args.debug: + # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.) + xm.master_print(met.metrics_report()) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return output.metrics + + def predict( + self, test_dataset: Dataset, ignore_keys: Optional[List[str]] = None, metric_key_prefix: str = "test" + ) -> PredictionOutput: + """ + Run prediction and returns predictions and potential metrics. 
+ + Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method + will also return metrics, like in `evaluate()`. + + Args: + test_dataset (`Dataset`): + Dataset to run the predictions on. If it is an `datasets.Dataset`, columns not accepted by the + `model.forward()` method are automatically removed. Has to implement the method `__len__` + ignore_keys (`List[str]`, *optional*): + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions. + metric_key_prefix (`str`, *optional*, defaults to `"test"`): + An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named + "test_bleu" if the prefix is "test" (default) + + + + If your predictions or labels have different sequence length (for instance because you're doing dynamic padding + in a token classification task) the predictions will be padded (on the right) to allow for concatenation into + one array. The padding index is -100. + + + + Returns: *NamedTuple* A namedtuple with the following keys: + + - predictions (`np.ndarray`): The predictions on `test_dataset`. + - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some). + - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained + labels). + """ + # memory metrics - must set up as early as possible + self._memory_tracker.start() + + test_dataloader = self.get_test_dataloader(test_dataset) + start_time = time.time() + + eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop + output = eval_loop( + test_dataloader, description="Prediction", ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix + ) + total_batch_size = self.args.eval_batch_size * self.args.world_size + if f"{metric_key_prefix}_jit_compilation_time" in output.metrics: + start_time += output.metrics[f"{metric_key_prefix}_jit_compilation_time"] + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.control = self.callback_handler.on_predict(self.args, self.state, self.control, output.metrics) + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return PredictionOutput(predictions=output.predictions, label_ids=output.label_ids, metrics=output.metrics) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + + Works both with or without labels. 
+ """ + args = self.args + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + # if eval is called w/o train init deepspeed here + if args.deepspeed and not self.deepspeed: + # XXX: eval doesn't have `resume_from_checkpoint` arg but we should be able to do eval + # from the checkpoint eventually + deepspeed_engine, _, _ = deepspeed_init( + self, num_training_steps=0, resume_from_checkpoint=None, inference=True + ) + self.model = deepspeed_engine.module + self.model_wrapped = deepspeed_engine + self.deepspeed = deepspeed_engine + + model = self._wrap_model(self.model, training=False, dataloader=dataloader) + + # if full fp16 or bf16 eval is wanted and this ``evaluation`` or ``predict`` isn't called + # while ``train`` is running, cast it to the right dtype first and then put on device + if not self.is_in_train: + if args.fp16_full_eval: + model = model.to(dtype=torch.float16, device=args.device) + elif args.bf16_full_eval: + model = model.to(dtype=torch.bfloat16, device=args.device) + + batch_size = self.args.eval_batch_size + + logger.info(f"***** Running {description} *****") + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + else: + logger.info(" Num examples: Unknown") + logger.info(f" Batch size = {batch_size}") + + model.eval() + + self.callback_handler.eval_dataloader = dataloader + # Do this before wrapping. + eval_dataset = getattr(dataloader, "dataset", None) + + if is_torch_tpu_available(): + dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device) + + if args.past_index >= 0: + self._past = None + + # Initialize containers + # losses/preds/labels on GPU/TPU (accumulated for eval_accumulation_steps) + losses_host = None + preds_host = None + labels_host = None + inputs_host = None + + # losses/preds/labels on CPU (final containers) + all_losses = None + all_preds = None + all_labels = None + all_inputs = None + # Will be useful when we have an iterable dataset so don't know its length. + + observed_num_examples = 0 + # Main evaluation loop + for step, inputs in enumerate(dataloader): + # Update the observed num examples + observed_batch_size = find_batch_size(inputs) + if observed_batch_size is not None: + observed_num_examples += observed_batch_size + # For batch samplers, batch_size is not known by the dataloader in advance. 
+ if batch_size is None: + batch_size = observed_batch_size + + # Prediction step + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + inputs_decode = self._prepare_input(inputs["input_ids"]) if args.include_inputs_for_metrics else None + + if is_torch_tpu_available(): + xm.mark_step() + + # Update containers on host + if loss is not None: + losses = self._nested_gather(loss.repeat(batch_size)) + losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0) + if labels is not None: + labels = self._pad_across_processes(labels) + labels = self._nested_gather(labels) + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if inputs_decode is not None: + inputs_decode = self._pad_across_processes(inputs_decode) + inputs_decode = self._nested_gather(inputs_decode) + inputs_host = ( + inputs_decode + if inputs_host is None + else nested_concat(inputs_host, inputs_decode, padding_index=-100) + ) + if logits is not None: + logits = self._pad_across_processes(logits) + logits = self._nested_gather(logits) + if self.preprocess_logits_for_metrics is not None: + logits = self.preprocess_logits_for_metrics(logits, labels) + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + + # Gather all tensors and put them back on the CPU if we have done enough accumulation steps. + if args.eval_accumulation_steps is not None and (step + 1) % args.eval_accumulation_steps == 0: + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if inputs_host is not None: + inputs_decode = nested_numpify(inputs_host) + all_inputs = ( + inputs_decode + if all_inputs is None + else nested_concat(all_inputs, inputs_decode, padding_index=-100) + ) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = ( + labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + ) + + # Set back to None to begin a new accumulation + losses_host, preds_host, inputs_host, labels_host = None, None, None, None + + if args.past_index and hasattr(self, "_past"): + # Clean the state at the end of the evaluation loop + delattr(self, "_past") + + # Gather all remaining tensors and put them back on the CPU + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if inputs_host is not None: + inputs_decode = nested_numpify(inputs_host) + all_inputs = ( + inputs_decode if all_inputs is None else nested_concat(all_inputs, inputs_decode, padding_index=-100) + ) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + + # Number of samples + if has_length(eval_dataset): + num_samples = len(eval_dataset) + # The instance check is weird and does not actually check for the type, 
but whether the dataset has the right + # methods. Therefore we need to make sure it also has the attribute. + elif isinstance(eval_dataset, IterableDatasetShard) and getattr(eval_dataset, "num_examples", 0) > 0: + num_samples = eval_dataset.num_examples + else: + if has_length(dataloader): + num_samples = self.num_examples(dataloader) + else: # both len(dataloader.dataset) and len(dataloader) fail + num_samples = observed_num_examples + if num_samples == 0 and observed_num_examples > 0: + num_samples = observed_num_examples + + # Number of losses has been rounded to a multiple of batch_size and in a distributed training, the number of + # samplers has been rounded to a multiple of batch_size, so we truncate. + if all_losses is not None: + all_losses = all_losses[:num_samples] + if all_preds is not None: + all_preds = nested_truncate(all_preds, num_samples) + if all_labels is not None: + all_labels = nested_truncate(all_labels, num_samples) + if all_inputs is not None: + all_inputs = nested_truncate(all_inputs, num_samples) + + # Metrics! + if self.compute_metrics is not None and all_preds is not None and all_labels is not None: + if args.include_inputs_for_metrics: + metrics = self.compute_metrics( + EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs) + ) + else: + metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels)) + else: + metrics = {} + + # To be JSON-serializable, we need to remove numpy types or zero-d tensors + metrics = denumpify_detensorize(metrics) + + if all_losses is not None: + metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item() + if hasattr(self, "jit_compilation_time"): + metrics[f"{metric_key_prefix}_jit_compilation_time"] = self.jit_compilation_time + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples) + + def _nested_gather(self, tensors, name=None): + """ + Gather value of `tensors` (tensor or list/tuple of nested tensors) and convert them to numpy before + concatenating them to `gathered` + """ + if tensors is None: + return + if is_torch_tpu_available(): + if name is None: + name = "nested_gather" + tensors = nested_xla_mesh_reduce(tensors, name) + elif is_sagemaker_mp_enabled(): + tensors = smp_gather(tensors) + elif self.args.local_rank != -1: + tensors = distributed_concat(tensors) + return tensors + + # Copied from Accelerate. + def _pad_across_processes(self, tensor, pad_index=-100): + """ + Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so + they can safely be gathered. + """ + if isinstance(tensor, (list, tuple)): + return type(tensor)(self._pad_across_processes(t, pad_index=pad_index) for t in tensor) + elif isinstance(tensor, dict): + return type(tensor)({k: self._pad_across_processes(v, pad_index=pad_index) for k, v in tensor.items()}) + elif not isinstance(tensor, torch.Tensor): + raise TypeError( + f"Can't pad the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors." 
+ ) + + if len(tensor.shape) < 2: + return tensor + # Gather all sizes + size = torch.tensor(tensor.shape, device=tensor.device)[None] + sizes = self._nested_gather(size).cpu() + + max_size = max(s[1] for s in sizes) + # When extracting XLA graphs for compilation, max_size is 0, + # so use inequality to avoid errors. + if tensor.shape[1] >= max_size: + return tensor + + # Then pad to the maximum size + old_size = tensor.shape + new_size = list(old_size) + new_size[1] = max_size + new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index + new_tensor[:, : old_size[1]] = tensor + return new_tensor + + def prediction_step( + self, + model: nn.Module, + inputs: Dict[str, Union[torch.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: + """ + Perform an evaluation step on `model` using `inputs`. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Module`): + The model to evaluate. + inputs (`Dict[str, Union[torch.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + prediction_loss_only (`bool`): + Whether or not to return the loss only. + ignore_keys (`List[str]`, *optional*): + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions. + + Return: + Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, + logits and labels (each being optional). + """ + has_labels = False if len(self.label_names) == 0 else all(inputs.get(k) is not None for k in self.label_names) + # For CLIP-like models capable of returning loss values. + # If `return_loss` is not specified or being `None` in `inputs`, we check if the default value of `return_loss` + # is `True` in `model.forward`. + return_loss = inputs.get("return_loss", None) + if return_loss is None: + return_loss = self.can_return_loss + loss_without_labels = True if len(self.label_names) == 0 and return_loss else False + + inputs = self._prepare_inputs(inputs) + if ignore_keys is None: + if hasattr(self.model, "config"): + ignore_keys = getattr(self.model.config, "keys_to_ignore_at_inference", []) + else: + ignore_keys = [] + + # labels may be popped when computing the loss (label smoothing for instance) so we grab them first. 
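+        # For most models `label_names` is just ["labels"], in which case `labels` below ends up being a single
+        # tensor rather than a tuple of tensors.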
+ if has_labels or loss_without_labels: + labels = nested_detach(tuple(inputs.get(name) for name in self.label_names)) + if len(labels) == 1: + labels = labels[0] + else: + labels = None + + with torch.no_grad(): + if is_sagemaker_mp_enabled(): + raw_outputs = smp_forward_only(model, inputs) + if has_labels or loss_without_labels: + if isinstance(raw_outputs, dict): + loss_mb = raw_outputs["loss"] + logits_mb = tuple(v for k, v in raw_outputs.items() if k not in ignore_keys + ["loss"]) + else: + loss_mb = raw_outputs[0] + logits_mb = raw_outputs[1:] + + loss = loss_mb.reduce_mean().detach().cpu() + logits = smp_nested_concat(logits_mb) + else: + loss = None + if isinstance(raw_outputs, dict): + logits_mb = tuple(v for k, v in raw_outputs.items() if k not in ignore_keys) + else: + logits_mb = raw_outputs + logits = smp_nested_concat(logits_mb) + else: + if has_labels or loss_without_labels: + with self.compute_loss_context_manager(): + loss, outputs = self.compute_loss(model, inputs, return_outputs=True) + loss = loss.mean().detach() + + if isinstance(outputs, dict): + logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"]) + else: + logits = outputs[1:] + else: + loss = None + with self.compute_loss_context_manager(): + outputs = model(**inputs) + if isinstance(outputs, dict): + logits = tuple(v for k, v in outputs.items() if k not in ignore_keys) + else: + logits = outputs + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index - 1] + + if prediction_loss_only: + return (loss, None, None) + + logits = nested_detach(logits) + if len(logits) == 1: + logits = logits[0] + + return (loss, logits, labels) + + def floating_point_ops(self, inputs: Dict[str, Union[torch.Tensor, Any]]): + """ + For models that inherit from [`PreTrainedModel`], uses that method to compute the number of floating point + operations for every backward + forward pass. If using another model, either implement such a method in the + model or subclass and override this method. + + Args: + inputs (`Dict[str, Union[torch.Tensor, Any]]`): + The inputs and targets of the model. + + Returns: + `int`: The number of floating-point operations. + """ + if hasattr(self.model, "floating_point_ops"): + return self.model.floating_point_ops(inputs) + else: + return 0 + + def init_git_repo(self, at_init: bool = False): + """ + Initializes a git repo in `self.args.hub_model_id`. + + Args: + at_init (`bool`, *optional*, defaults to `False`): + Whether this function is called before any training or not. If `self.args.overwrite_output_dir` is + `True` and `at_init` is `True`, the path to the repo (which is `self.args.output_dir`) might be wiped + out. + """ + if not self.is_world_process_zero(): + return + if self.args.hub_model_id is None: + repo_name = Path(self.args.output_dir).absolute().name + else: + repo_name = self.args.hub_model_id + if "/" not in repo_name: + repo_name = get_full_repo_name(repo_name, token=self.args.hub_token) + + # Make sure the repo exists. 
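+        # (`exist_ok=True` below means this call does not fail if the repository already exists on the Hub.)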
+ create_repo(repo_name, token=self.args.hub_token, private=self.args.hub_private_repo, exist_ok=True) + try: + self.repo = Repository(self.args.output_dir, clone_from=repo_name, token=self.args.hub_token) + except EnvironmentError: + if self.args.overwrite_output_dir and at_init: + # Try again after wiping output_dir + shutil.rmtree(self.args.output_dir) + self.repo = Repository(self.args.output_dir, clone_from=repo_name, token=self.args.hub_token) + else: + raise + + self.repo.git_pull() + + # By default, ignore the checkpoint folders + if ( + not os.path.exists(os.path.join(self.args.output_dir, ".gitignore")) + and self.args.hub_strategy != HubStrategy.ALL_CHECKPOINTS + ): + with open(os.path.join(self.args.output_dir, ".gitignore"), "w", encoding="utf-8") as writer: + writer.writelines(["checkpoint-*/"]) + + # Add "*.sagemaker" to .gitignore if using SageMaker + if os.environ.get("SM_TRAINING_ENV"): + self._add_sm_patterns_to_gitignore() + + self.push_in_progress = None + + def create_model_card( + self, + language: Optional[str] = None, + license: Optional[str] = None, + tags: Union[str, List[str], None] = None, + model_name: Optional[str] = None, + finetuned_from: Optional[str] = None, + tasks: Union[str, List[str], None] = None, + dataset_tags: Union[str, List[str], None] = None, + dataset: Union[str, List[str], None] = None, + dataset_args: Union[str, List[str], None] = None, + ): + """ + Creates a draft of a model card using the information available to the `Trainer`. + + Args: + language (`str`, *optional*): + The language of the model (if applicable) + license (`str`, *optional*): + The license of the model. Will default to the license of the pretrained model used, if the original + model given to the `Trainer` comes from a repo on the Hub. + tags (`str` or `List[str]`, *optional*): + Some tags to be included in the metadata of the model card. + model_name (`str`, *optional*): + The name of the model. + finetuned_from (`str`, *optional*): + The name of the model used to fine-tune this one (if applicable). Will default to the name of the repo + of the original model given to the `Trainer` (if it comes from the Hub). + tasks (`str` or `List[str]`, *optional*): + One or several task identifiers, to be included in the metadata of the model card. + dataset_tags (`str` or `List[str]`, *optional*): + One or several dataset tags, to be included in the metadata of the model card. + dataset (`str` or `List[str]`, *optional*): + One or several dataset identifiers, to be included in the metadata of the model card. + dataset_args (`str` or `List[str]`, *optional*): + One or several dataset arguments, to be included in the metadata of the model card. + """ + if not self.is_world_process_zero(): + return + + training_summary = TrainingSummary.from_trainer( + self, + language=language, + license=license, + tags=tags, + model_name=model_name, + finetuned_from=finetuned_from, + tasks=tasks, + dataset_tags=dataset_tags, + dataset=dataset, + dataset_args=dataset_args, + ) + model_card = training_summary.to_model_card() + with open(os.path.join(self.args.output_dir, "README.md"), "w") as f: + f.write(model_card) + + def _push_from_checkpoint(self, checkpoint_folder): + # Only push from one node. + if not self.is_world_process_zero() or self.args.hub_strategy == HubStrategy.END: + return + # If we haven't finished the last push, we don't do this one. 
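+        # (`push_in_progress` is the handle returned by the non-blocking `repo.push_to_hub` call made further
+        # down in this method.)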
+ if self.push_in_progress is not None and not self.push_in_progress.is_done: + return + + output_dir = self.args.output_dir + # To avoid a new synchronization of all model weights, we just copy the file from the checkpoint folder + modeling_files = [CONFIG_NAME, WEIGHTS_NAME, SAFE_WEIGHTS_NAME] + for modeling_file in modeling_files: + if os.path.isfile(os.path.join(checkpoint_folder, modeling_file)): + shutil.copy(os.path.join(checkpoint_folder, modeling_file), os.path.join(output_dir, modeling_file)) + # Saving the tokenizer is fast and we don't know how many files it may have spawned, so we resave it to be sure. + if self.tokenizer is not None: + self.tokenizer.save_pretrained(output_dir) + # Same for the training arguments + torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) + + try: + if self.args.hub_strategy == HubStrategy.CHECKPOINT: + # Temporarily move the checkpoint just saved for the push + tmp_checkpoint = os.path.join(output_dir, "last-checkpoint") + # We have to remove the "last-checkpoint" dir if it exists, otherwise the checkpoint is moved as a + # subfolder. + if os.path.isdir(tmp_checkpoint): + shutil.rmtree(tmp_checkpoint) + shutil.move(checkpoint_folder, tmp_checkpoint) + + if self.args.save_strategy == IntervalStrategy.STEPS: + commit_message = f"Training in progress, step {self.state.global_step}" + else: + commit_message = f"Training in progress, epoch {int(self.state.epoch)}" + _, self.push_in_progress = self.repo.push_to_hub( + commit_message=commit_message, blocking=False, auto_lfs_prune=True + ) + finally: + if self.args.hub_strategy == HubStrategy.CHECKPOINT: + # Move back the checkpoint to its place + shutil.move(tmp_checkpoint, checkpoint_folder) + + def push_to_hub(self, commit_message: Optional[str] = "End of training", blocking: bool = True, **kwargs) -> str: + """ + Upload *self.model* and *self.tokenizer* to the 🤗 model hub on the repo *self.args.hub_model_id*. + + Parameters: + commit_message (`str`, *optional*, defaults to `"End of training"`): + Message to commit while pushing. + blocking (`bool`, *optional*, defaults to `True`): + Whether the function should return only when the `git push` has finished. + kwargs: + Additional keyword arguments passed along to [`~Trainer.create_model_card`]. + + Returns: + The url of the commit of your model in the given repository if `blocking=False`, a tuple with the url of + the commit and an object to track the progress of the commit if `blocking=True` + """ + # If a user calls manually `push_to_hub` with `self.args.push_to_hub = False`, we try to create the repo but + # it might fail. + if not hasattr(self, "repo"): + self.init_git_repo() + + model_name = kwargs.pop("model_name", None) + if model_name is None and self.args.should_save: + if self.args.hub_model_id is None: + model_name = Path(self.args.output_dir).name + else: + model_name = self.args.hub_model_id.split("/")[-1] + + # Needs to be executed on all processes for TPU training, but will only save on the processed determined by + # self.args.should_save. + self.save_model(_internal_call=True) + + # Only push from one node. + if not self.is_world_process_zero(): + return + + # Cancel any async push in progress if blocking=True. The commits will all be pushed together. 
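+        # (Only the background upload is cancelled; the local commits are pushed again by the blocking call below.)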
+ if blocking and self.push_in_progress is not None and not self.push_in_progress.is_done: + self.push_in_progress._process.kill() + self.push_in_progress = None + + git_head_commit_url = self.repo.push_to_hub( + commit_message=commit_message, blocking=blocking, auto_lfs_prune=True + ) + # push separately the model card to be independant from the rest of the model + if self.args.should_save: + self.create_model_card(model_name=model_name, **kwargs) + try: + self.repo.push_to_hub( + commit_message="update model card README.md", blocking=blocking, auto_lfs_prune=True + ) + except EnvironmentError as exc: + logger.error(f"Error pushing update to the model card. Please read logs and retry.\n${exc}") + + return git_head_commit_url + + # + # Deprecated code + # + + def prediction_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + + Works both with or without labels. + """ + args = self.args + + if not has_length(dataloader): + raise ValueError("dataloader must implement a working __len__") + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + # if eval is called w/o train init deepspeed here + if args.deepspeed and not self.deepspeed: + # XXX: eval doesn't have `resume_from_checkpoint` arg but we should be able to do eval + # from the checkpoint eventually + deepspeed_engine, _, _ = deepspeed_init(self, num_training_steps=0, resume_from_checkpoint=None) + self.model = deepspeed_engine.module + self.model_wrapped = deepspeed_engine + self.deepspeed = deepspeed_engine + # XXX: we don't need optim/sched for inference, but this needs to be sorted out, since + # for example the Z3-optimizer is a must for zero3 to work even for inference - what we + # don't need is the deepspeed basic optimizer which is self.optimizer.optimizer + deepspeed_engine.optimizer.optimizer = None + deepspeed_engine.lr_scheduler = None + + model = self._wrap_model(self.model, training=False, dataloader=dataloader) + + # if full fp16 or bf16 eval is wanted and this ``evaluation`` or ``predict`` isn't called + # while ``train`` is running, cast it to the right dtype first and then put on device + if not self.is_in_train: + if args.fp16_full_eval: + model = model.to(dtype=torch.float16, device=args.device) + elif args.bf16_full_eval: + model = model.to(dtype=torch.bfloat16, device=args.device) + + batch_size = dataloader.batch_size + num_examples = self.num_examples(dataloader) + logger.info(f"***** Running {description} *****") + logger.info(f" Num examples = {num_examples}") + logger.info(f" Batch size = {batch_size}") + losses_host: torch.Tensor = None + preds_host: Union[torch.Tensor, List[torch.Tensor]] = None + labels_host: Union[torch.Tensor, List[torch.Tensor]] = None + inputs_host: Union[torch.Tensor, List[torch.Tensor]] = None + + world_size = max(1, args.world_size) + + eval_losses_gatherer = DistributedTensorGatherer(world_size, num_examples, make_multiple_of=batch_size) + if not prediction_loss_only: + # The actual number of eval_sample can be greater than num_examples in distributed settings (when we pass + # a batch size to the sampler) + make_multiple_of = None + if hasattr(dataloader, "sampler") and isinstance(dataloader.sampler, SequentialDistributedSampler): + make_multiple_of = 
dataloader.sampler.batch_size + preds_gatherer = DistributedTensorGatherer(world_size, num_examples, make_multiple_of=make_multiple_of) + labels_gatherer = DistributedTensorGatherer(world_size, num_examples, make_multiple_of=make_multiple_of) + inputs_gatherer = DistributedTensorGatherer(world_size, num_examples, make_multiple_of=make_multiple_of) + + model.eval() + + if is_torch_tpu_available(): + dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device) + + if args.past_index >= 0: + self._past = None + + self.callback_handler.eval_dataloader = dataloader + + for step, inputs in enumerate(dataloader): + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + inputs_decode = self._prepare_input(inputs["input_ids"]) if args.include_inputs_for_metrics else None + + if loss is not None: + losses = loss.repeat(batch_size) + losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0) + if logits is not None: + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + if labels is not None: + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if inputs_decode is not None: + inputs_host = ( + inputs_decode + if inputs_host is None + else nested_concat(inputs_host, inputs_decode, padding_index=-100) + ) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + + # Gather all tensors and put them back on the CPU if we have done enough accumulation steps. + if args.eval_accumulation_steps is not None and (step + 1) % args.eval_accumulation_steps == 0: + eval_losses_gatherer.add_arrays(self._gather_and_numpify(losses_host, "eval_losses")) + if not prediction_loss_only: + preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds")) + labels_gatherer.add_arrays(self._gather_and_numpify(labels_host, "eval_label_ids")) + inputs_gatherer.add_arrays(self._gather_and_numpify(inputs_host, "eval_inputs_ids")) + + # Set back to None to begin a new accumulation + losses_host, preds_host, labels_host, inputs_host = None, None, None, None + + if args.past_index and hasattr(self, "_past"): + # Clean the state at the end of the evaluation loop + delattr(self, "_past") + + # Gather all remaining tensors and put them back on the CPU + eval_losses_gatherer.add_arrays(self._gather_and_numpify(losses_host, "eval_losses")) + if not prediction_loss_only: + preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds")) + labels_gatherer.add_arrays(self._gather_and_numpify(labels_host, "eval_label_ids")) + inputs_gatherer.add_arrays(self._gather_and_numpify(inputs_host, "eval_inputs_ids")) + + eval_loss = eval_losses_gatherer.finalize() + preds = preds_gatherer.finalize() if not prediction_loss_only else None + label_ids = labels_gatherer.finalize() if not prediction_loss_only else None + inputs_ids = inputs_gatherer.finalize() if not prediction_loss_only else None + + if self.compute_metrics is not None and preds is not None and label_ids is not None: + if args.include_inputs_for_metrics: + metrics = self.compute_metrics( + EvalPrediction(predictions=preds, label_ids=label_ids, inputs=inputs_ids) + ) + else: + metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids)) + else: + metrics = {} + + # To be JSON-serializable, we need to remove numpy types or zero-d tensors + metrics = denumpify_detensorize(metrics) + + if 
eval_loss is not None: + metrics[f"{metric_key_prefix}_loss"] = eval_loss.mean().item() + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=preds, label_ids=label_ids, metrics=metrics, num_samples=num_examples) + + def _gather_and_numpify(self, tensors, name): + """ + Gather value of `tensors` (tensor or list/tuple of nested tensors) and convert them to numpy before + concatenating them to `gathered` + """ + if tensors is None: + return + if is_torch_tpu_available(): + tensors = nested_xla_mesh_reduce(tensors, name) + elif is_sagemaker_mp_enabled(): + tensors = smp_gather(tensors) + elif self.args.local_rank != -1: + tensors = distributed_concat(tensors) + + return nested_numpify(tensors) + + def _add_sm_patterns_to_gitignore(self) -> None: + """Add SageMaker Checkpointing patterns to .gitignore file.""" + # Make sure we only do this on the main process + if not self.is_world_process_zero(): + return + + patterns = ["*.sagemaker-uploading", "*.sagemaker-uploaded"] + + # Get current .gitignore content + if os.path.exists(os.path.join(self.repo.local_dir, ".gitignore")): + with open(os.path.join(self.repo.local_dir, ".gitignore"), "r") as f: + current_content = f.read() + else: + current_content = "" + + # Add the patterns to .gitignore + content = current_content + for pattern in patterns: + if pattern not in content: + if content.endswith("\n"): + content += pattern + else: + content += f"\n{pattern}" + + # Write the .gitignore file if it has changed + if content != current_content: + with open(os.path.join(self.repo.local_dir, ".gitignore"), "w") as f: + logger.debug(f"Writing .gitignore file. Content: {content}") + f.write(content) + + self.repo.git_add(".gitignore") + + # avoid race condition with git status + time.sleep(0.5) + + if not self.repo.is_repo_clean(): + self.repo.git_commit("Add *.sagemaker patterns to .gitignore.") + self.repo.git_push() diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer_pt_utils.py b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer_pt_utils.py new file mode 100644 index 0000000000..57b5c68f37 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/trainer_pt_utils.py @@ -0,0 +1,1126 @@ +# coding=utf-8 +# Copyright 2020-present the HuggingFace Inc. team. +# Copyright 2023 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Torch utilities for the Trainer class. 
+""" + +import datetime +import json +import math +import os +import sys +import warnings +from collections.abc import Mapping +from contextlib import contextmanager +from dataclasses import dataclass +from logging import StreamHandler +from typing import Any, Dict, Iterator, List, Optional, Union + +import numpy as np +import torch +import torch.distributed as dist +from torch import nn +from torch.utils.data import Dataset, IterableDataset, RandomSampler, Sampler +from torch.utils.data.distributed import DistributedSampler + +from .deepspeed import is_deepspeed_zero3_enabled +from .tokenization_utils_base import BatchEncoding +from .utils import is_sagemaker_mp_enabled, is_torch_tpu_available, is_training_run_on_sagemaker, logging + + +if is_training_run_on_sagemaker(): + logging.add_handler(StreamHandler(sys.stdout)) + +if is_torch_tpu_available(check_device=False): + import torch_xla.core.xla_model as xm + +# this is used to suppress an undesired warning emitted by pytorch versions 1.4.2-1.7.0 +try: + from torch.optim.lr_scheduler import SAVE_STATE_WARNING +except ImportError: + SAVE_STATE_WARNING = "" + +logger = logging.get_logger(__name__) + + +def atleast_1d(tensor_or_array: Union[torch.Tensor, np.ndarray]): + if isinstance(tensor_or_array, torch.Tensor): + if hasattr(torch, "atleast_1d"): + tensor_or_array = torch.atleast_1d(tensor_or_array) + elif tensor_or_array.ndim < 1: + tensor_or_array = tensor_or_array[None] + else: + tensor_or_array = np.atleast_1d(tensor_or_array) + return tensor_or_array + + +def torch_pad_and_concatenate(tensor1, tensor2, padding_index=-100): + """Concatenates `tensor1` and `tensor2` on first axis, applying padding on the second if necessary.""" + tensor1 = atleast_1d(tensor1) + tensor2 = atleast_1d(tensor2) + + if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]: + return torch.cat((tensor1, tensor2), dim=0) + + # Let's figure out the new shape + new_shape = (tensor1.shape[0] + tensor2.shape[0], max(tensor1.shape[1], tensor2.shape[1])) + tensor1.shape[2:] + + # Now let's fill the result tensor + result = tensor1.new_full(new_shape, padding_index) + result[: tensor1.shape[0], : tensor1.shape[1]] = tensor1 + result[tensor1.shape[0] :, : tensor2.shape[1]] = tensor2 + return result + + +def numpy_pad_and_concatenate(array1, array2, padding_index=-100): + """Concatenates `array1` and `array2` on first axis, applying padding on the second if necessary.""" + array1 = atleast_1d(array1) + array2 = atleast_1d(array2) + + if len(array1.shape) == 1 or array1.shape[1] == array2.shape[1]: + return np.concatenate((array1, array2), axis=0) + + # Let's figure out the new shape + new_shape = (array1.shape[0] + array2.shape[0], max(array1.shape[1], array2.shape[1])) + array1.shape[2:] + + # Now let's fill the result tensor + result = np.full_like(array1, padding_index, shape=new_shape) + result[: array1.shape[0], : array1.shape[1]] = array1 + result[array1.shape[0] :, : array2.shape[1]] = array2 + return result + + +def nested_concat(tensors, new_tensors, padding_index=-100): + """ + Concat the `new_tensors` to `tensors` on the first dim and pad them on the second if needed. Works for tensors or + nested list/tuples/dict of tensors. + """ + assert type(tensors) == type( + new_tensors + ), f"Expected `tensors` and `new_tensors` to have the same type but found {type(tensors)} and {type(new_tensors)}." 
+ if isinstance(tensors, (list, tuple)): + return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors)) + elif isinstance(tensors, torch.Tensor): + return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index) + elif isinstance(tensors, Mapping): + return type(tensors)( + {k: nested_concat(t, new_tensors[k], padding_index=padding_index) for k, t in tensors.items()} + ) + elif isinstance(tensors, np.ndarray): + return numpy_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index) + else: + raise TypeError(f"Unsupported type for concatenation: got {type(tensors)}") + + +def find_batch_size(tensors): + """ + Find the first dimension of a tensor in a nested list/tuple/dict of tensors. + """ + if isinstance(tensors, (list, tuple)): + for t in tensors: + result = find_batch_size(t) + if result is not None: + return result + elif isinstance(tensors, Mapping): + for key, value in tensors.items(): + result = find_batch_size(value) + if result is not None: + return result + elif isinstance(tensors, torch.Tensor): + return tensors.shape[0] if len(tensors.shape) >= 1 else None + elif isinstance(tensors, np.ndarray): + return tensors.shape[0] if len(tensors.shape) >= 1 else None + + +def nested_numpify(tensors): + "Numpify `tensors` (even if it's a nested list/tuple/dict of tensors)." + if isinstance(tensors, (list, tuple)): + return type(tensors)(nested_numpify(t) for t in tensors) + if isinstance(tensors, Mapping): + return type(tensors)({k: nested_numpify(t) for k, t in tensors.items()}) + + t = tensors.cpu() + if t.dtype == torch.bfloat16: + # As of Numpy 1.21.4, NumPy does not support bfloat16 (see + # https://github.com/numpy/numpy/blob/a47ecdea856986cd60eabbd53265c2ca5916ad5d/doc/source/user/basics.types.rst ). + # Until Numpy adds bfloat16, we must convert float32. + t = t.to(torch.float32) + return t.numpy() + + +def nested_detach(tensors): + "Detach `tensors` (even if it's a nested list/tuple/dict of tensors)." 
+ if isinstance(tensors, (list, tuple)): + return type(tensors)(nested_detach(t) for t in tensors) + elif isinstance(tensors, Mapping): + return type(tensors)({k: nested_detach(t) for k, t in tensors.items()}) + return tensors.detach() + + +def nested_xla_mesh_reduce(tensors, name): + if is_torch_tpu_available(): + import torch_xla.core.xla_model as xm + + if isinstance(tensors, (list, tuple)): + return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors)) + if isinstance(tensors, Mapping): + return type(tensors)( + {k: nested_xla_mesh_reduce(t, f"{name}_{i}") for i, (k, t) in enumerate(tensors.items())} + ) + + tensors = atleast_1d(tensors) + return xm.mesh_reduce(name, tensors, torch.cat) + else: + raise ImportError("Torch xla must be installed to use `nested_xla_mesh_reduce`") + + +def distributed_concat(tensor: Any, num_total_examples: Optional[int] = None) -> Any: + try: + if isinstance(tensor, (tuple, list)): + return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor) + if isinstance(tensor, Mapping): + return type(tensor)({k: distributed_concat(t, num_total_examples) for k, t in tensor.items()}) + tensor = atleast_1d(tensor).contiguous() + output_tensors = [tensor.clone() for _ in range(dist.get_world_size())] + dist.all_gather(output_tensors, tensor) + torch.cuda.synchronize() + concat = torch.cat(output_tensors, dim=0) + + # truncate the dummy elements added by SequentialDistributedSampler + if num_total_examples is not None: + concat = concat[:num_total_examples] + return concat + except AssertionError: + raise AssertionError("Not currently using distributed training") + + +def distributed_broadcast_scalars( + scalars: List[Union[int, float]], + num_total_examples: Optional[int] = None, + device: Optional[torch.device] = torch.device("cuda"), +) -> torch.Tensor: + try: + tensorized_scalar = torch.tensor(scalars).to(device) + output_tensors = [tensorized_scalar.clone() for _ in range(dist.get_world_size())] + dist.all_gather(output_tensors, tensorized_scalar) + concat = torch.cat(output_tensors, dim=0) + + # truncate the dummy elements added by SequentialDistributedSampler + if num_total_examples is not None: + concat = concat[:num_total_examples] + return concat + except AssertionError: + raise AssertionError("Not currently using distributed training") + + +def reissue_pt_warnings(caught_warnings): + # Reissue warnings that are not the SAVE_STATE_WARNING + if len(caught_warnings) > 1: + for w in caught_warnings: + if w.category != UserWarning or w.message != SAVE_STATE_WARNING: + warnings.warn(w.message, w.category) + + +@contextmanager +def torch_distributed_zero_first(local_rank: int): + """ + Decorator to make all processes in distributed training wait for each local_master to do something. + + Args: + local_rank (`int`): The rank of the local process. + """ + if local_rank not in [-1, 0]: + dist.barrier() + yield + if local_rank == 0: + dist.barrier() + + +class DistributedSamplerWithLoop(DistributedSampler): + """ + Like a torch.utils.data.distributed.DistributedSampler` but loops at the end back to the beginning of the shuffled + samples to make each process have a round multiple of batch_size samples. + + Args: + dataset (`torch.utils.data.Dataset`): + Dataset used for sampling. + batch_size (`int`): + The batch size used with this sampler + kwargs: + All other keyword arguments passed to `DistributedSampler`. 
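+
+    Example (a minimal sketch; `dataset` stands in for any map-style dataset with, say, 10 items and is not
+    part of this file):
+
+    ```python
+    # With 2 processes and batch_size=4, DistributedSampler hands each rank 5 indices; this sampler then
+    # loops back to the start of the shuffled order so each rank yields 8 indices, a round multiple of 4.
+    sampler = DistributedSamplerWithLoop(dataset, batch_size=4, num_replicas=2, rank=0, shuffle=False)
+    assert len(list(iter(sampler))) % 4 == 0
+    ```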
+ """ + + def __init__(self, dataset, batch_size, **kwargs): + super().__init__(dataset, **kwargs) + self.batch_size = batch_size + + def __iter__(self): + indices = list(super().__iter__()) + remainder = 0 if len(indices) % self.batch_size == 0 else self.batch_size - len(indices) % self.batch_size + # DistributedSampler already added samples from the beginning to make the number of samples a round multiple + # of the world size, so we skip those. + start_remainder = 1 if self.rank < len(self.dataset) % self.num_replicas else 0 + indices += indices[start_remainder : start_remainder + remainder] + return iter(indices) + + +class SequentialDistributedSampler(Sampler): + """ + Distributed Sampler that subsamples indices sequentially, making it easier to collate all results at the end. + + Even though we only use this sampler for eval and predict (no training), which means that the model params won't + have to be synced (i.e. will not hang for synchronization even if varied number of forward passes), we still add + extra samples to the sampler to make it evenly divisible (like in `DistributedSampler`) to make it easy to `gather` + or `reduce` resulting tensors at the end of the loop. + """ + + def __init__(self, dataset, num_replicas=None, rank=None, batch_size=None): + warnings.warn( + "SequentialDistributedSampler is deprecated and will be removed in v5 of Transformers.", + FutureWarning, + ) + if num_replicas is None: + if not dist.is_available(): + raise RuntimeError("Requires distributed package to be available") + num_replicas = dist.get_world_size() + if rank is None: + if not dist.is_available(): + raise RuntimeError("Requires distributed package to be available") + rank = dist.get_rank() + self.dataset = dataset + self.num_replicas = num_replicas + self.rank = rank + num_samples = len(self.dataset) + # Add extra samples to make num_samples a multiple of batch_size if passed + if batch_size is not None: + self.num_samples = int(math.ceil(num_samples / (batch_size * num_replicas))) * batch_size + else: + self.num_samples = int(math.ceil(num_samples / num_replicas)) + self.total_size = self.num_samples * self.num_replicas + self.batch_size = batch_size + + def __iter__(self): + indices = list(range(len(self.dataset))) + + # add extra samples to make it evenly divisible + indices += indices[: (self.total_size - len(indices))] + assert ( + len(indices) == self.total_size + ), f"Indices length {len(indices)} and total size {self.total_size} mismatched" + + # subsample + indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples] + assert ( + len(indices) == self.num_samples + ), f"Indices length {len(indices)} and sample number {self.num_samples} mismatched" + + return iter(indices) + + def __len__(self): + return self.num_samples + + +def get_tpu_sampler(dataset: torch.utils.data.Dataset, batch_size: int): + if xm.xrt_world_size() <= 1: + return RandomSampler(dataset) + return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()) + + +def nested_new_like(arrays, num_samples, padding_index=-100): + """Create the same nested structure as `arrays` with a first dimension always at `num_samples`.""" + if isinstance(arrays, (list, tuple)): + return type(arrays)(nested_new_like(x, num_samples) for x in arrays) + return np.full_like(arrays, padding_index, shape=(num_samples, *arrays.shape[1:])) + + +def expand_like(arrays, new_seq_length, padding_index=-100): + """Expand the `arrays` so that the second dimension grows to `new_seq_length`. 
Uses `padding_index` for padding.""" + result = np.full_like(arrays, padding_index, shape=(arrays.shape[0], new_seq_length) + arrays.shape[2:]) + result[:, : arrays.shape[1]] = arrays + return result + + +def nested_truncate(tensors, limit): + "Truncate `tensors` at `limit` (even if it's a nested list/tuple/dict of tensors)." + if isinstance(tensors, (list, tuple)): + return type(tensors)(nested_truncate(t, limit) for t in tensors) + if isinstance(tensors, Mapping): + return type(tensors)({k: nested_truncate(t, limit) for k, t in tensors.items()}) + + return tensors[:limit] + + +class DistributedTensorGatherer: + """ + A class responsible for properly gathering tensors (or nested list/tuple of tensors) on the CPU by chunks. + + If our dataset has 16 samples with a batch size of 2 on 3 processes and we gather then transfer on CPU at every + step, our sampler will generate the following indices: + + `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1]` + + to get something of size a multiple of 3 (so that each process gets the same dataset length). Then process 0, 1 and + 2 will be responsible of making predictions for the following samples: + + - P0: `[0, 1, 2, 3, 4, 5]` + - P1: `[6, 7, 8, 9, 10, 11]` + - P2: `[12, 13, 14, 15, 0, 1]` + + The first batch treated on each process will be + + - P0: `[0, 1]` + - P1: `[6, 7]` + - P2: `[12, 13]` + + So if we gather at the end of the first batch, we will get a tensor (nested list/tuple of tensor) corresponding to + the following indices: + + `[0, 1, 6, 7, 12, 13]` + + If we directly concatenate our results without taking any precautions, the user will then get the predictions for + the indices in this order at the end of the prediction loop: + + `[0, 1, 6, 7, 12, 13, 2, 3, 8, 9, 14, 15, 4, 5, 10, 11, 0, 1]` + + For some reason, that's not going to roll their boat. This class is there to solve that problem. + + Args: + world_size (`int`): + The number of processes used in the distributed training. + num_samples (`int`): + The number of samples in our dataset. + make_multiple_of (`int`, *optional*): + If passed, the class assumes the datasets passed to each process are made to be a multiple of this argument + (by adding samples). + padding_index (`int`, *optional*, defaults to -100): + The padding index to use if the arrays don't all have the same sequence length. + """ + + def __init__(self, world_size, num_samples, make_multiple_of=None, padding_index=-100): + warnings.warn( + "DistributedTensorGatherer is deprecated and will be removed in v5 of Transformers.", + FutureWarning, + ) + self.world_size = world_size + self.num_samples = num_samples + total_size = world_size if make_multiple_of is None else world_size * make_multiple_of + self.total_samples = int(np.ceil(num_samples / total_size)) * total_size + self.process_length = self.total_samples // world_size + self._storage = None + self._offsets = None + self.padding_index = padding_index + + def add_arrays(self, arrays): + """ + Add `arrays` to the internal storage, Will initialize the storage to the full size at the first arrays passed + so that if we're bound to get an OOM, it happens at the beginning. 
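+
+        Example (an illustrative sketch of the gather loop, similar to the one in `Trainer.prediction_loop`;
+        `per_step_predictions` is a placeholder for per-step numpy outputs whose first dimension is a
+        multiple of `world_size`):
+
+        ```python
+        gatherer = DistributedTensorGatherer(world_size=3, num_samples=16, make_multiple_of=2)
+        for logits in per_step_predictions:
+            gatherer.add_arrays(logits)
+        preds = gatherer.finalize()  # truncated back to the 16 real samples, in dataset order
+        ```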
+ """ + if arrays is None: + return + if self._storage is None: + self._storage = nested_new_like(arrays, self.total_samples, padding_index=self.padding_index) + self._offsets = list(range(0, self.total_samples, self.process_length)) + + slice_len, self._storage = self._nested_set_tensors(self._storage, arrays) + for i in range(self.world_size): + self._offsets[i] += slice_len + + def _nested_set_tensors(self, storage, arrays): + if isinstance(arrays, (list, tuple)): + result = [self._nested_set_tensors(x, y) for x, y in zip(storage, arrays)] + return result[0][0], type(arrays)(r[1] for r in result) + assert ( + arrays.shape[0] % self.world_size == 0 + ), f"Arrays passed should all have a first dimension multiple of {self.world_size}, found {arrays.shape[0]}." + + slice_len = arrays.shape[0] // self.world_size + for i in range(self.world_size): + if len(arrays.shape) == 1: + storage[self._offsets[i] : self._offsets[i] + slice_len] = arrays[i * slice_len : (i + 1) * slice_len] + else: + # Expand the array on the fly if needed. + if len(storage.shape) > 1 and storage.shape[1] < arrays.shape[1]: + storage = expand_like(storage, arrays.shape[1], padding_index=self.padding_index) + storage[self._offsets[i] : self._offsets[i] + slice_len, : arrays.shape[1]] = arrays[ + i * slice_len : (i + 1) * slice_len + ] + return slice_len, storage + + def finalize(self): + """ + Return the properly gathered arrays and truncate to the number of samples (since the sampler added some extras + to get each process a dataset of the same length). + """ + if self._storage is None: + return + if self._offsets[0] != self.process_length: + logger.warning("Not all data has been set. Are you sure you passed all values?") + return nested_truncate(self._storage, self.num_samples) + + +@dataclass +class LabelSmoother: + """ + Adds label-smoothing on a pre-computed output from a Transformers model. + + Args: + epsilon (`float`, *optional*, defaults to 0.1): + The label smoothing factor. + ignore_index (`int`, *optional*, defaults to -100): + The index in the labels to ignore when computing the loss. + """ + + epsilon: float = 0.1 + ignore_index: int = -100 + + def __call__(self, model_output, labels, shift_labels=False): + logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0] + if shift_labels: + logits = logits[..., :-1, :].contiguous() + labels = labels[..., 1:].contiguous() + + log_probs = -nn.functional.log_softmax(logits, dim=-1) + if labels.dim() == log_probs.dim() - 1: + labels = labels.unsqueeze(-1) + + padding_mask = labels.eq(self.ignore_index) + # In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask + # will ignore them in any case. + labels = torch.clamp(labels, min=0) + nll_loss = log_probs.gather(dim=-1, index=labels) + # works for fp16 input tensor too, by internally upcasting it to fp32 + smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32) + + nll_loss.masked_fill_(padding_mask, 0.0) + smoothed_loss.masked_fill_(padding_mask, 0.0) + + # Take the mean over the label dimensions, then divide by the number of active elements (i.e. 
not-padded): + num_active_elements = padding_mask.numel() - padding_mask.long().sum() + nll_loss = nll_loss.sum() / num_active_elements + smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1]) + return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss + + +def get_length_grouped_indices(lengths, batch_size, mega_batch_mult=None, generator=None): + """ + Return a list of indices so that each slice of `batch_size` consecutive indices correspond to elements of similar + lengths. To do this, the indices are: + + - randomly permuted + - grouped in mega-batches of size `mega_batch_mult * batch_size` + - sorted by length in each mega-batch + + The result is the concatenation of all mega-batches, with the batch of `batch_size` containing the element of + maximum length placed first, so that an OOM happens sooner rather than later. + """ + # Default for mega_batch_mult: 50 or the number to get 4 megabatches, whichever is smaller. + if mega_batch_mult is None: + mega_batch_mult = min(len(lengths) // (batch_size * 4), 50) + # Just in case, for tiny datasets + if mega_batch_mult == 0: + mega_batch_mult = 1 + + # We need to use torch for the random part as a distributed sampler will set the random seed for torch. + indices = torch.randperm(len(lengths), generator=generator) + megabatch_size = mega_batch_mult * batch_size + megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)] + megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches] + + # The rest is to get the biggest batch first. + # Since each megabatch is sorted by descending length, the longest element is the first + megabatch_maximums = [lengths[megabatch[0]] for megabatch in megabatches] + max_idx = torch.argmax(torch.tensor(megabatch_maximums)).item() + # Switch to put the longest element in first position + megabatches[0][0], megabatches[max_idx][0] = megabatches[max_idx][0], megabatches[0][0] + + return [i for megabatch in megabatches for i in megabatch] + + +class LengthGroupedSampler(Sampler): + r""" + Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while + keeping a bit of randomness. + """ + + def __init__( + self, + batch_size: int, + dataset: Optional[Dataset] = None, + lengths: Optional[List[int]] = None, + model_input_name: Optional[str] = None, + generator=None, + ): + if dataset is None and lengths is None: + raise ValueError("One of dataset and lengths must be provided.") + + self.batch_size = batch_size + if lengths is None: + model_input_name = model_input_name if model_input_name is not None else "input_ids" + if ( + not (isinstance(dataset[0], dict) or isinstance(dataset[0], BatchEncoding)) + or model_input_name not in dataset[0] + ): + raise ValueError( + "Can only automatically infer lengths for datasets whose items are dictionaries with an " + f"'{model_input_name}' key." + ) + lengths = [len(feature[model_input_name]) for feature in dataset] + elif isinstance(lengths, torch.Tensor): + logger.info( + "If lengths is a torch.Tensor, LengthGroupedSampler will be slow. Converting lengths to List[int]..." 
+ ) + lengths = lengths.tolist() + + self.lengths = lengths + self.generator = generator + + def __len__(self): + return len(self.lengths) + + def __iter__(self): + indices = get_length_grouped_indices(self.lengths, self.batch_size, generator=self.generator) + return iter(indices) + + +class DistributedLengthGroupedSampler(DistributedSampler): + r""" + Distributed Sampler that samples indices in a way that groups together features of the dataset of roughly the same + length while keeping a bit of randomness. + """ + + # Copied and adapted from PyTorch DistributedSampler. + def __init__( + self, + batch_size: int, + dataset: Optional[Dataset] = None, + num_replicas: Optional[int] = None, + rank: Optional[int] = None, + seed: int = 0, + drop_last: bool = False, + lengths: Optional[List[int]] = None, + model_input_name: Optional[str] = None, + ): + if dataset is None and lengths is None: + raise ValueError("One of dataset and lengths must be provided.") + if num_replicas is None: + if not dist.is_available(): + raise RuntimeError("Requires distributed package to be available") + num_replicas = dist.get_world_size() + if rank is None: + if not dist.is_available(): + raise RuntimeError("Requires distributed package to be available") + rank = dist.get_rank() + + self.batch_size = batch_size + self.num_replicas = num_replicas + self.rank = rank + self.epoch = 0 + self.drop_last = drop_last + + if lengths is None: + model_input_name = model_input_name if model_input_name is not None else "input_ids" + if ( + not (isinstance(dataset[0], dict) or isinstance(dataset[0], BatchEncoding)) + or model_input_name not in dataset[0] + ): + raise ValueError( + "Can only automatically infer lengths for datasets whose items are dictionaries with an " + f"'{model_input_name}' key." + ) + lengths = [len(feature[model_input_name]) for feature in dataset] + elif isinstance(lengths, torch.Tensor): + logger.info( + "If lengths is a torch.Tensor, DistributedLengthGroupedSampler will be slow. Converting lengths to" + " List[int]..." + ) + lengths = lengths.tolist() + + self.lengths = lengths + + # If the dataset length is evenly divisible by # of replicas, then there + # is no need to drop any data, since the dataset will be split equally. + if self.drop_last and len(self.lengths) % self.num_replicas != 0: + # Split to nearest available length that is evenly divisible. + # This is to ensure each rank receives the same amount of data when + # using this Sampler. + self.num_samples = math.ceil((len(self.lengths) - self.num_replicas) / self.num_replicas) + else: + self.num_samples = math.ceil(len(self.lengths) / self.num_replicas) + self.total_size = self.num_samples * self.num_replicas + self.seed = seed + + def __iter__(self) -> Iterator: + # Deterministically shuffle based on epoch and seed + g = torch.Generator() + g.manual_seed(self.seed + self.epoch) + indices = get_length_grouped_indices(self.lengths, self.batch_size, generator=g) + + if not self.drop_last: + # add extra samples to make it evenly divisible + indices += indices[: (self.total_size - len(indices))] + else: + # remove tail of data to make it evenly divisible. + indices = indices[: self.total_size] + assert len(indices) == self.total_size + + # subsample + indices = indices[self.rank : self.total_size : self.num_replicas] + assert len(indices) == self.num_samples + + return iter(indices) + + +class ShardSampler(Sampler): + """ + Sampler that shards batches between several processes. 
Dispatches indices batch by batch: on 2 processes with batch + size 4, the first two batches are `[0, 1, 2, 3, 4, 5, 6, 7]` and `[8, 9, 10, 11, 12, 13, 14, 15]`, which shard into + `[0, 1, 2, 3]` and `[8, 9, 10, 11]` for GPU-0 and `[4, 5, 6, 7]` and `[12, 13, 14, 15]` for GPU-1. + + The sampler thus yields `[0, 1, 2, 3, 8, 9, 10, 11]` on GPU-0 and `[4, 5, 6, 7, 12, 13, 14, 15]` on GPU-1. + """ + + def __init__( + self, + dataset: Dataset, + batch_size: int = 1, + drop_last: bool = False, + num_processes: int = 1, + process_index: int = 0, + ): + self.dataset = dataset + self.batch_size = batch_size + self.drop_last = drop_last + self.num_processes = num_processes + self.process_index = process_index + + self.total_batch_size = total_batch_size = batch_size * num_processes + + num_batches = len(dataset) // total_batch_size if drop_last else math.ceil(len(dataset) / total_batch_size) + self.total_num_samples = num_batches * total_batch_size + + def __iter__(self): + indices = list(range(len(self.dataset))) + + # Add extra samples to make it evenly divisible. While loop is there in the edge case we have a tiny dataset + # and it needs to be done several times. + while len(indices) < self.total_num_samples: + indices += indices[: (self.total_num_samples - len(indices))] + + result = [] + for batch_start in range(self.batch_size * self.process_index, self.total_num_samples, self.total_batch_size): + result += indices[batch_start : batch_start + self.batch_size] + + return iter(result) + + def __len__(self): + # Each shard only sees a fraction of total_num_samples. + return self.total_num_samples // self.num_processes + + +class IterableDatasetShard(IterableDataset): + """ + Wraps a PyTorch `IterableDataset` to generate samples for one of the processes only. Instances of this class will + always yield a number of samples that is a round multiple of the actual batch size (which is `batch_size x + num_processes`). Depending on the value of the `drop_last` attribute, it will either stop the iteration at the + first batch that would be too small or loop with indices from the beginning. + + On two processes with an iterable dataset yielding of `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]` with a batch size of + 2: + + - the shard on process 0 will yield `[0, 1, 4, 5, 8, 9]` so will see batches `[0, 1]`, `[4, 5]`, `[8, 9]` + - the shard on process 1 will yield `[2, 3, 6, 7, 10, 11]` so will see batches `[2, 3]`, `[6, 7]`, `[10, 11]` + + + + If your IterableDataset implements some randomization that needs to be applied the same way on all processes + (for instance, a shuffling), you should use a `torch.Generator` in a `generator` attribute of the `dataset` to + generate your random numbers and call the [`~trainer_pt_utils.IterableDatasetShard.set_epoch`] method of this + object. It will set the seed of this `generator` to `seed + epoch` on all processes before starting the + iteration. Alternatively, you can also implement a `set_epoch()` method in your iterable dataset to deal with + this. + + + + Args: + dataset (`torch.utils.data.IterableDataset`): + The batch sampler to split in several shards. + batch_size (`int`, *optional*, defaults to 1): + The size of the batches per shard. + drop_last (`bool`, *optional*, defaults to `False`): + Whether or not to drop the last incomplete batch or complete the last batches by using the samples from the + beginning. + num_processes (`int`, *optional*, defaults to 1): + The number of processes running concurrently. 
+ process_index (`int`, *optional*, defaults to 0): + The index of the current process. + seed (`int`, *optional*, defaults to 0): + A random seed that will be used for the random number generation in + [`~trainer_pt_utils.IterableDatasetShard.set_epoch`]. + """ + + def __init__( + self, + dataset: IterableDataset, + batch_size: int = 1, + drop_last: bool = False, + num_processes: int = 1, + process_index: int = 0, + seed: int = 0, + ): + self.dataset = dataset + self.batch_size = batch_size + self.drop_last = drop_last + self.num_processes = num_processes + self.process_index = process_index + self.seed = seed + self.epoch = 0 + self.num_examples = 0 + + def set_epoch(self, epoch): + self.epoch = epoch + if hasattr(self.dataset, "set_epoch"): + self.dataset.set_epoch(epoch) + + def __iter__(self): + self.num_examples = 0 + if ( + not hasattr(self.dataset, "set_epoch") + and hasattr(self.dataset, "generator") + and isinstance(self.dataset.generator, torch.Generator) + ): + self.dataset.generator.manual_seed(self.seed + self.epoch) + real_batch_size = self.batch_size * self.num_processes + process_slice = range(self.process_index * self.batch_size, (self.process_index + 1) * self.batch_size) + + first_batch = None + current_batch = [] + for element in self.dataset: + self.num_examples += 1 + current_batch.append(element) + # Wait to have a full batch before yielding elements. + if len(current_batch) == real_batch_size: + for i in process_slice: + yield current_batch[i] + if first_batch is None: + first_batch = current_batch.copy() + current_batch = [] + + # Finished if drop_last is True, otherwise complete the last batch with elements from the beginning. + if not self.drop_last and len(current_batch) > 0: + if first_batch is None: + first_batch = current_batch.copy() + while len(current_batch) < real_batch_size: + current_batch += first_batch + for i in process_slice: + yield current_batch[i] + + def __len__(self): + # Will raise an error if the underlying dataset is not sized. 
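+        # The reported length is per shard: the number of global (batch_size * num_processes)-sized batches
+        # (dropped or completed according to drop_last), times the batch_size samples this process keeps
+        # from each of them.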
+ if self.drop_last: + return (len(self.dataset) // (self.batch_size * self.num_processes)) * self.batch_size + else: + return math.ceil(len(self.dataset) / (self.batch_size * self.num_processes)) * self.batch_size + + +# In order to keep `trainer.py` compact and easy to understand, place any secondary PT Trainer +# helper methods here + + +def _get_learning_rate(self): + if self.deepspeed: + # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may + # not run for the first few dozen steps while loss scale is too large, and thus during + # that time `get_last_lr` will fail if called during that warm up stage, so work around it: + try: + last_lr = self.lr_scheduler.get_last_lr()[0] + except AssertionError as e: + if "need to call step" in str(e): + logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0") + last_lr = 0 + else: + raise + else: + last_lr = self.lr_scheduler.get_last_lr()[0] + if torch.is_tensor(last_lr): + last_lr = last_lr.item() + return last_lr + + +def _secs2timedelta(secs): + """ + convert seconds to hh:mm:ss.msec, msecs rounded to 2 decimals + """ + + msec = int(abs(secs - int(secs)) * 100) + return f"{datetime.timedelta(seconds=int(secs))}.{msec:02d}" + + +def metrics_format(self, metrics: Dict[str, float]) -> Dict[str, float]: + """ + Reformat Trainer metrics values to a human-readable format + + Args: + metrics (`Dict[str, float]`): + The metrics returned from train/evaluate/predict + + Returns: + metrics (`Dict[str, float]`): The reformatted metrics + """ + + metrics_copy = metrics.copy() + for k, v in metrics_copy.items(): + if "_mem_" in k: + metrics_copy[k] = f"{ v >> 20 }MB" + elif "_runtime" in k: + metrics_copy[k] = _secs2timedelta(v) + elif k == "total_flos": + metrics_copy[k] = f"{ int(v) >> 30 }GF" + elif type(metrics_copy[k]) == float: + metrics_copy[k] = round(v, 4) + + return metrics_copy + + +def log_metrics(self, split, metrics): + """ + Log metrics in a specially formatted way + + Under distributed environment this is done only for a process with rank 0. + + Args: + split (`str`): + Mode/split name: one of `train`, `eval`, `test` + metrics (`Dict[str, float]`): + The metrics returned from train/evaluate/predictmetrics: metrics dict + + Notes on memory reports: + + In order to get memory usage report you need to install `psutil`. You can do that with `pip install psutil`. + + Now when this method is run, you will see a report that will include: : + + ``` + init_mem_cpu_alloc_delta = 1301MB + init_mem_cpu_peaked_delta = 154MB + init_mem_gpu_alloc_delta = 230MB + init_mem_gpu_peaked_delta = 0MB + train_mem_cpu_alloc_delta = 1345MB + train_mem_cpu_peaked_delta = 0MB + train_mem_gpu_alloc_delta = 693MB + train_mem_gpu_peaked_delta = 7MB + ``` + + **Understanding the reports:** + + - the first segment, e.g., `train__`, tells you which stage the metrics are for. Reports starting with `init_` + will be added to the first stage that gets run. So that if only evaluation is run, the memory usage for the + `__init__` will be reported along with the `eval_` metrics. + - the third segment, is either `cpu` or `gpu`, tells you whether it's the general RAM or the gpu0 memory + metric. + - `*_alloc_delta` - is the difference in the used/allocated memory counter between the end and the start of the + stage - it can be negative if a function released more memory than it allocated. 
+ - `*_peaked_delta` - is any extra memory that was consumed and then freed - relative to the current allocated + memory counter - it is never negative. When you look at the metrics of any stage you add up `alloc_delta` + + `peaked_delta` and you know how much memory was needed to complete that stage. + + The reporting happens only for process of rank 0 and gpu 0 (if there is a gpu). Typically this is enough since the + main process does the bulk of work, but it could be not quite so if model parallel is used and then other GPUs may + use a different amount of gpu memory. This is also not the same under DataParallel where gpu0 may require much more + memory than the rest since it stores the gradient and optimizer states for all participating GPUS. Perhaps in the + future these reports will evolve to measure those too. + + The CPU RAM metric measures RSS (Resident Set Size) includes both the memory which is unique to the process and the + memory shared with other processes. It is important to note that it does not include swapped out memory, so the + reports could be imprecise. + + The CPU peak memory is measured using a sampling thread. Due to python's GIL it may miss some of the peak memory if + that thread didn't get a chance to run when the highest memory was used. Therefore this report can be less than + reality. Using `tracemalloc` would have reported the exact peak memory, but it doesn't report memory allocations + outside of python. So if some C++ CUDA extension allocated its own memory it won't be reported. And therefore it + was dropped in favor of the memory sampling approach, which reads the current process memory usage. + + The GPU allocated and peak memory reporting is done with `torch.cuda.memory_allocated()` and + `torch.cuda.max_memory_allocated()`. This metric reports only "deltas" for pytorch-specific allocations, as + `torch.cuda` memory management system doesn't track any memory allocated outside of pytorch. For example, the very + first cuda call typically loads CUDA kernels, which may take from 0.5 to 2GB of GPU memory. + + Note that this tracker doesn't account for memory allocations outside of [`Trainer`]'s `__init__`, `train`, + `evaluate` and `predict` calls. + + Because `evaluation` calls may happen during `train`, we can't handle nested invocations because + `torch.cuda.max_memory_allocated` is a single counter, so if it gets reset by a nested eval call, `train`'s tracker + will report incorrect info. If this [pytorch issue](https://github.com/pytorch/pytorch/issues/16266) gets resolved + it will be possible to change this class to be re-entrant. Until then we will only track the outer level of + `train`, `evaluate` and `predict` methods. Which means that if `eval` is called during `train`, it's the latter + that will account for its memory usage and that of the former. + + This also means that if any other tool that is used along the [`Trainer`] calls + `torch.cuda.reset_peak_memory_stats`, the gpu peak memory stats could be invalid. And the [`Trainer`] will disrupt + the normal behavior of any such tools that rely on calling `torch.cuda.reset_peak_memory_stats` themselves. + + For best performance you may want to consider turning the memory profiling off for production runs. 
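+
+    A typical call, for reference (illustrative; `train_result` is simply whatever `trainer.train()` returned
+    in the calling script):
+
+    ```python
+    train_result = trainer.train()
+    trainer.log_metrics("train", train_result.metrics)
+    ```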
+ """ + if not self.is_world_process_zero(): + return + + print(f"***** {split} metrics *****") + metrics_formatted = self.metrics_format(metrics) + k_width = max(len(str(x)) for x in metrics_formatted.keys()) + v_width = max(len(str(x)) for x in metrics_formatted.values()) + for key in sorted(metrics_formatted.keys()): + print(f" {key: <{k_width}} = {metrics_formatted[key]:>{v_width}}") + + +def save_metrics(self, split, metrics, combined=True): + """ + Save metrics into a json file for that split, e.g. `train_results.json`. + + Under distributed environment this is done only for a process with rank 0. + + Args: + split (`str`): + Mode/split name: one of `train`, `eval`, `test`, `all` + metrics (`Dict[str, float]`): + The metrics returned from train/evaluate/predict + combined (`bool`, *optional*, defaults to `True`): + Creates combined metrics by updating `all_results.json` with metrics of this call + + To understand the metrics please read the docstring of [`~Trainer.log_metrics`]. The only difference is that raw + unformatted numbers are saved in the current method. + + """ + if not self.is_world_process_zero(): + return + + path = os.path.join(self.args.output_dir, f"{split}_results.json") + with open(path, "w") as f: + json.dump(metrics, f, indent=4, sort_keys=True) + + if combined: + path = os.path.join(self.args.output_dir, "all_results.json") + if os.path.exists(path): + with open(path, "r") as f: + all_metrics = json.load(f) + else: + all_metrics = {} + + all_metrics.update(metrics) + with open(path, "w") as f: + json.dump(all_metrics, f, indent=4, sort_keys=True) + + +def save_state(self): + """ + Saves the Trainer state, since Trainer.save_model saves only the tokenizer with the model + + Under distributed environment this is done only for a process with rank 0. + """ + if not self.is_world_process_zero(): + return + + path = os.path.join(self.args.output_dir, "trainer_state.json") + self.state.save_to_json(path) + + +def get_model_param_count(model, trainable_only=False): + """ + Calculate model's total param count. If trainable_only is True then count only those requiring grads + """ + if is_deepspeed_zero3_enabled(): + + def numel(p): + return p.ds_numel + + else: + + def numel(p): + return p.numel() + + return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad) + + +def get_parameter_names(model, forbidden_layer_types): + """ + Returns the names of the model parameters that are not inside a forbidden layer. + """ + result = [] + for name, child in model.named_children(): + result += [ + f"{name}.{n}" + for n in get_parameter_names(child, forbidden_layer_types) + if not isinstance(child, tuple(forbidden_layer_types)) + ] + # Add model specific parameters (defined with nn.Parameter) since they are not in any child. + result += list(model._parameters.keys()) + return result + + +def get_module_class_from_name(module, name): + """ + Gets a class from a module by its name. + + Args: + module (`torch.nn.Module`): The module to get the class from. + name (`str`): The name of the class. 
+ """ + modules_children = list(module.children()) + if module.__class__.__name__ == name: + return module.__class__ + elif len(modules_children) == 0: + return + else: + for child_module in modules_children: + module_class = get_module_class_from_name(child_module, name) + if module_class is not None: + return module_class + + +if is_sagemaker_mp_enabled(): + import smdistributed.modelparallel.torch as smp + + @smp.step() + def smp_forward_backward(model, inputs, gradient_accumulation_steps=1): + outputs = model(**inputs) + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0] + loss /= gradient_accumulation_steps + model.backward(loss) + return loss + + @smp.step() + def smp_forward_only(model, inputs): + return model(**inputs) + + def smp_gather(tensor): + if isinstance(tensor, (list, tuple)): + return type(tensor)(smp_gather(t) for t in tensor) + elif isinstance(tensor, dict): + return type(tensor)({k: smp_gather(v) for k, v in tensor.items()}) + elif not isinstance(tensor, torch.Tensor): + raise TypeError( + f"Can't gather the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors." + ) + all_tensors = smp.allgather(tensor, smp.CommGroup.DP_GROUP) + all_tensors = [atleast_1d(t) for t in all_tensors] + return torch.cat([t.cpu() for t in all_tensors], dim=0) + + def smp_nested_concat(tensor): + if isinstance(tensor, (list, tuple)): + return type(tensor)(smp_nested_concat(t) for t in tensor) + elif isinstance(tensor, dict): + return type(tensor)({k: smp_nested_concat(v) for k, v in tensor.items()}) + # It doesn't seem possible to check here if `tensor` is a StepOutput because StepOutput lives in `smp.step` + # which is also the name of the decorator so Python is confused. + return tensor.concat().detach().cpu() diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/training_args.py b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/training_args.py new file mode 100644 index 0000000000..c9560cd8c0 --- /dev/null +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/transformers_modify/training_args.py @@ -0,0 +1,2458 @@ +# Copyright 2020 The HuggingFace Team. All rights reserved. +# Copyright 2023 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import contextlib +import io +import json +import math +import os +import warnings +from dataclasses import asdict, dataclass, field, fields +from datetime import timedelta +from enum import Enum +from pathlib import Path +from typing import Any, Dict, List, Optional, Union + +from packaging import version + +from .debug_utils import DebugOption +from .trainer_utils import ( + EvaluationStrategy, + FSDPOption, + HubStrategy, + IntervalStrategy, + SchedulerType, + ShardedDDPOption, +) +from .utils import ( + ExplicitEnum, + cached_property, + ccl_version, + get_full_repo_name, + is_accelerate_available, + is_psutil_available, + is_safetensors_available, + is_sagemaker_dp_enabled, + is_sagemaker_mp_enabled, + is_torch_available, + is_torch_bf16_cpu_available, + is_torch_bf16_gpu_available, + is_torch_neuroncore_available, + is_torch_tf32_available, + is_torch_tpu_available, + logging, + requires_backends, +) +from .utils.import_utils import is_optimum_neuron_available + + +logger = logging.get_logger(__name__) +log_levels = logging.get_log_levels_dict().copy() +trainer_log_levels = dict(**log_levels, passive=-1) + +if is_torch_available(): + import torch + import torch.distributed as dist + +if is_torch_tpu_available(check_device=False): + import torch_xla.core.xla_model as xm + +if is_torch_neuroncore_available(check_device=False): + # torchrun support + # https://github.com/pytorch/xla/pull/3609 + if os.environ.get("TORCHELASTIC_RUN_ID"): + if is_optimum_neuron_available(): + logger.info( + "Make sure that you are performing the training with the TrainiumTrainer from optimum[neuron], this " + "will fail otherwise." + ) + else: + logger.warning( + "Please use the TrainiumTrainer from optimum[neuron] instead of the Transformers library to perform " + "training on AWS Trainium instances. More information here: " + "https://github.com/huggingface/optimum-neuron" + ) + import torch_xla.distributed.xla_backend as xbn + + if not isinstance(torch.distributed.group.WORLD, xbn.ProcessGroupXla): + torch.distributed.init_process_group(backend="xla") + if not isinstance(torch.distributed.group.WORLD, xbn.ProcessGroupXla): + raise AssertionError("Failed to initialize torch.distributed process group using XLA backend.") + + +if is_sagemaker_mp_enabled(): + import smdistributed.modelparallel.torch as smp + + smp.init() + + +def default_logdir() -> str: + """ + Same default as PyTorch + """ + import socket + from datetime import datetime + + current_time = datetime.now().strftime("%b%d_%H-%M-%S") + return os.path.join("runs", current_time + "_" + socket.gethostname()) + + +def get_int_from_env(env_keys, default): + """Returns the first positive env value found in the `env_keys` list or the default.""" + for e in env_keys: + val = int(os.environ.get(e, -1)) + if val >= 0: + return val + return default + + +def get_xla_device_type(device: "torch.device") -> Optional[str]: + """ + Returns the xla device type (CPU|GPU|TPU) or None if the device is a non-xla device. + """ + if is_torch_tpu_available(): + return xm.xla_real_devices([device])[0].split(":")[0] + return None + + +class OptimizerNames(ExplicitEnum): + """ + Stores the acceptable string identifiers for optimizers. 
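+
+    Example (a sketch of how one of these identifiers is usually picked up through `TrainingArguments`;
+    the `output_dir` value is a placeholder, and `"adamw_torch_npu_fused"` is the identifier added for NPU):
+
+    ```python
+    args = TrainingArguments(output_dir="./output", optim="adamw_torch_npu_fused")
+    ```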
+ """ + + ADAMW_HF = "adamw_hf" + ADAMW_TORCH = "adamw_torch" + ADAMW_TORCH_FUSED = "adamw_torch_fused" + ADAMW_TORCH_XLA = "adamw_torch_xla" + ADAMW_APEX_FUSED = "adamw_apex_fused" + ADAFACTOR = "adafactor" + ADAMW_BNB = "adamw_bnb_8bit" + ADAMW_ANYPRECISION = "adamw_anyprecision" + SGD = "sgd" + ADAGRAD = "adagrad" + ADAMW_TORCH_NPU_FUSED = "adamw_torch_npu_fused" + + +@dataclass +class TrainingArguments: + """ + TrainingArguments is the subset of the arguments we use in our example scripts **which relate to the training loop + itself**. + + Using [`HfArgumentParser`] we can turn this class into + [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the + command line. + + Parameters: + output_dir (`str`): + The output directory where the model predictions and checkpoints will be written. + overwrite_output_dir (`bool`, *optional*, defaults to `False`): + If `True`, overwrite the content of the output directory. Use this to continue training if `output_dir` + points to a checkpoint directory. + do_train (`bool`, *optional*, defaults to `False`): + Whether to run training or not. This argument is not directly used by [`Trainer`], it's intended to be used + by your training/evaluation scripts instead. See the [example + scripts](https://github.com/huggingface/transformers/tree/main/examples) for more details. + do_eval (`bool`, *optional*): + Whether to run evaluation on the validation set or not. Will be set to `True` if `evaluation_strategy` is + different from `"no"`. This argument is not directly used by [`Trainer`], it's intended to be used by your + training/evaluation scripts instead. See the [example + scripts](https://github.com/huggingface/transformers/tree/main/examples) for more details. + do_predict (`bool`, *optional*, defaults to `False`): + Whether to run predictions on the test set or not. This argument is not directly used by [`Trainer`], it's + intended to be used by your training/evaluation scripts instead. See the [example + scripts](https://github.com/huggingface/transformers/tree/main/examples) for more details. + evaluation_strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"no"`): + The evaluation strategy to adopt during training. Possible values are: + + - `"no"`: No evaluation is done during training. + - `"steps"`: Evaluation is done (and logged) every `eval_steps`. + - `"epoch"`: Evaluation is done at the end of each epoch. + + prediction_loss_only (`bool`, *optional*, defaults to `False`): + When performing evaluation and generating predictions, only returns the loss. + per_device_train_batch_size (`int`, *optional*, defaults to 8): + The batch size per GPU/TPU core/CPU for training. + per_device_eval_batch_size (`int`, *optional*, defaults to 8): + The batch size per GPU/TPU core/CPU for evaluation. + gradient_accumulation_steps (`int`, *optional*, defaults to 1): + Number of updates steps to accumulate the gradients for, before performing a backward/update pass. + + + + When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, + evaluation, save will be conducted every `gradient_accumulation_steps * xxx_step` training examples. + + + + eval_accumulation_steps (`int`, *optional*): + Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If + left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but + requires more memory). 
+ eval_delay (`float`, *optional*): + Number of epochs or steps to wait for before the first evaluation can be performed, depending on the + evaluation_strategy. + learning_rate (`float`, *optional*, defaults to 5e-5): + The initial learning rate for [`AdamW`] optimizer. + weight_decay (`float`, *optional*, defaults to 0): + The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in [`AdamW`] + optimizer. + adam_beta1 (`float`, *optional*, defaults to 0.9): + The beta1 hyperparameter for the [`AdamW`] optimizer. + adam_beta2 (`float`, *optional*, defaults to 0.999): + The beta2 hyperparameter for the [`AdamW`] optimizer. + adam_epsilon (`float`, *optional*, defaults to 1e-8): + The epsilon hyperparameter for the [`AdamW`] optimizer. + max_grad_norm (`float`, *optional*, defaults to 1.0): + Maximum gradient norm (for gradient clipping). + num_train_epochs(`float`, *optional*, defaults to 3.0): + Total number of training epochs to perform (if not an integer, will perform the decimal part percents of + the last epoch before stopping training). + max_steps (`int`, *optional*, defaults to -1): + If set to a positive number, the total number of training steps to perform. Overrides `num_train_epochs`. + In case of using a finite iterable dataset the training may stop before reaching the set number of steps + when all data is exhausted + lr_scheduler_type (`str` or [`SchedulerType`], *optional*, defaults to `"linear"`): + The scheduler type to use. See the documentation of [`SchedulerType`] for all possible values. + warmup_ratio (`float`, *optional*, defaults to 0.0): + Ratio of total training steps used for a linear warmup from 0 to `learning_rate`. + warmup_steps (`int`, *optional*, defaults to 0): + Number of steps used for a linear warmup from 0 to `learning_rate`. Overrides any effect of `warmup_ratio`. + log_level (`str`, *optional*, defaults to `passive`): + Logger log level to use on the main process. Possible choices are the log levels as strings: 'debug', + 'info', 'warning', 'error' and 'critical', plus a 'passive' level which doesn't set anything and keeps the + current log level for the Transformers library (which will be `"warning"` by default). + log_level_replica (`str`, *optional*, defaults to `"warning"`): + Logger log level to use on replicas. Same choices as `log_level`" + log_on_each_node (`bool`, *optional*, defaults to `True`): + In multinode distributed training, whether to log using `log_level` once per node, or only on the main + node. + logging_dir (`str`, *optional*): + [TensorBoard](https://www.tensorflow.org/tensorboard) log directory. Will default to + *output_dir/runs/**CURRENT_DATETIME_HOSTNAME***. + logging_strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`): + The logging strategy to adopt during training. Possible values are: + + - `"no"`: No logging is done during training. + - `"epoch"`: Logging is done at the end of each epoch. + - `"steps"`: Logging is done every `logging_steps`. + + logging_first_step (`bool`, *optional*, defaults to `False`): + Whether to log and evaluate the first `global_step` or not. + logging_steps (`int`, *optional*, defaults to 500): + Number of update steps between two logs if `logging_strategy="steps"`. + logging_nan_inf_filter (`bool`, *optional*, defaults to `True`): + Whether to filter `nan` and `inf` losses for logging. 
If set to `True` the loss of every step that is `nan` + or `inf` is filtered and the average loss of the current logging window is taken instead. + + + + `logging_nan_inf_filter` only influences the logging of loss values, it does not change the behavior the + gradient is computed or applied to the model. + + + + save_strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`): + The checkpoint save strategy to adopt during training. Possible values are: + + - `"no"`: No save is done during training. + - `"epoch"`: Save is done at the end of each epoch. + - `"steps"`: Save is done every `save_steps`. + save_steps (`int`, *optional*, defaults to 500): + Number of updates steps before two checkpoint saves if `save_strategy="steps"`. + save_total_limit (`int`, *optional*): + If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in + `output_dir`. + save_safetensors (`bool`, *optional*, defaults to `False`): + Use [safetensors](https://huggingface.co/docs/safetensors) saving and loading for state dicts instead of + default `torch.load` and `torch.save`. + save_on_each_node (`bool`, *optional*, defaults to `False`): + When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on + the main one. + + This should not be activated when the different nodes use the same storage as the files will be saved with + the same names for each node. + no_cuda (`bool`, *optional*, defaults to `False`): + Whether to not use CUDA even when it is available or not. + seed (`int`, *optional*, defaults to 42): + Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the + [`~Trainer.model_init`] function to instantiate the model if it has some randomly initialized parameters. + data_seed (`int`, *optional*): + Random seed to be used with data samplers. If not set, random generators for data sampling will use the + same seed as `seed`. This can be used to ensure reproducibility of data sampling, independent of the model + seed. + jit_mode_eval (`bool`, *optional*, defaults to `False`): + Whether or not to use PyTorch jit trace for inference. + use_ipex (`bool`, *optional*, defaults to `False`): + Use Intel extension for PyTorch when it is available. [IPEX + installation](https://github.com/intel/intel-extension-for-pytorch). + bf16 (`bool`, *optional*, defaults to `False`): + Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires Ampere or higher + NVIDIA architecture or using CPU (no_cuda). This is an experimental API and it may change. + fp16 (`bool`, *optional*, defaults to `False`): + Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training. + fp16_opt_level (`str`, *optional*, defaults to 'O1'): + For `fp16` training, Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details on + the [Apex documentation](https://nvidia.github.io/apex/amp). + fp16_backend (`str`, *optional*, defaults to `"auto"`): + This argument is deprecated. Use `half_precision_backend` instead. + half_precision_backend (`str`, *optional*, defaults to `"auto"`): + The backend to use for mixed precision training. Must be one of `"auto", "cuda_amp", "apex", "cpu_amp"`. + `"auto"` will use CPU/CUDA AMP or APEX depending on the PyTorch version detected, while the other choices + will force the requested backend. 
+ bf16_full_eval (`bool`, *optional*, defaults to `False`): + Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm + metric values. This is an experimental API and it may change. + fp16_full_eval (`bool`, *optional*, defaults to `False`): + Whether to use full float16 evaluation instead of 32-bit. This will be faster and save memory but can harm + metric values. + tf32 (`bool`, *optional*): + Whether to enable the TF32 mode, available in Ampere and newer GPU architectures. The default value depends + on PyTorch's version default of `torch.backends.cuda.matmul.allow_tf32`. For more details please refer to + the [TF32](https://huggingface.co/docs/transformers/performance#tf32) documentation. This is an + experimental API and it may change. + local_rank (`int`, *optional*, defaults to -1): + Rank of the process during distributed training. + xpu_backend (`str`, *optional*): + The backend to use for xpu distributed training. Must be one of `"mpi"` or `"ccl"` or `"gloo"`. + tpu_num_cores (`int`, *optional*): + When training on TPU, the number of TPU cores (automatically passed by launcher script). + dataloader_drop_last (`bool`, *optional*, defaults to `False`): + Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) + or not. + eval_steps (`int`, *optional*): + Number of update steps between two evaluations if `evaluation_strategy="steps"`. Will default to the same + value as `logging_steps` if not set. + dataloader_num_workers (`int`, *optional*, defaults to 0): + Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the + main process. + past_index (`int`, *optional*, defaults to -1): + Some models like [TransformerXL](../model_doc/transformerxl) or [XLNet](../model_doc/xlnet) can make use of + the past hidden states for their predictions. If this argument is set to a positive int, the `Trainer` will + use the corresponding output (usually index 2) as the past state and feed it to the model at the next + training step under the keyword argument `mems`. + run_name (`str`, *optional*): + A descriptor for the run. Typically used for [wandb](https://www.wandb.com/) and + [mlflow](https://www.mlflow.org/) logging. + disable_tqdm (`bool`, *optional*): + Whether or not to disable the tqdm progress bars and table of metrics produced by + [`~notebook.NotebookTrainingTracker`] in Jupyter Notebooks. Will default to `True` if the logging level is + set to warn or lower (default), `False` otherwise. + remove_unused_columns (`bool`, *optional*, defaults to `True`): + Whether or not to automatically remove the columns unused by the model forward method. + + (Note that this behavior is not implemented for [`TFTrainer`] yet.) + label_names (`List[str]`, *optional*): + The list of keys in your dictionary of inputs that correspond to the labels. + + Will eventually default to the list of argument names accepted by the model that contain the word "label", + except if the model used is one of the `XxxForQuestionAnswering` in which case it will also include the + `["start_positions", "end_positions"]` keys. + load_best_model_at_end (`bool`, *optional*, defaults to `False`): + Whether or not to load the best model found during training at the end of training. + + + + When set to `True`, the parameters `save_strategy` needs to be the same as `evaluation_strategy`, and in + the case it is "steps", `save_steps` must be a round multiple of `eval_steps`. 
+
+
+
+ metric_for_best_model (`str`, *optional*):
+ Use in conjunction with `load_best_model_at_end` to specify the metric to use to compare two different
+ models. Must be the name of a metric returned by the evaluation with or without the prefix `"eval_"`. Will
+ default to `"loss"` if unspecified and `load_best_model_at_end=True` (to use the evaluation loss).
+
+ If you set this value, `greater_is_better` will default to `True`. Don't forget to set it to `False` if
+ your metric is better when lower.
+ greater_is_better (`bool`, *optional*):
+ Use in conjunction with `load_best_model_at_end` and `metric_for_best_model` to specify if better models
+ should have a greater metric or not. Will default to:
+
+ - `True` if `metric_for_best_model` is set to a value that isn't `"loss"` or `"eval_loss"`.
+ - `False` if `metric_for_best_model` is not set, or set to `"loss"` or `"eval_loss"`.
+ ignore_data_skip (`bool`, *optional*, defaults to `False`):
+ When resuming training, whether or not to skip the epochs and batches to get the data loading at the same
+ stage as in the previous training. If set to `True`, the training will begin faster (as that skipping step
+ can take a long time) but will not yield the same results as the interrupted training would have.
+ sharded_ddp (`bool`, `str` or list of [`~trainer_utils.ShardedDDPOption`], *optional*, defaults to `False`):
+ Use Sharded DDP training from [FairScale](https://github.com/facebookresearch/fairscale) (in distributed
+ training only). This is an experimental feature.
+
+ A list of options along the following:
+
+ - `"simple"`: to use the first instance of sharded DDP released by fairscale (`ShardedDDP`), similar to ZeRO-2.
+ - `"zero_dp_2"`: to use the second instance of sharded DDP released by fairscale (`FullyShardedDDP`) in
+ Zero-2 mode (with `reshard_after_forward=False`).
+ - `"zero_dp_3"`: to use the second instance of sharded DDP released by fairscale (`FullyShardedDDP`) in
+ Zero-3 mode (with `reshard_after_forward=True`).
+ - `"offload"`: to add ZeRO-offload (only compatible with `"zero_dp_2"` and `"zero_dp_3"`).
+
+ If a string is passed, it will be split on space. If a bool is passed, it will be converted to an empty
+ list for `False` and `["simple"]` for `True`.
+ fsdp (`bool`, `str` or list of [`~trainer_utils.FSDPOption`], *optional*, defaults to `False`):
+ Use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training only).
+
+ A list of options along the following:
+
+ - `"full_shard"`: Shard parameters, gradients and optimizer states.
+ - `"shard_grad_op"`: Shard optimizer states and gradients.
+ - `"offload"`: Offload parameters and gradients to CPUs (only compatible with `"full_shard"` and
+ `"shard_grad_op"`).
+ - `"auto_wrap"`: Automatically recursively wrap layers with FSDP using `default_auto_wrap_policy`.
+ fsdp_config (`str` or `dict`, *optional*):
+ Config to be used with FSDP (PyTorch Fully Sharded Data Parallel). The value is either the location of an
+ FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`.
+
+ A list of config options:
+ - fsdp_min_num_params (`int`, *optional*, defaults to `0`):
+ FSDP's minimum number of parameters for Default Auto Wrapping (useful only when `fsdp` field is
+ passed).
+ - fsdp_transformer_layer_cls_to_wrap (`List[str]`, *optional*):
+ List of transformer layer class names (case-sensitive) to wrap, e.g., `BertLayer`, `GPTJBlock`,
+ `T5Block` .... (useful only when `fsdp` flag is passed).
+ - fsdp_backward_prefetch (`str`, *optional*):
+ FSDP's backward prefetch mode. Controls when to prefetch the next set of parameters (useful only when
+ `fsdp` field is passed).
+
+ A list of options along the following:
+
+ - `"backward_pre"`: Prefetches the next set of parameters before the current set of parameters'
+ gradient computation.
+ - `"backward_post"`: Prefetches the next set of parameters after the current set of parameters'
+ gradient computation.
+ - fsdp_forward_prefetch (`bool`, *optional*, defaults to `False`):
+ FSDP's forward prefetch mode (useful only when `fsdp` field is passed).
+ If `"True"`, then FSDP explicitly prefetches the next upcoming all-gather while executing in the
+ forward pass.
+ - limit_all_gathers (`bool`, *optional*, defaults to `False`):
+ FSDP's limit_all_gathers (useful only when `fsdp` field is passed).
+ If `"True"`, FSDP explicitly synchronizes the CPU thread to prevent too many in-flight
+ all-gathers.
+ - xla (`bool`, *optional*, defaults to `False`):
+ Whether to use PyTorch/XLA Fully Sharded Data Parallel Training. This is an experimental feature
+ and its API may evolve in the future.
+ - xla_fsdp_settings (`dict`, *optional*):
+ The value is a dictionary which stores the XLA FSDP wrapping parameters.
+
+ For a complete list of options, please see [here](
+ https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py).
+ - xla_fsdp_grad_ckpt (`bool`, *optional*, defaults to `False`):
+ Will use gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be
+ used when the xla flag is set to true, and an auto wrapping policy is specified through
+ fsdp_min_num_params or fsdp_transformer_layer_cls_to_wrap.
+
+ deepspeed (`str` or `dict`, *optional*):
+ Use [Deepspeed](https://github.com/microsoft/deepspeed). This is an experimental feature and its API may
+ evolve in the future. The value is either the location of a DeepSpeed json config file (e.g.,
+ `ds_config.json`) or an already loaded json file as a `dict`.
+ label_smoothing_factor (`float`, *optional*, defaults to 0.0):
+ The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying one-hot-encoded
+ labels are changed from 0s and 1s to `label_smoothing_factor/num_labels` and `1 - label_smoothing_factor +
+ label_smoothing_factor/num_labels` respectively.
+ debug (`str` or list of [`~debug_utils.DebugOption`], *optional*, defaults to `""`):
+ Enable one or more debug features. This is an experimental feature.
+
+ Possible options are:
+
+ - `"underflow_overflow"`: detects overflow in the model's input/outputs and reports the last frames that
+ led to the event
+ - `"tpu_metrics_debug"`: print debug metrics on TPU
+
+ The options should be separated by whitespaces.
+ optim (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_hf"`):
+ The optimizer to use: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision or
+ adafactor.
+ optim_args (`str`, *optional*):
+ Optional arguments that are supplied to AnyPrecisionAdamW.
+ group_by_length (`bool`, *optional*, defaults to `False`):
+ Whether or not to group together samples of roughly the same length in the training dataset (to minimize
+ padding applied and be more efficient). Only useful if applying dynamic padding.
+ length_column_name (`str`, *optional*, defaults to `"length"`):
+ Column name for precomputed lengths.
If the column exists, grouping by length will use these values rather
+ than computing them on train startup. Ignored unless `group_by_length` is `True` and the dataset is an
+ instance of `Dataset`.
+ report_to (`str` or `List[str]`, *optional*, defaults to `"all"`):
+ The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,
+ `"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`, `"clearml"` and `"wandb"`. Use `"all"` to report to
+ all integrations installed, `"none"` for no integrations.
+ ddp_find_unused_parameters (`bool`, *optional*):
+ When using distributed training, the value of the flag `find_unused_parameters` passed to
+ `DistributedDataParallel`. Will default to `False` if gradient checkpointing is used, `True` otherwise.
+ ddp_bucket_cap_mb (`int`, *optional*):
+ When using distributed training, the value of the flag `bucket_cap_mb` passed to `DistributedDataParallel`.
+ dataloader_pin_memory (`bool`, *optional*, defaults to `True`):
+ Whether you want to pin memory in data loaders or not. Will default to `True`.
+ skip_memory_metrics (`bool`, *optional*, defaults to `True`):
+ Whether to skip adding memory profiler reports to metrics. This is skipped by default because it slows
+ down the training and evaluation speed.
+ push_to_hub (`bool`, *optional*, defaults to `False`):
+ Whether or not to push the model to the Hub every time the model is saved. If this is activated,
+ `output_dir` will become a git repository synced with the repo (determined by `hub_model_id`) and the
+ content will be pushed each time a save is triggered (depending on your `save_strategy`). Calling
+ [`~Trainer.save_model`] will also trigger a push.
+
+
+
+ If `output_dir` exists, it needs to be a local clone of the repository to which the [`Trainer`] will be
+ pushed.
+
+
+
+ resume_from_checkpoint (`str`, *optional*):
+ The path to a folder with a valid checkpoint for your model. This argument is not directly used by
+ [`Trainer`], it's intended to be used by your training/evaluation scripts instead. See the [example
+ scripts](https://github.com/huggingface/transformers/tree/main/examples) for more details.
+ hub_model_id (`str`, *optional*):
+ The name of the repository to keep in sync with the local *output_dir*. It can be a simple model ID, in
+ which case the model will be pushed to your namespace. Otherwise it should be the whole repository name,
+ for instance `"user_name/model"`, which allows you to push to an organization you are a member of with
+ `"organization_name/model"`. Will default to `user_name/output_dir_name` with *output_dir_name* being the
+ name of `output_dir`.
+ hub_strategy (`str` or [`~trainer_utils.HubStrategy`], *optional*, defaults to `"every_save"`):
+ Defines the scope of what is pushed to the Hub and when. Possible values are:
+
+ - `"end"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`]) and a
+ draft of a model card when the [`~Trainer.save_model`] method is called.
+ - `"every_save"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`]) and
+ a draft of a model card each time there is a model save. The pushes are asynchronous so as not to block
+ training, and in case the saves are very frequent, a new push is only attempted if the previous one is
+ finished. A last push is made with the final model at the end of training.
+ - `"checkpoint"`: like `"every_save"` but the latest checkpoint is also pushed in a subfolder named
+ last-checkpoint, allowing you to resume training easily with
+ `trainer.train(resume_from_checkpoint="last-checkpoint")`.
+ - `"all_checkpoints"`: like `"checkpoint"` but all checkpoints are pushed as they appear in the output
+ folder (so you will get one checkpoint folder per folder in your final repository).
+
+ hub_token (`str`, *optional*):
+ The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with
+ `huggingface-cli login`.
+ hub_private_repo (`bool`, *optional*, defaults to `False`):
+ If True, the Hub repo will be set to private.
+ gradient_checkpointing (`bool`, *optional*, defaults to `False`):
+ If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
+ include_inputs_for_metrics (`bool`, *optional*, defaults to `False`):
+ Whether or not the inputs will be passed to the `compute_metrics` function. This is intended for metrics
+ that need inputs, predictions and references for scoring calculation in the Metric class.
+ auto_find_batch_size (`bool`, *optional*, defaults to `False`):
+ Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding
+ CUDA Out-of-Memory errors. Requires accelerate to be installed (`pip install accelerate`).
+ full_determinism (`bool`, *optional*, defaults to `False`):
+ If `True`, [`enable_full_determinism`] is called instead of [`set_seed`] to ensure reproducible results in
+ distributed training. Important: this will negatively impact the performance, so only use it for debugging.
+ torchdynamo (`str`, *optional*):
+ If set, the backend compiler for TorchDynamo. Possible choices are `"eager"`, `"aot_eager"`, `"inductor"`,
+ `"nvfuser"`, `"aot_nvfuser"`, `"aot_cudagraphs"`, `"ofi"`, `"fx2trt"`, `"onnxrt"` and `"ipex"`.
+ ray_scope (`str`, *optional*, defaults to `"last"`):
+ The scope to use when doing hyperparameter search with Ray. By default, `"last"` will be used. Ray will
+ then use the last checkpoint of all trials, compare those, and select the best one. However, other options
+ are also available. See the [Ray documentation](
+ https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
+ more options.
+ ddp_timeout (`int`, *optional*, defaults to 1800):
+ The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when
+ performing slow operations in distributed runs. Please refer to the [PyTorch documentation]
+ (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
+ information.
+ use_mps_device (`bool`, *optional*, defaults to `False`):
+ Whether to use the Apple Silicon chip based `mps` device.
+ torch_compile (`bool`, *optional*, defaults to `False`):
+ Whether or not to compile the model using PyTorch 2.0
+ [`torch.compile`](https://pytorch.org/get-started/pytorch-2.0/).
+
+ This will use the best defaults for the [`torch.compile`
+ API](https://pytorch.org/docs/2.0/generated/torch.compile.html?highlight=torch+compile#torch.compile). You
+ can customize the defaults with the arguments `torch_compile_backend` and `torch_compile_mode`, but we don't
+ guarantee any of them will work as the support is progressively rolled out in PyTorch.
+
+ This flag and the whole compile API are experimental and subject to change in future releases.
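+
+ As a minimal sketch (placeholder values, not taken from this file), compilation is enabled simply by:
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Placeholder values for illustration: with torch_compile=True the default backend
+ # is used unless a specific one is forced via torch_compile_backend (described below).
+ args = TrainingArguments(output_dir="./out", torch_compile=True)
+ ```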
+ torch_compile_backend (`str`, *optional*): + The backend to use in `torch.compile`. If set to any value, `torch_compile` will be set to `True`. + + Refer to the PyTorch doc for possible values and note that they may change across PyTorch versions. + + This flag is experimental and subject to change in future releases. + torch_compile_mode (`str`, *optional*): + The mode to use in `torch.compile`. If set to any value, `torch_compile` will be set to `True`. + + Refer to the PyTorch doc for possible values and note that they may change across PyTorch versions. + + This flag is experimental and subject to change in future releases. + """ + + framework = "pt" + output_dir: str = field( + metadata={"help": "The output directory where the model predictions and checkpoints will be written."}, + ) + overwrite_output_dir: bool = field( + default=False, + metadata={ + "help": ( + "Overwrite the content of the output directory. " + "Use this to continue training if output_dir points to a checkpoint directory." + ) + }, + ) + + do_train: bool = field(default=False, metadata={"help": "Whether to run training."}) + do_eval: bool = field(default=False, metadata={"help": "Whether to run eval on the dev set."}) + do_predict: bool = field(default=False, metadata={"help": "Whether to run predictions on the test set."}) + evaluation_strategy: Union[IntervalStrategy, str] = field( + default="no", + metadata={"help": "The evaluation strategy to use."}, + ) + prediction_loss_only: bool = field( + default=False, + metadata={"help": "When performing evaluation and predictions, only returns the loss."}, + ) + + per_device_train_batch_size: int = field( + default=8, metadata={"help": "Batch size per GPU/TPU core/CPU for training."} + ) + per_device_eval_batch_size: int = field( + default=8, metadata={"help": "Batch size per GPU/TPU core/CPU for evaluation."} + ) + + per_gpu_train_batch_size: Optional[int] = field( + default=None, + metadata={ + "help": ( + "Deprecated, the use of `--per_device_train_batch_size` is preferred. " + "Batch size per GPU/TPU core/CPU for training." + ) + }, + ) + per_gpu_eval_batch_size: Optional[int] = field( + default=None, + metadata={ + "help": ( + "Deprecated, the use of `--per_device_eval_batch_size` is preferred. " + "Batch size per GPU/TPU core/CPU for evaluation." + ) + }, + ) + + gradient_accumulation_steps: int = field( + default=1, + metadata={"help": "Number of updates steps to accumulate before performing a backward/update pass."}, + ) + eval_accumulation_steps: Optional[int] = field( + default=None, + metadata={"help": "Number of predictions steps to accumulate before moving the tensors to the CPU."}, + ) + + eval_delay: Optional[float] = field( + default=0, + metadata={ + "help": ( + "Number of epochs or steps to wait for before the first evaluation can be performed, depending on the" + " evaluation_strategy." 
+ ) + }, + ) + + learning_rate: float = field(default=5e-5, metadata={"help": "The initial learning rate for AdamW."}) + weight_decay: float = field(default=0.0, metadata={"help": "Weight decay for AdamW if we apply some."}) + adam_beta1: float = field(default=0.9, metadata={"help": "Beta1 for AdamW optimizer"}) + adam_beta2: float = field(default=0.999, metadata={"help": "Beta2 for AdamW optimizer"}) + adam_epsilon: float = field(default=1e-8, metadata={"help": "Epsilon for AdamW optimizer."}) + max_grad_norm: float = field(default=1.0, metadata={"help": "Max gradient norm."}) + + num_train_epochs: float = field(default=3.0, metadata={"help": "Total number of training epochs to perform."}) + max_steps: int = field( + default=-1, + metadata={"help": "If > 0: set total number of training steps to perform. Override num_train_epochs."}, + ) + lr_scheduler_type: Union[SchedulerType, str] = field( + default="linear", + metadata={"help": "The scheduler type to use."}, + ) + warmup_ratio: float = field( + default=0.0, metadata={"help": "Linear warmup over warmup_ratio fraction of total steps."} + ) + warmup_steps: int = field(default=0, metadata={"help": "Linear warmup over warmup_steps."}) + + log_level: Optional[str] = field( + default="passive", + metadata={ + "help": ( + "Logger log level to use on the main node. Possible choices are the log levels as strings: 'debug'," + " 'info', 'warning', 'error' and 'critical', plus a 'passive' level which doesn't set anything and" + " lets the application set the level. Defaults to 'passive'." + ), + "choices": trainer_log_levels.keys(), + }, + ) + log_level_replica: Optional[str] = field( + default="warning", + metadata={ + "help": "Logger log level to use on replica nodes. Same choices and defaults as ``log_level``", + "choices": trainer_log_levels.keys(), + }, + ) + log_on_each_node: bool = field( + default=True, + metadata={ + "help": ( + "When doing a multinode distributed training, whether to log once per node or just once on the main" + " node." + ) + }, + ) + logging_dir: Optional[str] = field(default=None, metadata={"help": "Tensorboard log dir."}) + logging_strategy: Union[IntervalStrategy, str] = field( + default="steps", + metadata={"help": "The logging strategy to use."}, + ) + logging_first_step: bool = field(default=False, metadata={"help": "Log the first global_step"}) + logging_steps: int = field(default=500, metadata={"help": "Log every X updates steps."}) + logging_nan_inf_filter: bool = field(default=True, metadata={"help": "Filter nan and inf losses for logging."}) + save_strategy: Union[IntervalStrategy, str] = field( + default="steps", + metadata={"help": "The checkpoint save strategy to use."}, + ) + save_steps: int = field(default=500, metadata={"help": "Save checkpoint every X updates steps."}) + save_total_limit: Optional[int] = field( + default=None, + metadata={ + "help": ( + "Limit the total amount of checkpoints. " + "Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints" + ) + }, + ) + save_safetensors: Optional[bool] = field( + default=False, + metadata={ + "help": "Use safetensors saving and loading for state dicts instead of default torch.load and torch.save." 
+ }, + ) + save_on_each_node: bool = field( + default=False, + metadata={ + "help": ( + "When doing multi-node distributed training, whether to save models and checkpoints on each node, or" + " only on the main one" + ) + }, + ) + no_cuda: bool = field(default=False, metadata={"help": "Do not use CUDA even when it is available"}) + use_mps_device: bool = field( + default=False, metadata={"help": "Whether to use Apple Silicon chip based `mps` device."} + ) + seed: int = field(default=42, metadata={"help": "Random seed that will be set at the beginning of training."}) + data_seed: Optional[int] = field(default=None, metadata={"help": "Random seed to be used with data samplers."}) + jit_mode_eval: bool = field( + default=False, metadata={"help": "Whether or not to use PyTorch jit trace for inference"} + ) + use_ipex: bool = field( + default=False, + metadata={ + "help": ( + "Use Intel extension for PyTorch when it is available, installation:" + " 'https://github.com/intel/intel-extension-for-pytorch'" + ) + }, + ) + bf16: bool = field( + default=False, + metadata={ + "help": ( + "Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA" + " architecture or using CPU (no_cuda). This is an experimental API and it may change." + ) + }, + ) + fp16: bool = field( + default=False, + metadata={"help": "Whether to use fp16 (mixed) precision instead of 32-bit"}, + ) + fp16_opt_level: str = field( + default="O1", + metadata={ + "help": ( + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " + "See details at https://nvidia.github.io/apex/amp.html" + ) + }, + ) + half_precision_backend: str = field( + default="auto", + metadata={ + "help": "The backend to be used for half precision.", + "choices": ["auto", "cuda_amp", "apex", "cpu_amp"], + }, + ) + bf16_full_eval: bool = field( + default=False, + metadata={ + "help": ( + "Whether to use full bfloat16 evaluation instead of 32-bit. This is an experimental API and it may" + " change." + ) + }, + ) + fp16_full_eval: bool = field( + default=False, + metadata={"help": "Whether to use full float16 evaluation instead of 32-bit"}, + ) + tf32: Optional[bool] = field( + default=None, + metadata={ + "help": ( + "Whether to enable tf32 mode, available in Ampere and newer GPU architectures. This is an experimental" + " API and it may change." + ) + }, + ) + local_rank: int = field(default=-1, metadata={"help": "For distributed training: local_rank"}) + xpu_backend: Optional[str] = field( + default=None, + metadata={ + "help": "The backend to be used for distributed training on Intel XPU.", + "choices": ["mpi", "ccl", "gloo"], + }, + ) + tpu_num_cores: Optional[int] = field( + default=None, metadata={"help": "TPU: Number of TPU cores (automatically passed by launcher script)"} + ) + tpu_metrics_debug: bool = field( + default=False, + metadata={ + "help": ( + "Deprecated, the use of `--debug tpu_metrics_debug` is preferred. TPU: Whether to print debug metrics" + ) + }, + ) + debug: str = field( + default="", + metadata={ + "help": ( + "Whether or not to enable debug mode. Current options: " + "`underflow_overflow` (Detect underflow and overflow in activations and weights), " + "`tpu_metrics_debug` (print debug metrics on TPU)." 
+ ) + }, + ) + + dataloader_drop_last: bool = field( + default=False, metadata={"help": "Drop the last incomplete batch if it is not divisible by the batch size."} + ) + eval_steps: Optional[int] = field(default=None, metadata={"help": "Run an evaluation every X steps."}) + dataloader_num_workers: int = field( + default=0, + metadata={ + "help": ( + "Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded" + " in the main process." + ) + }, + ) + + past_index: int = field( + default=-1, + metadata={"help": "If >=0, uses the corresponding part of the output as the past state for next step."}, + ) + + run_name: Optional[str] = field( + default=None, metadata={"help": "An optional descriptor for the run. Notably used for wandb logging."} + ) + disable_tqdm: Optional[bool] = field( + default=None, metadata={"help": "Whether or not to disable the tqdm progress bars."} + ) + + remove_unused_columns: Optional[bool] = field( + default=True, metadata={"help": "Remove columns not required by the model when using an nlp.Dataset."} + ) + label_names: Optional[List[str]] = field( + default=None, metadata={"help": "The list of keys in your dictionary of inputs that correspond to the labels."} + ) + + load_best_model_at_end: Optional[bool] = field( + default=False, + metadata={"help": "Whether or not to load the best model found during training at the end of training."}, + ) + metric_for_best_model: Optional[str] = field( + default=None, metadata={"help": "The metric to use to compare two different models."} + ) + greater_is_better: Optional[bool] = field( + default=None, metadata={"help": "Whether the `metric_for_best_model` should be maximized or not."} + ) + ignore_data_skip: bool = field( + default=False, + metadata={ + "help": ( + "When resuming training, whether or not to skip the first epochs and batches to get to the same" + " training data." + ) + }, + ) + sharded_ddp: str = field( + default="", + metadata={ + "help": ( + "Whether or not to use sharded DDP training (in distributed training only). The base option should be" + " `simple`, `zero_dp_2` or `zero_dp_3` and you can add CPU-offload to `zero_dp_2` or `zero_dp_3` like" + " this: zero_dp_2 offload` or `zero_dp_3 offload`. You can add auto-wrap to `zero_dp_2` or `zero_dp_3`" + " with the same syntax: zero_dp_2 auto_wrap` or `zero_dp_3 auto_wrap`." + ), + }, + ) + fsdp: str = field( + default="", + metadata={ + "help": ( + "Whether or not to use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training" + " only). The base option should be `full_shard`, `shard_grad_op` or `no_shard` and you can add" + " CPU-offload to `full_shard` or `shard_grad_op` like this: full_shard offload` or `shard_grad_op" + " offload`. You can add auto-wrap to `full_shard` or `shard_grad_op` with the same syntax: full_shard" + " auto_wrap` or `shard_grad_op auto_wrap`." + ), + }, + ) + fsdp_min_num_params: int = field( + default=0, + metadata={ + "help": ( + "This parameter is deprecated. FSDP's minimum number of parameters for Default Auto Wrapping. (useful" + " only when `fsdp` field is passed)." + ) + }, + ) + fsdp_config: Optional[str] = field( + default=None, + metadata={ + "help": ( + "Config to be used with FSDP (Pytorch Fully Sharded Data Parallel). The value is either a" + "fsdp json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`." 
+ ) + }, + ) + fsdp_transformer_layer_cls_to_wrap: Optional[str] = field( + default=None, + metadata={ + "help": ( + "This parameter is deprecated. Transformer layer class name (case-sensitive) to wrap, e.g," + " `BertLayer`, `GPTJBlock`, `T5Block` .... (useful only when `fsdp` flag is passed)." + ) + }, + ) + deepspeed: Optional[str] = field( + default=None, + metadata={ + "help": ( + "Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json) or an already" + " loaded json file as a dict" + ) + }, + ) + label_smoothing_factor: float = field( + default=0.0, metadata={"help": "The label smoothing epsilon to apply (zero means no label smoothing)."} + ) + + default_optim = "adamw_hf" + # XXX: enable when pytorch==2.0.1 comes out - we want to give it time to get all the bugs sorted out + # if is_torch_available() and version.parse(version.parse(torch.__version__).base_version) >= version.parse("2.1.0"): + # default_optim = "adamw_torch_fused" + # and update the doc above to: + # optim (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_torch_fused"` (for torch<2.1.0 `"adamw_hf"`): + optim: Union[OptimizerNames, str] = field( + default=default_optim, + metadata={"help": "The optimizer to use."}, + ) + optim_args: Optional[str] = field(default=None, metadata={"help": "Optional arguments to supply to optimizer."}) + adafactor: bool = field(default=False, metadata={"help": "Whether or not to replace AdamW by Adafactor."}) + group_by_length: bool = field( + default=False, + metadata={"help": "Whether or not to group samples of roughly the same length together when batching."}, + ) + length_column_name: Optional[str] = field( + default="length", + metadata={"help": "Column name with precomputed lengths to use when grouping by length."}, + ) + report_to: Optional[List[str]] = field( + default=None, metadata={"help": "The list of integrations to report the results and logs to."} + ) + ddp_find_unused_parameters: Optional[bool] = field( + default=None, + metadata={ + "help": ( + "When using distributed training, the value of the flag `find_unused_parameters` passed to " + "`DistributedDataParallel`." + ) + }, + ) + ddp_bucket_cap_mb: Optional[int] = field( + default=None, + metadata={ + "help": ( + "When using distributed training, the value of the flag `bucket_cap_mb` passed to " + "`DistributedDataParallel`." 
+ ) + }, + ) + dataloader_pin_memory: bool = field( + default=True, metadata={"help": "Whether or not to pin memory for DataLoader."} + ) + skip_memory_metrics: bool = field( + default=True, metadata={"help": "Whether or not to skip adding of memory profiler reports to metrics."} + ) + use_legacy_prediction_loop: bool = field( + default=False, metadata={"help": "Whether or not to use the legacy prediction_loop in the Trainer."} + ) + push_to_hub: bool = field( + default=False, metadata={"help": "Whether or not to upload the trained model to the model hub after training."} + ) + resume_from_checkpoint: Optional[str] = field( + default=None, + metadata={"help": "The path to a folder with a valid checkpoint for your model."}, + ) + hub_model_id: Optional[str] = field( + default=None, metadata={"help": "The name of the repository to keep in sync with the local `output_dir`."} + ) + hub_strategy: Union[HubStrategy, str] = field( + default="every_save", + metadata={"help": "The hub strategy to use when `--push_to_hub` is activated."}, + ) + hub_token: Optional[str] = field(default=None, metadata={"help": "The token to use to push to the Model Hub."}) + hub_private_repo: bool = field(default=False, metadata={"help": "Whether the model repository is private or not."}) + gradient_checkpointing: bool = field( + default=False, + metadata={ + "help": "If True, use gradient checkpointing to save memory at the expense of slower backward pass." + }, + ) + include_inputs_for_metrics: bool = field( + default=False, metadata={"help": "Whether or not the inputs will be passed to the `compute_metrics` function."} + ) + # Deprecated arguments + fp16_backend: str = field( + default="auto", + metadata={ + "help": "Deprecated. Use half_precision_backend instead", + "choices": ["auto", "cuda_amp", "apex", "cpu_amp"], + }, + ) + push_to_hub_model_id: Optional[str] = field( + default=None, metadata={"help": "The name of the repository to which push the `Trainer`."} + ) + push_to_hub_organization: Optional[str] = field( + default=None, metadata={"help": "The name of the organization in with to which push the `Trainer`."} + ) + push_to_hub_token: Optional[str] = field( + default=None, metadata={"help": "The token to use to push to the Model Hub."} + ) + _n_gpu: int = field(init=False, repr=False, default=-1) + mp_parameters: str = field( + default="", + metadata={"help": "Used by the SageMaker launcher to send mp-specific args. Ignored in Trainer"}, + ) + + auto_find_batch_size: bool = field( + default=False, + metadata={ + "help": ( + "Whether to automatically decrease the batch size in half and rerun the training loop again each time" + " a CUDA Out-of-Memory was reached" + ) + }, + ) + full_determinism: bool = field( + default=False, + metadata={ + "help": ( + "Whether to call enable_full_determinism instead of set_seed for reproducibility in distributed" + " training. Important: this will negatively impact the performance, so only use it for debugging." + ) + }, + ) + torchdynamo: Optional[str] = field( + default=None, + metadata={ + "help": "This argument is deprecated, use `--torch_compile_backend` instead.", + }, + ) + ray_scope: Optional[str] = field( + default="last", + metadata={ + "help": ( + 'The scope to use when doing hyperparameter search with Ray. By default, `"last"` will be used. Ray' + " will then use the last checkpoint of all trials, compare those, and select the best one. However," + " other options are also available. 
See the Ray documentation" + " (https://docs.ray.io/en/latest/tune/api_docs/analysis.html" + "#ray.tune.ExperimentAnalysis.get_best_trial)" + " for more options." + ) + }, + ) + ddp_timeout: Optional[int] = field( + default=1800, + metadata={ + "help": "Overrides the default timeout for distributed training (value should be given in seconds)." + }, + ) + torch_compile: bool = field( + default=False, metadata={"help": "If set to `True`, the model will be wrapped in `torch.compile`."} + ) + torch_compile_backend: Optional[str] = field( + default=None, + metadata={ + "help": "Which backend to use with `torch.compile`, passing one will trigger a model compilation.", + }, + ) + torch_compile_mode: Optional[str] = field( + default=None, + metadata={ + "help": "Which mode to use with `torch.compile`, passing one will trigger a model compilation.", + }, + ) + + def __post_init__(self): + # Handle --use_env option in torch.distributed.launch (local_rank not passed as an arg then). + # This needs to happen before any call to self.device or self.n_gpu. + env_local_rank = int(os.environ.get("LOCAL_RANK", -1)) + if env_local_rank != -1 and env_local_rank != self.local_rank: + self.local_rank = env_local_rank + + # expand paths, if not os.makedirs("~/bar") will make directory + # in the current directory instead of the actual home + # see https://github.com/huggingface/transformers/issues/10628 + if self.output_dir is not None: + self.output_dir = os.path.expanduser(self.output_dir) + if self.logging_dir is None and self.output_dir is not None: + self.logging_dir = os.path.join(self.output_dir, default_logdir()) + if self.logging_dir is not None: + self.logging_dir = os.path.expanduser(self.logging_dir) + + if self.disable_tqdm is None: + self.disable_tqdm = logger.getEffectiveLevel() > logging.WARN + + if isinstance(self.evaluation_strategy, EvaluationStrategy): + warnings.warn( + "using `EvaluationStrategy` for `evaluation_strategy` is deprecated and will be removed in version 5" + " of 🤗 Transformers. Use `IntervalStrategy` instead", + FutureWarning, + ) + # Go back to the underlying string or we won't be able to instantiate `IntervalStrategy` on it. + self.evaluation_strategy = self.evaluation_strategy.value + + self.evaluation_strategy = IntervalStrategy(self.evaluation_strategy) + self.logging_strategy = IntervalStrategy(self.logging_strategy) + self.save_strategy = IntervalStrategy(self.save_strategy) + self.hub_strategy = HubStrategy(self.hub_strategy) + + self.lr_scheduler_type = SchedulerType(self.lr_scheduler_type) + if self.do_eval is False and self.evaluation_strategy != IntervalStrategy.NO: + self.do_eval = True + + # eval_steps has to be defined and non-zero, fallbacks to logging_steps if the latter is non-zero + if self.evaluation_strategy == IntervalStrategy.STEPS and (self.eval_steps is None or self.eval_steps == 0): + if self.logging_steps > 0: + logger.info(f"using `logging_steps` to initialize `eval_steps` to {self.logging_steps}") + self.eval_steps = self.logging_steps + else: + raise ValueError( + f"evaluation strategy {self.evaluation_strategy} requires either non-zero --eval_steps or" + " --logging_steps" + ) + + # logging_steps must be non-zero for logging_strategy that is other than 'no' + if self.logging_strategy == IntervalStrategy.STEPS and self.logging_steps == 0: + raise ValueError(f"logging strategy {self.logging_strategy} requires non-zero --logging_steps") + + # Sanity checks for load_best_model_at_end: we require save and eval strategies to be compatible. 
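+ # For example, with load_best_model_at_end=True and both strategies set to "steps", eval_steps=100
+ # accepts save_steps=500 (500 % 100 == 0) but rejects save_steps=250, which is not a round multiple of 100.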
+ if self.load_best_model_at_end: + if self.evaluation_strategy != self.save_strategy: + raise ValueError( + "--load_best_model_at_end requires the save and eval strategy to match, but found\n- Evaluation " + f"strategy: {self.evaluation_strategy}\n- Save strategy: {self.save_strategy}" + ) + if self.evaluation_strategy == IntervalStrategy.STEPS and self.save_steps % self.eval_steps != 0: + raise ValueError( + "--load_best_model_at_end requires the saving steps to be a round multiple of the evaluation " + f"steps, but found {self.save_steps}, which is not a round multiple of {self.eval_steps}." + ) + + safetensors_available = is_safetensors_available() + if self.save_safetensors and not safetensors_available: + raise ValueError(f"--save_safetensors={self.save_safetensors} requires safetensors to be installed!") + if not self.save_safetensors and safetensors_available: + logger.info( + f"Found safetensors installation, but --save_safetensors={self.save_safetensors}. " + f"Safetensors should be a preferred weights saving format due to security and performance reasons. " + f"If your model cannot be saved by safetensors please feel free to open an issue at " + f"https://github.com/huggingface/safetensors!" + ) + + if self.load_best_model_at_end and self.metric_for_best_model is None: + self.metric_for_best_model = "loss" + if self.greater_is_better is None and self.metric_for_best_model is not None: + self.greater_is_better = self.metric_for_best_model not in ["loss", "eval_loss"] + if self.run_name is None: + self.run_name = self.output_dir + if self.framework == "pt" and is_torch_available(): + if self.fp16_backend and self.fp16_backend != "auto": + warnings.warn( + "`fp16_backend` is deprecated and will be removed in version 5 of 🤗 Transformers. Use" + " `half_precision_backend` instead", + FutureWarning, + ) + self.half_precision_backend = self.fp16_backend + + if self.bf16 or self.bf16_full_eval: + if self.no_cuda and not is_torch_bf16_cpu_available() and not is_torch_tpu_available(): + # cpu + raise ValueError("Your setup doesn't support bf16/(cpu, tpu, neuroncore). You need torch>=1.10") + + if self.fp16 and self.bf16: + raise ValueError("At most one of fp16 and bf16 can be True, but not both") + + if self.fp16_full_eval and self.bf16_full_eval: + raise ValueError("At most one of fp16 and bf16 can be True for full eval, but not both") + + if self.bf16: + if self.half_precision_backend == "apex": + raise ValueError( + " `--half_precision_backend apex`: GPU bf16 is not supported by apex. Use" + " `--half_precision_backend cuda_amp` instead" + ) + if not (self.sharded_ddp == "" or not self.sharded_ddp): + raise ValueError("sharded_ddp is not supported with bf16") + + self.optim = OptimizerNames(self.optim) + if self.adafactor: + warnings.warn( + "`--adafactor` is deprecated and will be removed in version 5 of 🤗 Transformers. 
Use `--optim" + " adafactor` instead", + FutureWarning, + ) + self.optim = OptimizerNames.ADAFACTOR + if self.optim == OptimizerNames.ADAMW_TORCH_FUSED and is_torch_available(): + if version.parse(version.parse(torch.__version__).base_version) < version.parse("2.0.0"): + raise ValueError("--optim adamw_torch_fused requires PyTorch 2.0 or higher") + # there is a bug in fp16/AMP in pt-2.0.0 + if version.parse(version.parse(torch.__version__).base_version) == version.parse("2.0.0") and self.fp16: + raise ValueError("--optim adamw_torch_fused with --fp16 requires PyTorch>2.0") + + if ( + self.framework == "pt" + and is_torch_available() + and (self.device.type != "cuda") + and (get_xla_device_type(self.device) != "GPU") + and (get_xla_device_type(self.device) != "TPU") + and (self.device.type != "cpu") + and (self.bf16 or self.bf16_full_eval) + ): + raise ValueError( + "BF16 Mixed precision training with AMP (`--bf16`) and BF16 half precision evaluation" + " (`--bf16_full_eval`) can only be used on CUDA or CPU/TPU/NeuronCore devices." + ) + + if self.torchdynamo is not None: + warnings.warn( + "`torchdynamo` is deprecated and will be removed in version 5 of 🤗 Transformers. Use" + " `torch_compile_backend` instead", + FutureWarning, + ) + self.torch_compile_backend = self.torchdynamo + if (self.torch_compile_mode is not None or self.torch_compile_backend is not None) and not self.torch_compile: + self.torch_compile = True + if self.torch_compile and self.torch_compile_backend is None: + self.torch_compile_backend = "inductor" + if self.framework == "pt" and is_torch_available() and self.torch_compile: + if is_torch_tf32_available(): + if self.tf32 is None and not self.fp16 or self.bf16: + logger.info( + "Setting TF32 in CUDA backends to speedup torch compile, you won't see any improvement" + " otherwise." + ) + torch.backends.cuda.matmul.allow_tf32 = True + else: + logger.warning( + "The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here." + ) + if self.framework == "pt" and is_torch_available() and self.tf32 is not None: + if self.tf32: + if is_torch_tf32_available(): + torch.backends.cuda.matmul.allow_tf32 = True + else: + raise ValueError("--tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7") + else: + if is_torch_tf32_available(): + torch.backends.cuda.matmul.allow_tf32 = False + # no need to assert on else + + if self.report_to is None: + logger.info( + "The default value for the training argument `--report_to` will change in v5 (from all installed " + "integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as " + "now. You should start updating your code and make this info disappear :-)." + ) + self.report_to = "all" + if self.report_to == "all" or self.report_to == ["all"]: + # Import at runtime to avoid a circular import. 
+ from .integrations import get_available_reporting_integrations + + self.report_to = get_available_reporting_integrations() + elif self.report_to == "none" or self.report_to == ["none"]: + self.report_to = [] + elif not isinstance(self.report_to, list): + self.report_to = [self.report_to] + + if self.warmup_ratio < 0 or self.warmup_ratio > 1: + raise ValueError("warmup_ratio must lie in range [0,1]") + elif self.warmup_ratio > 0 and self.warmup_steps > 0: + logger.info( + "Both warmup_ratio and warmup_steps given, warmup_steps will override any effect of warmup_ratio" + " during training" + ) + + if isinstance(self.sharded_ddp, bool): + self.sharded_ddp = "simple" if self.sharded_ddp else "" + if isinstance(self.sharded_ddp, str): + self.sharded_ddp = [ShardedDDPOption(s) for s in self.sharded_ddp.split()] + if self.sharded_ddp == [ShardedDDPOption.OFFLOAD]: + raise ValueError( + "`--sharded_ddp offload` can't work on its own. It needs to be added to `--sharded_ddp zero_dp_2` or " + '`--sharded_ddp zero_dp_3`. For example, `--sharded_ddp "zero_dp_2 offload"`.' + ) + elif len(self.sharded_ddp) > 1 and ShardedDDPOption.SIMPLE in self.sharded_ddp: + raise ValueError("`--sharded_ddp simple` is not compatible with any other option.") + elif ShardedDDPOption.ZERO_DP_2 in self.sharded_ddp and ShardedDDPOption.ZERO_DP_3 in self.sharded_ddp: + raise ValueError("`--sharded_ddp zero_dp_2` is not compatible with `--sharded_ddp zero_dp_3`.") + + if isinstance(self.fsdp, bool): + self.fsdp = "full_shard" if self.fsdp else "" + if isinstance(self.fsdp, str): + self.fsdp = [FSDPOption(s) for s in self.fsdp.split()] + if self.fsdp == [FSDPOption.OFFLOAD]: + raise ValueError( + "`--fsdp offload` can't work on its own. It needs to be added to `--fsdp full_shard` or " + '`--fsdp shard_grad_op`. For example, `--fsdp "full_shard offload"`.' + ) + elif FSDPOption.FULL_SHARD in self.fsdp and FSDPOption.SHARD_GRAD_OP in self.fsdp: + raise ValueError("`--fsdp full_shard` is not compatible with `--fsdp shard_grad_op`.") + + if self.fsdp_config is None: + self.fsdp_config = {} + + if isinstance(self.fsdp_config, str): + with io.open(self.fsdp_config, "r", encoding="utf-8") as f: + self.fsdp_config = json.load(f) + + if self.fsdp_min_num_params > 0: + warnings.warn("using `--fsdp_min_num_params` is deprecated. Use fsdp_config instead ", FutureWarning) + + self.fsdp_config["fsdp_min_num_params"] = max( + self.fsdp_config.get("fsdp_min_num_params", 0), self.fsdp_min_num_params + ) + + # if fsdp_config["fsdp_transformer_layer_cls_to_wrap"] is specified as a string, convert it to a list with a single object + if isinstance(self.fsdp_config.get("fsdp_transformer_layer_cls_to_wrap", None), str): + self.fsdp_config["fsdp_transformer_layer_cls_to_wrap"] = [ + self.fsdp_config["fsdp_transformer_layer_cls_to_wrap"] + ] + + if self.fsdp_transformer_layer_cls_to_wrap is not None: + warnings.warn( + "using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. 
Use fsdp_config instead ", FutureWarning + ) + self.fsdp_config["fsdp_transformer_layer_cls_to_wrap"] = self.fsdp_config.get( + "fsdp_transformer_layer_cls_to_wrap", [] + ) + [self.fsdp_transformer_layer_cls_to_wrap] + + if len(self.fsdp) == 0 and self.fsdp_config["fsdp_min_num_params"] > 0: + warnings.warn("`--fsdp_min_num_params` is useful only when `--fsdp` is specified.") + + if len(self.fsdp) == 0 and self.fsdp_config.get("fsdp_transformer_layer_cls_to_wrap", None) is not None: + warnings.warn("`--fsdp_transformer_layer_cls_to_wrap` is useful only when `--fsdp` is specified.") + + if ( + len(self.fsdp) > 0 + and self.fsdp_config["fsdp_min_num_params"] > 0 + and self.fsdp_config.get("fsdp_transformer_layer_cls_to_wrap", None) is not None + ): + raise ValueError( + "`--fsdp_min_num_params` and `--fsdp_transformer_layer_cls_to_wrap` are mutually exclusive." + ) + self.fsdp_config["xla"] = self.fsdp_config.get("xla", False) + self.fsdp_config["xla_fsdp_grad_ckpt"] = self.fsdp_config.get("xla_fsdp_grad_ckpt", False) + if self.fsdp_config["xla"]: + if len(self.fsdp) > 0: + # store XLA fsdp configuration parameters into a dictionary + self.xla_fsdp_config = self.fsdp_config.get("xla_fsdp_settings", {}) + # apply appropriate string to torch.dtype conversions for parameters + if "compute_dtype" in self.xla_fsdp_config: + self.xla_fsdp_config["compute_dtype"] = getattr(torch, self.xla_fsdp_config["compute_dtype"]) + if "buffer_dtype" in self.xla_fsdp_config: + self.xla_fsdp_config["buffer_dtype"] = getattr(torch, self.xla_fsdp_config["buffer_dtype"]) + else: + warnings.warn("XLA FSDP can be used only when `--fsdp` is specified.") + else: + if self.fsdp_config["xla_fsdp_grad_ckpt"]: + warnings.warn("`--xla_fsdp_grad_ckpt` is useful only when `--xla` is set to true.") + + if self.tpu_metrics_debug: + warnings.warn( + "using `--tpu_metrics_debug` is deprecated and will be removed in version 5 of 🤗 Transformers. Use" + " `--debug tpu_metrics_debug` instead", + FutureWarning, + ) + self.debug += " tpu_metrics_debug" + self.tpu_metrics_debug = False + if isinstance(self.debug, str): + self.debug = [DebugOption(s) for s in self.debug.split()] + + if self.deepspeed: + # - must be run very last in arg parsing, since it will use a lot of these settings. + # - must be run before the model is created. + if not is_accelerate_available(): + raise ValueError("--deepspeed requires Accelerate to be installed: `pip install accelerate`.") + from transformers.deepspeed import HfTrainerDeepSpeedConfig + + # will be used later by the Trainer + # note: leave self.deepspeed unmodified in case a user relies on it not to be modified) + self.hf_deepspeed_config = HfTrainerDeepSpeedConfig(self.deepspeed) + self.hf_deepspeed_config.trainer_config_process(self) + + if self.push_to_hub_token is not None: + warnings.warn( + "`--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use " + "`--hub_token` instead.", + FutureWarning, + ) + self.hub_token = self.push_to_hub_token + + if self.push_to_hub_model_id is not None: + self.hub_model_id = get_full_repo_name( + self.push_to_hub_model_id, organization=self.push_to_hub_organization, token=self.hub_token + ) + if self.push_to_hub_organization is not None: + warnings.warn( + "`--push_to_hub_model_id` and `--push_to_hub_organization` are deprecated and will be removed in " + "version 5 of 🤗 Transformers. 
Use `--hub_model_id` instead and pass the full repo name to this " + f"argument (in this case {self.hub_model_id}).", + FutureWarning, + ) + else: + warnings.warn( + "`--push_to_hub_model_id` is deprecated and will be removed in version 5 of 🤗 Transformers. Use " + "`--hub_model_id` instead and pass the full repo name to this argument (in this case " + f"{self.hub_model_id}).", + FutureWarning, + ) + elif self.push_to_hub_organization is not None: + self.hub_model_id = f"{self.push_to_hub_organization}/{Path(self.output_dir).name}" + warnings.warn( + "`--push_to_hub_organization` is deprecated and will be removed in version 5 of 🤗 Transformers. Use " + "`--hub_model_id` instead and pass the full repo name to this argument (in this case " + f"{self.hub_model_id}).", + FutureWarning, + ) + + def __str__(self): + self_as_dict = asdict(self) + + # Remove deprecated arguments. That code should be removed once + # those deprecated arguments are removed from TrainingArguments. (TODO: v5) + del self_as_dict["per_gpu_train_batch_size"] + del self_as_dict["per_gpu_eval_batch_size"] + + self_as_dict = {k: f"<{k.upper()}>" if k.endswith("_token") else v for k, v in self_as_dict.items()} + + attrs_as_str = [f"{k}={v},\n" for k, v in sorted(self_as_dict.items())] + return f"{self.__class__.__name__}(\n{''.join(attrs_as_str)})" + + __repr__ = __str__ + + @property + def train_batch_size(self) -> int: + """ + The actual batch size for training (may differ from `per_gpu_train_batch_size` in distributed training). + """ + if self.per_gpu_train_batch_size: + logger.warning( + "Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future " + "version. Using `--per_device_train_batch_size` is preferred." + ) + per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size + train_batch_size = per_device_batch_size * max(1, self.n_gpu) + return train_batch_size + + @property + def eval_batch_size(self) -> int: + """ + The actual batch size for evaluation (may differ from `per_gpu_eval_batch_size` in distributed training). + """ + if self.per_gpu_eval_batch_size: + logger.warning( + "Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future " + "version. Using `--per_device_eval_batch_size` is preferred." + ) + per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size + eval_batch_size = per_device_batch_size * max(1, self.n_gpu) + return eval_batch_size + + @property + def ddp_timeout_delta(self) -> timedelta: + """ + The actual timeout for torch.distributed.init_process_group since it expects a timedelta variable. + """ + return timedelta(seconds=self.ddp_timeout) + + @cached_property + def _setup_devices(self) -> "torch.device": + requires_backends(self, ["torch"]) + logger.info("PyTorch: setting up devices") + if torch.distributed.is_available() and torch.distributed.is_initialized() and self.local_rank == -1: + logger.warning( + "torch.distributed process group is initialized, but local_rank == -1. 
" + "In order to use Torch DDP, launch your script with `python -m torch.distributed.launch" + ) + if self.no_cuda: + device = torch.device("cpu") + self._n_gpu = 0 + self.local_rank = get_int_from_env( + ["LOCAL_RANK", "MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK", "MV2_COMM_WORLD_LOCAL_RANK"], + self.local_rank, + ) + if self.local_rank != -1 and not torch.distributed.is_initialized(): + # Initializes distributed backend for cpu + if self.xpu_backend not in ("mpi", "ccl", "gloo"): + raise ValueError( + "CPU distributed training backend is not properly set. " + "Please set '--xpu_backend' to either 'mpi' or 'ccl' or 'gloo'." + ) + if self.xpu_backend == "ccl": + requires_backends(self, "oneccl_bind_pt") + if ccl_version >= "1.12": + import oneccl_bindings_for_pytorch # noqa: F401 + else: + import torch_ccl # noqa: F401 + if int(os.environ.get("CCL_WORKER_COUNT", 0)) < 1: + raise ValueError( + "CPU distributed training backend is ccl. but CCL_WORKER_COUNT is not correctly set. " + "Please use like 'export CCL_WORKER_COUNT = 1' to set." + ) + + # Try to get launch configuration from environment variables set by MPI launcher - works for Intel MPI, OpenMPI and MVAPICH + rank = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK", "MV2_COMM_WORLD_RANK"], 0) + size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE"], 1) + local_size = get_int_from_env( + ["MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1 + ) + os.environ["RANK"] = str(rank) + os.environ["WORLD_SIZE"] = str(size) + os.environ["LOCAL_RANK"] = str(self.local_rank) + if not os.environ.get("MASTER_PORT", None): + os.environ["MASTER_PORT"] = "29500" + if not os.environ.get("MASTER_ADDR", None): + if local_size != size or self.xpu_backend != "mpi": + raise ValueError( + "Looks like distributed multinode run but MASTER_ADDR env not set, " + "please try exporting rank 0's hostname as MASTER_ADDR" + ) + if ( + torch.get_num_threads() == 1 + and get_int_from_env(["OMP_NUM_THREADS", "MKL_NUM_THREADS"], 0) == 0 + and is_psutil_available() + ): + import psutil + + num_cpu_threads_per_process = int(psutil.cpu_count(logical=False) / local_size) + if num_cpu_threads_per_process == 0: + num_cpu_threads_per_process = 1 + torch.set_num_threads(num_cpu_threads_per_process) + logger.info( + f"num_cpu_threads_per_process unset, we set it at {num_cpu_threads_per_process} to improve oob" + " performance." 
+ ) + torch.distributed.init_process_group( + backend=self.xpu_backend, rank=rank, world_size=size, timeout=self.ddp_timeout_delta + ) + elif is_torch_tpu_available(): + device = xm.xla_device() + self._n_gpu = 0 + elif is_sagemaker_mp_enabled(): + local_rank = smp.local_rank() + device = torch.device("cuda", local_rank) + self._n_gpu = 1 + elif is_sagemaker_dp_enabled(): + import smdistributed.dataparallel.torch.torch_smddp # noqa: F401 + + dist.init_process_group(backend="smddp", timeout=self.ddp_timeout_delta) + self.local_rank = int(os.getenv("SMDATAPARALLEL_LOCAL_RANK")) + device = torch.device("cuda", self.local_rank) + self._n_gpu = 1 + elif self.deepspeed: + # deepspeed inits torch.distributed internally + from .deepspeed import is_deepspeed_available + + if not is_deepspeed_available(): + raise ImportError("--deepspeed requires deepspeed: `pip install deepspeed`.") + import deepspeed + + deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout)) + + # workaround for setups like notebooks where the launcher can't be used, + # but deepspeed requires a dist env. + # env LOCAL_RANK could be set manually by the user, or via init_distributed if mpi4py is installed + self.local_rank = int(os.environ.get("LOCAL_RANK", "-1")) + + device = torch.device("cuda", self.local_rank) + self._n_gpu = 1 + elif self.local_rank == -1: + if self.use_mps_device: + if not torch.backends.mps.is_available(): + if not torch.backends.mps.is_built(): + raise AssertionError( + "MPS not available because the current PyTorch install was not " + "built with MPS enabled. Please install torch version >=1.12.0 on " + "your Apple silicon Mac running macOS 12.3 or later with a native " + "version (arm64) of Python" + ) + else: + raise AssertionError( + "MPS not available because the current MacOS version is not 12.3+ " + "and/or you do not have an MPS-enabled device on this machine." + ) + else: + if not version.parse(version.parse(torch.__version__).base_version) > version.parse("1.12.0"): + warnings.warn( + "We strongly recommend to install PyTorch >= 1.13 (nightly version at the time of writing)" + " on your MacOS machine. It has major fixes related to model correctness and performance" + " improvements for transformer based models. Please refer to" + " https://github.com/pytorch/pytorch/issues/82707 for more details." + ) + device = torch.device("mps") + self._n_gpu = 1 + + else: + # if n_gpu is > 1 we'll use nn.DataParallel. + # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0` + # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will + # trigger an error that a device index is missing. Index 0 takes into account the + # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0` + # will use the first GPU in that env, i.e. GPU#1 + device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + # Sometimes the line in the postinit has not been run before we end up here, so just checking we're not at + # the default value. + self._n_gpu = torch.cuda.device_count() + else: + # Here, we'll use torch.distributed. 
+ # Initializes the distributed backend which will take care of synchronizing nodes/GPUs + if not torch.distributed.is_initialized(): + if self.xpu_backend and self.xpu_backend in ("mpi", "gloo"): + torch.distributed.init_process_group(backend=self.xpu_backend, timeout=self.ddp_timeout_delta) + else: + torch.distributed.init_process_group(backend="nccl", timeout=self.ddp_timeout_delta) + device = torch.device("cuda", self.local_rank) + self._n_gpu = 1 + + if device.type == "cuda": + torch.cuda.set_device(device) + + return device + + @property + def device(self) -> "torch.device": + """ + The device used by this process. + """ + requires_backends(self, ["torch"]) + return self._setup_devices + + @property + def n_gpu(self): + """ + The number of GPUs used by this process. + + Note: + This will only be greater than one when you have multiple GPUs available but are not using distributed + training. For distributed training, it will always be 1. + """ + requires_backends(self, ["torch"]) + # Make sure `self._n_gpu` is properly setup. + _ = self._setup_devices + return self._n_gpu + + @property + def parallel_mode(self): + """ + The current mode used for parallelism if multiple GPUs/TPU cores are available. One of: + + - `ParallelMode.NOT_PARALLEL`: no parallelism (CPU or one GPU). + - `ParallelMode.NOT_DISTRIBUTED`: several GPUs in one single process (uses `torch.nn.DataParallel`). + - `ParallelMode.DISTRIBUTED`: several GPUs, each having its own process (uses + `torch.nn.DistributedDataParallel`). + - `ParallelMode.TPU`: several TPU cores. + """ + requires_backends(self, ["torch"]) + if is_torch_tpu_available(): + return ParallelMode.TPU + elif is_sagemaker_mp_enabled(): + return ParallelMode.SAGEMAKER_MODEL_PARALLEL + elif is_sagemaker_dp_enabled(): + return ParallelMode.SAGEMAKER_DATA_PARALLEL + elif self.local_rank != -1: + return ParallelMode.DISTRIBUTED + elif self.n_gpu > 1: + return ParallelMode.NOT_DISTRIBUTED + else: + return ParallelMode.NOT_PARALLEL + + @property + def world_size(self): + """ + The number of processes used in parallel. + """ + requires_backends(self, ["torch"]) + + if is_torch_tpu_available(): + return xm.xrt_world_size() + elif is_sagemaker_mp_enabled(): + return smp.dp_size() if not smp.state.cfg.prescaled_batch else smp.rdp_size() + elif is_sagemaker_dp_enabled(): + return dist.get_world_size() + elif self.local_rank != -1: + return torch.distributed.get_world_size() + return 1 + + @property + def process_index(self): + """ + The index of the current process used. + """ + requires_backends(self, ["torch"]) + if is_torch_tpu_available(): + return xm.get_ordinal() + elif is_sagemaker_mp_enabled(): + return smp.dp_rank() if not smp.state.cfg.prescaled_batch else smp.rdp_rank() + elif is_sagemaker_dp_enabled(): + return dist.get_rank() + elif self.local_rank != -1: + return torch.distributed.get_rank() + return 0 + + @property + def local_process_index(self): + """ + The index of the local process used. + """ + requires_backends(self, ["torch"]) + if is_torch_tpu_available(): + return xm.get_local_ordinal() + elif is_sagemaker_mp_enabled(): + return smp.local_rank() + elif is_sagemaker_dp_enabled(): + return dist.get_rank() + elif self.local_rank != -1: + return self.local_rank + return 0 + + @property + def should_log(self): + """ + Whether or not the current process should produce log. 
+ """ + if self.log_on_each_node: + return self.local_process_index == 0 + else: + if is_sagemaker_mp_enabled(): + return smp.rank() == 0 + else: + return self.process_index == 0 + + @property + def should_save(self): + """ + Whether or not the current process should write to disk, e.g., to save models and checkpoints. + """ + if self.save_on_each_node: + return self.local_process_index == 0 + else: + if is_sagemaker_mp_enabled(): + return smp.rank() == 0 + else: + return self.process_index == 0 + + def get_process_log_level(self): + """ + Returns the log level to be used depending on whether this process is the main process of node 0, main process + of node non-0, or a non-main process. + + For the main process the log level defaults to the logging level set (`logging.WARNING` if you didn't do + anything) unless overridden by `log_level` argument. + + For the replica processes the log level defaults to `logging.WARNING` unless overridden by `log_level_replica` + argument. + + The choice between the main and replica process settings is made according to the return value of `should_log`. + """ + + # convert to int + log_level = trainer_log_levels[self.log_level] + log_level_replica = trainer_log_levels[self.log_level_replica] + + log_level_main_node = logging.get_verbosity() if log_level == -1 else log_level + log_level_replica_node = logging.get_verbosity() if log_level_replica == -1 else log_level_replica + return log_level_main_node if self.should_log else log_level_replica_node + + @property + def place_model_on_device(self): + """ + Can be subclassed and overridden for some specific integrations. + """ + return not is_sagemaker_mp_enabled() + + @property + def _no_sync_in_gradient_accumulation(self): + """ + Whether or not to use no_sync for the gradients when doing gradient accumulation. + """ + return not ( + self.deepspeed or is_sagemaker_dp_enabled() or is_sagemaker_mp_enabled() or is_torch_neuroncore_available() + ) + + @contextlib.contextmanager + def main_process_first(self, local=True, desc="work"): + """ + A context manager for torch distributed environment where on needs to do something on the main process, while + blocking replicas, and when it's finished releasing the replicas. + + One such use is for `datasets`'s `map` feature which to be efficient should be run once on the main process, + which upon completion saves a cached version of results and which then automatically gets loaded by the + replicas. + + Args: + local (`bool`, *optional*, defaults to `True`): + if `True` first means process of rank 0 of each node if `False` first means process of rank 0 of node + rank 0 In multi-node environment with a shared filesystem you most likely will want to use + `local=False` so that only the main process of the first node will do the processing. If however, the + filesystem is not shared, then the main process of each node will need to do the processing, which is + the default behavior. 
+ desc (`str`, *optional*, defaults to `"work"`): + a work description to be used in debug logs + + """ + if is_torch_available() and self.world_size > 1: + main_process_desc = "main process" + if local: + is_main_process = self.local_process_index == 0 + main_process_desc = "main local process" + elif is_sagemaker_mp_enabled(): + is_main_process = smp.rank() == 0 + else: + is_main_process = self.process_index == 0 + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{self.process_index}: waiting for the {main_process_desc} to perform {desc}") + if is_torch_tpu_available(): + xm.rendezvous(desc) + elif is_sagemaker_dp_enabled(): + dist.barrier() + else: + torch.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{self.process_index}: {main_process_desc} completed {desc}, releasing all replicas") + if is_torch_tpu_available(): + xm.rendezvous(desc) + elif is_sagemaker_dp_enabled(): + dist.barrier() + else: + torch.distributed.barrier() + else: + yield + + def get_warmup_steps(self, num_training_steps: int): + """ + Get number of steps used for a linear warmup. + """ + warmup_steps = ( + self.warmup_steps if self.warmup_steps > 0 else math.ceil(num_training_steps * self.warmup_ratio) + ) + return warmup_steps + + def to_dict(self): + """ + Serializes this instance while replace `Enum` by their values (for JSON serialization support). It obfuscates + the token values by removing their value. + """ + # filter out fields that are defined as field(init=False) + d = {field.name: getattr(self, field.name) for field in fields(self) if field.init} + + for k, v in d.items(): + if isinstance(v, Enum): + d[k] = v.value + if isinstance(v, list) and len(v) > 0 and isinstance(v[0], Enum): + d[k] = [x.value for x in v] + if k.endswith("_token"): + d[k] = f"<{k.upper()}>" + return d + + def to_json_string(self): + """ + Serializes this instance to a JSON string. + """ + return json.dumps(self.to_dict(), indent=2) + + def to_sanitized_dict(self) -> Dict[str, Any]: + """ + Sanitized serialization to use with TensorBoard’s hparams + """ + d = self.to_dict() + d = {**d, **{"train_batch_size": self.train_batch_size, "eval_batch_size": self.eval_batch_size}} + + valid_types = [bool, int, float, str] + if is_torch_available(): + valid_types.append(torch.Tensor) + + return {k: v if type(v) in valid_types else str(v) for k, v in d.items()} + + # The following methods are there to simplify the instantiation of `TrainingArguments` + def set_training( + self, + learning_rate: float = 5e-5, + batch_size: int = 8, + weight_decay: float = 0, + num_epochs: float = 3, + max_steps: int = -1, + gradient_accumulation_steps: int = 1, + seed: int = 42, + gradient_checkpointing: bool = False, + ): + """ + A method that regroups all basic arguments linked to the training. + + + + Calling this method will automatically set `self.do_train` to `True`. + + + + Args: + learning_rate (`float`, *optional*, defaults to 5e-5): + The initial learning rate for the optimizer. + batch_size (`int` *optional*, defaults to 8): + The batch size per device (GPU/TPU core/CPU...) used for training. + weight_decay (`float`, *optional*, defaults to 0): + The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the + optimizer. + num_train_epochs(`float`, *optional*, defaults to 3.0): + Total number of training epochs to perform (if not an integer, will perform the decimal part percents + of the last epoch before stopping training). 
+ max_steps (`int`, *optional*, defaults to -1): + If set to a positive number, the total number of training steps to perform. Overrides + `num_train_epochs`. In case of using a finite iterable dataset the training may stop before reaching + the set number of steps when all data is exhausted. + gradient_accumulation_steps (`int`, *optional*, defaults to 1): + Number of updates steps to accumulate the gradients for, before performing a backward/update pass. + + + + When using gradient accumulation, one step is counted as one step with backward pass. Therefore, + logging, evaluation, save will be conducted every `gradient_accumulation_steps * xxx_step` training + examples. + + + + seed (`int`, *optional*, defaults to 42): + Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use + the [`~Trainer.model_init`] function to instantiate the model if it has some randomly initialized + parameters. + gradient_checkpointing (`bool`, *optional*, defaults to `False`): + If True, use gradient checkpointing to save memory at the expense of slower backward pass. + + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_training(learning_rate=1e-4, batch_size=32) + >>> args.learning_rate + 1e-4 + ``` + """ + self.do_train = True + self.learning_rate = learning_rate + self.per_device_train_batch_size = batch_size + self.weight_decay = weight_decay + self.num_train_epochs = num_epochs + self.max_steps = max_steps + self.gradient_accumulation_steps = gradient_accumulation_steps + self.seed = seed + self.gradient_checkpointing = gradient_checkpointing + return self + + def set_evaluate( + self, + strategy: Union[str, IntervalStrategy] = "no", + steps: int = 500, + batch_size: int = 8, + accumulation_steps: Optional[int] = None, + delay: Optional[float] = None, + loss_only: bool = False, + jit_mode: bool = False, + ): + """ + A method that regroups all arguments linked to the evaluation. + + Args: + strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"no"`): + The evaluation strategy to adopt during training. Possible values are: + + - `"no"`: No evaluation is done during training. + - `"steps"`: Evaluation is done (and logged) every `steps`. + - `"epoch"`: Evaluation is done at the end of each epoch. + + Setting a `strategy` different from `"no"` will set `self.do_eval` to `True`. + steps (`int`, *optional*, defaults to 500): + Number of update steps between two evaluations if `strategy="steps"`. + batch_size (`int` *optional*, defaults to 8): + The batch size per device (GPU/TPU core/CPU...) used for evaluation. + accumulation_steps (`int`, *optional*): + Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. + If left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster + but requires more memory). + delay (`float`, *optional*): + Number of epochs or steps to wait for before the first evaluation can be performed, depending on the + evaluation_strategy. + loss_only (`bool`, *optional*, defaults to `False`): + Ignores all outputs except the loss. + jit_mode (`bool`, *optional*): + Whether or not to use PyTorch jit trace for inference. 
+ + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_evaluate(strategy="steps", steps=100) + >>> args.eval_steps + 100 + ``` + """ + self.evaluation_strategy = IntervalStrategy(strategy) + if self.evaluation_strategy == IntervalStrategy.STEPS and steps == 0: + raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.") + self.do_eval = self.evaluation_strategy != IntervalStrategy.NO + self.eval_steps = steps + self.per_device_eval_batch_size = batch_size + self.eval_accumulation_steps = accumulation_steps + self.eval_delay = delay + self.prediction_loss_only = loss_only + self.jit_mode_eval = jit_mode + return self + + def set_testing( + self, + batch_size: int = 8, + loss_only: bool = False, + jit_mode: bool = False, + ): + """ + A method that regroups all basic arguments linked to testing on a held-out dataset. + + + + Calling this method will automatically set `self.do_predict` to `True`. + + + + Args: + batch_size (`int` *optional*, defaults to 8): + The batch size per device (GPU/TPU core/CPU...) used for testing. + loss_only (`bool`, *optional*, defaults to `False`): + Ignores all outputs except the loss. + jit_mode (`bool`, *optional*): + Whether or not to use PyTorch jit trace for inference. + + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_testing(batch_size=32) + >>> args.per_device_eval_batch_size + 32 + ``` + """ + self.do_predict = True + self.per_device_eval_batch_size = batch_size + self.prediction_loss_only = loss_only + self.jit_mode_eval = jit_mode + return self + + def set_save( + self, + strategy: Union[str, IntervalStrategy] = "steps", + steps: int = 500, + total_limit: Optional[int] = None, + on_each_node: bool = False, + ): + """ + A method that regroups all arguments linked to the evaluation. + + Args: + strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`): + The checkpoint save strategy to adopt during training. Possible values are: + + - `"no"`: No save is done during training. + - `"epoch"`: Save is done at the end of each epoch. + - `"steps"`: Save is done every `save_steps`. + + steps (`int`, *optional*, defaults to 500): + Number of updates steps before two checkpoint saves if `strategy="steps"`. + total_limit (`int`, *optional*): + If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in + `output_dir`. + on_each_node (`bool`, *optional*, defaults to `False`): + When doing multi-node distributed training, whether to save models and checkpoints on each node, or + only on the main one. + + This should not be activated when the different nodes use the same storage as the files will be saved + with the same names for each node. 
+ + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_save(strategy="steps", steps=100) + >>> args.save_steps + 100 + ``` + """ + self.save_strategy = IntervalStrategy(strategy) + if self.save_strategy == IntervalStrategy.STEPS and steps == 0: + raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.") + self.save_steps = steps + self.save_total_limit = total_limit + self.save_on_each_node = on_each_node + return self + + def set_logging( + self, + strategy: Union[str, IntervalStrategy] = "steps", + steps: int = 500, + report_to: Union[str, List[str]] = "none", + level: str = "passive", + first_step: bool = False, + nan_inf_filter: bool = False, + on_each_node: bool = False, + replica_level: str = "passive", + ): + """ + A method that regroups all arguments linked to the evaluation. + + Args: + strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`): + The logging strategy to adopt during training. Possible values are: + + - `"no"`: No save is done during training. + - `"epoch"`: Save is done at the end of each epoch. + - `"steps"`: Save is done every `save_steps`. + + steps (`int`, *optional*, defaults to 500): + Number of update steps between two logs if `strategy="steps"`. + level (`str`, *optional*, defaults to `"passive"`): + Logger log level to use on the main process. Possible choices are the log levels as strings: `"debug"`, + `"info"`, `"warning"`, `"error"` and `"critical"`, plus a `"passive"` level which doesn't set anything + and lets the application set the level. + report_to (`str` or `List[str]`, *optional*, defaults to `"none"`): + The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`, + `"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. Use `"all"` to report + to all integrations installed, `"none"` for no integrations. + first_step (`bool`, *optional*, defaults to `False`): + Whether to log and evaluate the first `global_step` or not. + nan_inf_filter (`bool`, *optional*, defaults to `True`): + Whether to filter `nan` and `inf` losses for logging. If set to `True` the loss of every step that is + `nan` or `inf` is filtered and the average loss of the current logging window is taken instead. + + + + `nan_inf_filter` only influences the logging of loss values, it does not change the behavior the + gradient is computed or applied to the model. + + + + on_each_node (`bool`, *optional*, defaults to `True`): + In multinode distributed training, whether to log using `log_level` once per node, or only on the main + node. + replica_level (`str`, *optional*, defaults to `"passive"`): + Logger log level to use on replicas. 
Same choices as `log_level` + + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_logging(strategy="steps", steps=100) + >>> args.logging_steps + 100 + ``` + """ + self.logging_strategy = IntervalStrategy(strategy) + if self.logging_strategy == IntervalStrategy.STEPS and steps == 0: + raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.") + self.logging_steps = steps + self.report_to = report_to + self.log_level = level + self.logging_first_step = first_step + self.logging_nan_inf_filter = nan_inf_filter + self.log_on_each_node = on_each_node + self.log_level_replica = replica_level + return self + + def set_push_to_hub( + self, + model_id: str, + strategy: Union[str, HubStrategy] = "every_save", + token: Optional[str] = None, + private_repo: bool = False, + ): + """ + A method that regroups all arguments linked to synchronizing checkpoints with the Hub. + + + + Calling this method will set `self.push_to_hub` to `True`, which means the `output_dir` will begin a git + directory synced with the repo (determined by `model_id`) and the content will be pushed each time a save is + triggered (depending on`self.save_strategy`). Calling [`~Trainer.save_model`] will also trigger a push. + + + + Args: + model_id (`str`): + The name of the repository to keep in sync with the local *output_dir*. It can be a simple model ID in + which case the model will be pushed in your namespace. Otherwise it should be the whole repository + name, for instance `"user_name/model"`, which allows you to push to an organization you are a member of + with `"organization_name/model"`. + strategy (`str` or [`~trainer_utils.HubStrategy`], *optional*, defaults to `"every_save"`): + Defines the scope of what is pushed to the Hub and when. Possible values are: + + - `"end"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`]) and a + draft of a model card when the [`~Trainer.save_model`] method is called. + - `"every_save"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`]) + and + a draft of a model card each time there is a model save. The pushes are asynchronous to not block + training, and in case the save are very frequent, a new push is only attempted if the previous one is + finished. A last push is made with the final model at the end of training. + - `"checkpoint"`: like `"every_save"` but the latest checkpoint is also pushed in a subfolder named + last-checkpoint, allowing you to resume training easily with + `trainer.train(resume_from_checkpoint="last-checkpoint")`. + - `"all_checkpoints"`: like `"checkpoint"` but all checkpoints are pushed like they appear in the + output + folder (so you will get one checkpoint folder per folder in your final repository) + + token (`str`, *optional*): + The token to use to push the model to the Hub. Will default to the token in the cache folder obtained + with `huggingface-cli login`. + private_repo (`bool`, *optional*, defaults to `False`): + If True, the Hub repo will be set to private. 
+ + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_push_to_hub("me/awesome-model") + >>> args.hub_model_id + 'me/awesome-model' + ``` + """ + self.push_to_hub = True + self.hub_model_id = model_id + self.hub_strategy = HubStrategy(strategy) + self.hub_token = token + self.hub_private_repo = private_repo + return self + + def set_optimizer( + self, + name: Union[str, OptimizerNames] = "adamw_hf", + learning_rate: float = 5e-5, + weight_decay: float = 0, + beta1: float = 0.9, + beta2: float = 0.999, + epsilon: float = 1e-8, + args: Optional[str] = None, + ): + """ + A method that regroups all arguments linked to the optimizer and its hyperparameters. + + Args: + name (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_hf"`): + The optimizer to use: `"adamw_hf"`, `"adamw_torch"`, `"adamw_torch_fused"`, `"adamw_apex_fused"`, + `"adamw_anyprecision"` or `"adafactor"`. + learning_rate (`float`, *optional*, defaults to 5e-5): + The initial learning rate. + weight_decay (`float`, *optional*, defaults to 0): + The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights. + beta1 (`float`, *optional*, defaults to 0.9): + The beta1 hyperparameter for the adam optimizer or its variants. + beta2 (`float`, *optional*, defaults to 0.999): + The beta2 hyperparameter for the adam optimizer or its variants. + epsilon (`float`, *optional*, defaults to 1e-8): + The epsilon hyperparameter for the adam optimizer or its variants. + args (`str`, *optional*): + Optional arguments that are supplied to AnyPrecisionAdamW (only useful when + `optim="adamw_anyprecision"`). + + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_optimizer(name="adamw_torch", beta1=0.8) + >>> args.optim + 'adamw_torch' + ``` + """ + self.optim = OptimizerNames(name) + self.learning_rate = learning_rate + self.weight_decay = weight_decay + self.adam_beta1 = beta1 + self.adam_beta2 = beta2 + self.adam_epsilon = epsilon + self.optim_args = args + return self + + def set_lr_scheduler( + self, + name: Union[str, SchedulerType] = "linear", + num_epochs: float = 3.0, + max_steps: int = -1, + warmup_ratio: float = 0, + warmup_steps: int = 0, + ): + """ + A method that regroups all arguments linked to the learning rate scheduler and its hyperparameters. + + Args: + name (`str` or [`SchedulerType`], *optional*, defaults to `"linear"`): + The scheduler type to use. See the documentation of [`SchedulerType`] for all possible values. + num_epochs(`float`, *optional*, defaults to 3.0): + Total number of training epochs to perform (if not an integer, will perform the decimal part percents + of the last epoch before stopping training). + max_steps (`int`, *optional*, defaults to -1): + If set to a positive number, the total number of training steps to perform. Overrides + `num_train_epochs`. In case of using a finite iterable dataset the training may stop before reaching + the set number of steps when all data is exhausted. + warmup_ratio (`float`, *optional*, defaults to 0.0): + Ratio of total training steps used for a linear warmup from 0 to `learning_rate`. + warmup_steps (`int`, *optional*, defaults to 0): + Number of steps used for a linear warmup from 0 to `learning_rate`. Overrides any effect of + `warmup_ratio`. 
+ + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05) + >>> args.warmup_ratio + 0.05 + ``` + """ + self.lr_scheduler_type = SchedulerType(name) + self.num_train_epochs = num_epochs + self.max_steps = max_steps + self.warmup_ratio = warmup_ratio + self.warmup_steps = warmup_steps + return self + + def set_dataloader( + self, + train_batch_size: int = 8, + eval_batch_size: int = 8, + drop_last: bool = False, + num_workers: int = 0, + pin_memory: bool = True, + auto_find_batch_size: bool = False, + ignore_data_skip: bool = False, + sampler_seed: Optional[int] = None, + ): + """ + A method that regroups all arguments linked to the dataloaders creation. + + Args: + drop_last (`bool`, *optional*, defaults to `False`): + Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch + size) or not. + num_workers (`int`, *optional*, defaults to 0): + Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in + the main process. + pin_memory (`bool`, *optional*, defaults to `True`): + Whether you want to pin memory in data loaders or not. Will default to `True`. + auto_find_batch_size (`bool`, *optional*, defaults to `False`) + Whether to find a batch size that will fit into memory automatically through exponential decay, + avoiding CUDA Out-of-Memory errors. Requires accelerate to be installed (`pip install accelerate`) + ignore_data_skip (`bool`, *optional*, defaults to `False`): + When resuming training, whether or not to skip the epochs and batches to get the data loading at the + same stage as in the previous training. If set to `True`, the training will begin faster (as that + skipping step can take a long time) but will not yield the same results as the interrupted training + would have. + sampler_seed (`int`, *optional*): + Random seed to be used with data samplers. If not set, random generators for data sampling will use the + same seed as `self.seed`. This can be used to ensure reproducibility of data sampling, independent of + the model seed. 
+ + Example: + + ```py + >>> from transformers import TrainingArguments + + >>> args = TrainingArguments("working_dir") + >>> args = args.set_dataloader(train_batch_size=16, eval_batch_size=64) + >>> args.per_device_train_batch_size + 16 + ``` + """ + self.per_device_train_batch_size = train_batch_size + self.per_device_eval_batch_size = eval_batch_size + self.dataloader_drop_last = drop_last + self.dataloader_num_workers = num_workers + self.dataloader_pin_memory = pin_memory + self.auto_find_batch_size = auto_find_batch_size + self.ignore_data_skip = ignore_data_skip + self.data_seed = sampler_seed + return self + + +class ParallelMode(Enum): + NOT_PARALLEL = "not_parallel" + NOT_DISTRIBUTED = "not_distributed" + DISTRIBUTED = "distributed" + SAGEMAKER_MODEL_PARALLEL = "sagemaker_model_parallel" + SAGEMAKER_DATA_PARALLEL = "sagemaker_data_parallel" + TPU = "tpu" -- Gitee From 807b3b85e63d88933f28499a79ed393789127f85 Mon Sep 17 00:00:00 2001 From: xinyanhe Date: Wed, 15 Nov 2023 16:51:31 +0800 Subject: [PATCH 2/5] [built-in][PyTorch][ReFormer for PyTorch]Submit ReFormer NPU adaption and performance improvement code --- PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md index 3d038b4a48..b9a592ec1d 100644 --- a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md @@ -148,8 +148,8 @@ Reformer: The Efficient Transformer是发表在ICLR 2020的一项工作,致力 **表 2** 训练结果展示表 | NAME | Acc@1 | FPS | Epochs | Torch_Version | |:-:|:-:|:-:|:-:|:-:| -| 1p-竞品A | - | 60.428 | 1 | 1.11 | -| 8p-竞品A | 0.1934 | 467.132 | 10 | 1.11 | +| 1p-竞品A | - | 43.7914 | 1 | 1.11 | +| 8p-竞品A | 0.1931 | 330.788 | 10 | 1.11 | | 1p-NPU | - | 30.6212 | 1 | 1.11 | | 8p-NPU | 0.1931 | 233.93 | 10 | 1.11 | -- Gitee From 9e9ecff54a28a9f57f1ec396f6179370d233371f Mon Sep 17 00:00:00 2001 From: xinyanhe Date: Wed, 15 Nov 2023 17:30:12 +0800 Subject: [PATCH 3/5] [built-in][PyTorch][ReFormer for PyTorch]Submit ReFormer NPU adaption and performance improvement code --- PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md index b9a592ec1d..55549e3e89 100644 --- a/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md +++ b/PyTorch/built-in/nlp/ReFormer_for_PyTorch/README.md @@ -66,7 +66,7 @@ Reformer: The Efficient Transformer是发表在ICLR 2020的一项工作,致力 ``` ## 获取预训练模型 -用户自行下载预训练模型至目录/checkpoint/reformer-crime-and-punishment下。Reformer预训练模型目录结构参考如下所示。 +在源码包目录下新建文件夹checkpoint/reformer-crime-and-punishment,用户自行下载预训练模型至目录/checkpoint/reformer-crime-and-punishment下。Reformer预训练模型目录结构参考如下所示。 ``` ├── checkpoint ├──reformer-crime-and-punishment @@ -84,7 +84,7 @@ Reformer: The Efficient Transformer是发表在ICLR 2020的一项工作,致力 1. 获取数据集。 - 用户自行获取数据集放至目录corpus下。 + 在源码包目录下新建文件夹corpus,用户自行获取数据集放至目录corpus下。 Reformer数据集目录结构参考如下所示。 @@ -98,7 +98,7 @@ Reformer: The Efficient Transformer是发表在ICLR 2020的一项工作,致力 >该数据集的训练过程脚本只作为一种参考示例。 2. 
获取词表。 - 用户自行获取词表,并放在目录lf_token下,目录结构如下。 + 在源码包目录下新建文件夹lf_token,用户自行获取词表,并放在目录lf_token下,目录结构如下。 ``` ├── lf_token ├──config.json -- Gitee From 61d00c343c7302b3e5c6dd003b6d36c334b525dc Mon Sep 17 00:00:00 2001 From: xinyanhe Date: Tue, 5 Mar 2024 09:47:49 +0800 Subject: [PATCH 4/5] add Security Statement in README --- PyTorch/README.CN.md | 49 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/PyTorch/README.CN.md b/PyTorch/README.CN.md index 2e6b0669a9..0887a7ee1d 100644 --- a/PyTorch/README.CN.md +++ b/PyTorch/README.CN.md @@ -268,7 +268,54 @@ - [YoloV3_ID1790_for_PyTorch](https://gitee.com/ascend/ModelZoo-PyTorch/tree/master/PyTorch/built-in/cv/detection/YoloV3_ID1790_for_PyTorch) - [Yolov5_for_PyTorch_v4.0](https://gitee.com/ascend/modelzoo-GPL/tree/master/built-in/PyTorch/Official/cv/object_detection/Yolov5_for_PyTorch_v4.0) - + +## 安全声明 + +### 运行用户建议 + +出于安全性及权限最小化角度考虑,建议使用非root等管理员类型账户使用。 + +### 文件权限控制 + +1. 建议用户在主机(包括宿主机)及容器中设置运行系统umask值为0027及以上,保障新增文件夹默认最高权限为750,新增文件默认最高权限为640。 +2. 建议用户对个人数据、商业资产、源文件、训练过程中保存的各类文件等敏感内容做好权限管控,管控权限可参考表1进行设置。 +3. torch_npu中集成profiler工具,使用时会在本地生成性能分析数据,文件夹权限默认750,文件权限默认640,用户可根据实际需求对生成文件权限进行进阶管控。 + +表1 文件(夹)各场景权限管控推荐最大值 + +| 类型 | linux权限参考最大值 | +| -------------- | --------------- | +| 用户主目录 | 750(rwxr-x---) | +| 程序文件(含脚本文件、库文件等) | 550(r-xr-x---) | +| 程序文件目录 | 550(r-xr-x---) | +| 配置文件 | 640(rw-r-----) | +| 配置文件目录 | 750(rwxr-x---) | +| 日志文件(记录完毕或者已经归档) | 440(r--r-----) | +| 日志文件(正在记录) | 640(rw-r-----) | +| 日志文件目录 | 750(rwxr-x---) | +| Debug文件 | 640(rw-r-----) | +| Debug文件目录 | 750(rwxr-x---) | +| 临时文件目录 | 750(rwxr-x---) | +| 维护升级文件目录 | 770(rwxrwx---) | +| 业务数据文件 | 640(rw-r-----) | +| 业务数据文件目录 | 750(rwxr-x---) | +| 密钥组件、私钥、证书、密文文件目录 | 700(rwx—----) | +| 密钥组件、私钥、证书、加密密文 | 600(rw-------) | +| 加解密接口、加解密脚本 | 500(r-x------) | + +### 数据安全声明 + + PyTorch使用过程中需要加载和保存数据,部分接口使用风险模块pickle,可能存在数据风险,如torch.load、torch.distributed.scatter_object_list等接口,可参考[torch.load](https://pytorch.org/docs/stable/generated/torch.load.html#torch.load)、[collective-functions](https://pytorch.org/docs/stable/distributed.html#collective-functions)了解具体风险。 + +### 运行安全声明 + +1. 建议用户结合运行环境资源状况编写对应训练脚本。若训练脚本与资源状况不匹配,如数据集加载内存大小超出内存容量限制、训练脚本在本地生成数据超过磁盘空间大小等情况,可能引发错误并导致进程意外退出。 +2. PyTorch和torch_npu在运行异常时会退出进程并打印报错信息,属于正常现象。建议用户根据报错提示定位具体错误原因,包括通过设定算子同步执行、查看CANN日志、解析生成的Core Dump文件等方式。 + +### 公网地址声明 + +详见各模型目录下的public_address_statement.md + ## 免责声明 Ascend ModelZoo仅提供公共数据集下载和预处理脚本。这些数据集不属于ModelZoo,ModelZoo也不对其质量或维护负责。请确保您具有这些数据集的使用许可。基于这些数据集训练的模型仅可用于非商业研究和教育。 -- Gitee From 95211cd554df9fc2e0c66a20171ee7f9443d80b6 Mon Sep 17 00:00:00 2001 From: xinyanhe Date: Mon, 11 Mar 2024 09:51:07 +0800 Subject: [PATCH 5/5] add Security Statement in README --- PyTorch/README.CN.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/PyTorch/README.CN.md b/PyTorch/README.CN.md index 0b3a385e30..572c8c4f28 100644 --- a/PyTorch/README.CN.md +++ b/PyTorch/README.CN.md @@ -272,7 +272,7 @@ ### 运行用户建议 -出于安全性及权限最小化角度考虑,建议使用非root等管理员类型账户使用。 +出于安全性及权限最小化角度考虑,不建议使用root等管理员类型账户使用。 ### 文件权限控制 -- Gitee
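
补充示例(仅供参考):以下 shell 片段演示上文"文件权限控制"建议中 umask 0027 及 750/640 权限的一种设置方式;其中 output_dir、train.log 仅为示意用的名称,并非本仓库实际文件,请结合实际训练输出路径调整。

```shell
# 设置 umask 为 0027:新建目录默认权限 750,新建文件默认权限 640
umask 0027

# 示意:创建训练输出目录与日志文件并查看默认权限(路径仅为示例)
mkdir -p output_dir && touch output_dir/train.log
ls -ld output_dir            # 预期:drwxr-x---
ls -l  output_dir/train.log  # 预期:-rw-r-----

# 对已有的敏感文件/目录,可参照表1手动收紧权限
chmod 750 output_dir
chmod 640 output_dir/train.log
```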