# mixture-of-experts

**Repository Path**: whirlwindMo/mixture-of-experts

## Basic Information

- **Project Name**: mixture-of-experts
- **Description**: MoE with Database -- one simple pre work
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2023-11-25
- **Last Updated**: 2024-03-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# MoEC

## Introduction to FairSeq
Install it by following the [MoEC repository's instructions](#setup).

[Official documentation](https://fairseq.readthedocs.io/)

[Official GitHub repository](https://github.com/facebookresearch/fairseq/tree/main)

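1. FairSeq is an NLP framework developed by Facebook. It bundles data preprocessing (downloading, tokenization, and so on), recent models (BERT, T5, BART, GPT-series), and many training setups (multi-node multi-GPU, single-node multi-GPU, single-GPU, mixed-precision training, warmup, etc.).
2. It supports custom learning tasks. Some predefined tasks are listed below; the full set is in `fairseq/fairseq/tasks`:
    - Language Modeling (this is pretraining: GPT uses causal LM, predicting the next word; BERT uses masked LM, randomly masking 15% of a sentence and predicting the masked 15%)
    - Translation
    - Text to Speech
    - ...

    **We don't need a custom task here; we only use the existing task framework as the baseline.**
3. It supports custom network architectures; see https://fairseq.readthedocs.io/en/latest/models.html#adding-new-models for details. Existing architectures include: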
    - CNN Based Models
    - LSTM/RNN
    - Transformer Based Models

    A custom model requires registering both the model and its architecture:
    - `@register_model('lstm')`
    - `@register_model_architecture('lstm', 'lstm_luong_wmt_en_de')`

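    A model is the algorithm itself; an architecture is a particular parameter scale or task configuration of it. For example, if GPT is the model, its small, medium, large, and XL sizes are architectures, so we would register four architectures: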
    - `@register_model_architecture('GPT', 'GPT_small')`
    - `@register_model_architecture('GPT', 'GPT_large')`
    - ...

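    A concrete example is at line 344 of `fairseq/fairseq/models/masked_lm.py`, where masked_lm registers architectures of several sizes (bert_base, large, XL, and so on). To use an architecture, pass its name on the command line, e.g. `--arch lstm_luong_wmt_en_de`.

    This step matters: our algorithmic changes are mainly in the model, so we need to register a new model. See the code under `fairseq/fairseq/models/` and the official documentation at https://fairseq.readthedocs.io/en/latest/models.html#adding-new-models to learn how.

4. It supports custom loss functions, optimizers, and so on (which also need to be registered, like models). We don't need to modify these here, so we skip them.
5. Many of the papers I read are built on FairSeq, and one question remains open for me: should we pre-train first and then run downstream tasks (the GPT-2 approach), or train directly on the downstream dataset? I haven't figured this out yet and will decide as I go. Also, I couldn't find where to download the `RoBERTa+cc100en corpus` mentioned in StableMoE, which is frustrating. In short, I can run the code, but I lack the background knowledge and can't debug it yet (everything runs in distributed training mode, and I don't know how to debug that).
6. For more details, see the Jupyter notebooks in `./notebooks`.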
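
The model/architecture split described in item 3 can be illustrated with a small stand-in registry. This is a minimal sketch, not fairseq's actual implementation: the two decorators only mimic the names `register_model` and `register_model_architecture`, and `ToyGPT`, `toy_gpt_small`, `toy_gpt_large`, and `build_model` are hypothetical names made up for the illustration. It assumes an architecture function only fills in defaults that the command line has not already set.

```python
# Toy stand-in for a model/architecture registry; NOT fairseq's real code.
MODEL_REGISTRY = {}        # model name -> model class
ARCH_MODEL_REGISTRY = {}   # architecture name -> model class
ARCH_CONFIG_REGISTRY = {}  # architecture name -> function that fills defaults


def register_model(name):
    """Register a model class (the algorithm) under `name`."""
    def wrapper(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper


def register_model_architecture(model_name, arch_name):
    """Register a named hyperparameter preset for an already registered model."""
    def wrapper(fn):
        ARCH_MODEL_REGISTRY[arch_name] = MODEL_REGISTRY[model_name]
        ARCH_CONFIG_REGISTRY[arch_name] = fn
        return fn
    return wrapper


@register_model("toy_gpt")
class ToyGPT:  # hypothetical model, for illustration only
    def __init__(self, args):
        self.layers = args["layers"]
        self.embed_dim = args["embed_dim"]


@register_model_architecture("toy_gpt", "toy_gpt_small")
def toy_gpt_small(args):
    args.setdefault("layers", 6)
    args.setdefault("embed_dim", 512)


@register_model_architecture("toy_gpt", "toy_gpt_large")
def toy_gpt_large(args):
    args.setdefault("layers", 24)
    args.setdefault("embed_dim", 1024)


def build_model(arch, **cli_overrides):
    """Roughly what `--arch NAME` selects: a model class plus default settings."""
    args = dict(cli_overrides)
    ARCH_CONFIG_REGISTRY[arch](args)   # fill in defaults not set explicitly
    return ARCH_MODEL_REGISTRY[arch](args)


model = build_model("toy_gpt_large", layers=12)
print(type(model).__name__, model.layers, model.embed_dim)  # ToyGPT 12 1024
```

Both architectures map to the same class; only the defaults differ, which is why choosing a size on the command line needs nothing more than the architecture name.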
## Pretraining Task
- Frustratingly, the authors have not released the pretraining code; see [this issue](https://github.com/xy980523/MoEc_model/issues/1) for details:
    ```
    Thank you for your attention. This repo currently only supports ''--task translation'', not ''--task pretraining''. The code related to pretraining will not be open source
    ```
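- So far I only know how to work with the `wikitext-103` dataset, which is small. The tutorial is at https://github.com/facebookresearch/fairseq/tree/main/examples/language_model , or see the `fairseq/examples/language_model` folder.

- The [original repository's README](#training) below has the shell commands for the translation task. Data preprocessing can still follow the shell scripts in `fairseq/examples/translation` to download and preprocess the data.

- Training a plain Transformer (transformer_lm, 6 layers, decoder only) on two 4090 GPUs takes about 3 hours per epoch, and that is only on `wikitext-103`, which is a bit over 100 MB. On the `RoBERTa+cc100en corpus` (roughly 30 GB), training would take impractically long.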

    ```text
    TransformerLanguageModel(
      (decoder): TransformerDecoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(267744, 512, padding_idx=1)
        (embed_positions): SinusoidalPositionalEmbedding()
        (layers): ModuleList(
          (0-5): 6 x TransformerDecoderLayerBase(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
        (output_projection): Linear(in_features=512, out_features=267744, bias=False)
      )
    )
    ```
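
To get a feel for the size of the model printed above, its parameter count can be worked out from the listed layer shapes. This is a rough sketch that assumes the sinusoidal positional embedding and the dropout modules hold no learned parameters, and that `embed_tokens` and `output_projection` share one weight matrix when `--share-decoder-input-output-embed` is set; the helper `linear` is introduced only for this example.

```python
# Shapes copied from the printed TransformerLanguageModel.
vocab_size, embed_dim, ffn_dim, num_layers = 267744, 512, 2048, 6


def linear(n_in, n_out, bias=True):
    """Parameter count of a Linear layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


embedding = vocab_size * embed_dim                 # embed_tokens
attention = 4 * linear(embed_dim, embed_dim)       # q, k, v and out projections
layer_norms = 2 * (2 * embed_dim)                  # two LayerNorms, weight + bias each
ffn = linear(embed_dim, ffn_dim) + linear(ffn_dim, embed_dim)  # fc1 + fc2
per_layer = attention + layer_norms + ffn
output_projection = linear(embed_dim, vocab_size, bias=False)

total_tied = embedding + num_layers * per_layer    # output weights shared with embedding
total_untied = total_tied + output_projection      # if the two matrices were separate

print(f"per decoder layer: {per_layer:,}")
print(f"all {num_layers} layers:      {num_layers * per_layer:,}")
print(f"embedding:         {embedding:,}")
print(f"total (tied):      {total_tied:,}")
print(f"total (untied):    {total_untied:,}")
```

Under these assumptions the vocabulary embedding (about 137M parameters) dwarfs the six decoder layers (about 19M), so the large `wikitext-103` vocabulary accounts for most of the model's size.
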
1. Download the dataset
    ```bash
    cd examples/language_model/
    bash prepare-wikitext-103.sh
    cd ../..
    ```
2. Preprocess: convert the dataset into tokens and store them in bin files
    ```bash
    TEXT=examples/language_model/wikitext-103
    fairseq-preprocess \
        --only-source \
        --trainpref $TEXT/wiki.train.tokens \
        --validpref $TEXT/wiki.valid.tokens \
        --testpref $TEXT/wiki.test.tokens \
        --destdir data-bin/wikitext-103 \
        --workers 20
    ```

3. Run the pretraining task
    ```bash
    fairseq-train --task language_modeling \
            data-bin/wikitext-103 \
            --save-dir checkpoints/transformer_wikitext-103 \
            --log-format simple \
            --log-file ./logs/lm/train.log \
            --arch transformer_lm \
            --share-decoder-input-output-embed \
            --dropout 0.1 \
            --optimizer adam \
            --adam-betas '(0.9, 0.98)' \
            --weight-decay 0.01 \
            --clip-norm 0.0 \
            --lr 0.0005 \
            --lr-scheduler inverse_sqrt \
            --warmup-updates 4000 \
            --warmup-init-lr 1e-07 \
            --tokens-per-sample 512 \
            --sample-break-mode none \
            --max-tokens 512 \
            --update-freq 16 \
            --fp16 \
            --max-update 50000
    ```
    If you run out of memory, try reducing `--max-tokens` (max number of tokens per batch) or `--tokens-per-sample` (max sequence length). You can also adjust `--update-freq` to accumulate gradients and simulate training on a different number of GPUs.

4. Evaluate
    ```bash
    fairseq-eval-lm data-bin/wikitext-103 \
        --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
        --batch-size 2 \
        --tokens-per-sample 512 \
        --context-window 400
    # | Evaluated 245569 tokens in 56.1s (4379.02 tokens/s)
    # | Loss: 3.4164, Perplexity: 30.46
    ```
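
Two quick sanity checks on the numbers in steps 3 and 4, written as a small sketch. The GPU count of 2 is an assumption taken from the dual-4090 setup mentioned earlier; the tokens-per-update figure is an upper bound, since `--max-tokens` caps a batch rather than fixing its size. The second check assumes the loss in the sample log is measured in nats; other fairseq versions may report it in base 2.

```python
import math

# Values from the fairseq-train command in step 3.
max_tokens = 512     # --max-tokens: cap on tokens per batch per GPU
update_freq = 16     # --update-freq: batches accumulated per optimizer step
num_gpus = 2         # assumption: the dual-GPU machine described above

# Upper bound on tokens that contribute to one optimizer step.
tokens_per_update = max_tokens * update_freq * num_gpus

# To keep the same effective batch on a single GPU, scale --update-freq instead.
single_gpu_update_freq = update_freq * num_gpus
assert max_tokens * single_gpu_update_freq * 1 == tokens_per_update

# Values from the sample fairseq-eval-lm output in step 4.
loss = 3.4164
perplexity = math.exp(loss)  # perplexity = exp(loss) when loss is in nats

print(f"tokens per update (upper bound): {tokens_per_update}")
print(f"--update-freq for one GPU:       {single_gpu_update_freq}")
print(f"perplexity from loss:            {perplexity:.2f}")
```

The computed perplexity agrees with the 30.46 shown in the sample log, which is what suggests the natural-log convention for that output.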

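## Translation Task
You can run this by following the [original repository's README](#original-repository-readme) below. I haven't tried it yet; a good starting point is a simple model such as the classic transformer. Also, judging from the command line the authors provide, they do not seem to pre-train on a large dataset and then fine-tune on the translation dataset; they train directly on the translation dataset.

## Original Repository README
Code for the paper MoEC: Mixture of Expert Clusters (https://arxiv.org/abs/2207.09094),
accepted at AAAI 2023 (https://ojs.aaai.org/index.php/AAAI/article/view/26617/26389).

Please follow the [fairseq documentation](https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model) for data pre-processing.

Note: in the authors' repository, `--arch gdmoe` is the MoEC algorithm proposed by the authors, while `--arch tmoe` is [X-moe](https://arxiv.org/pdf/2204.09179.pdf), used as a strong baseline.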
## Setup
Build:
```bash
# pip install --user -e fairseq/
# pip install --user -e infinibatch/
# The --user option installs into ~/.local, which conflicts with our conda environment, so it is dropped here; --default-timeout 1000 is added to avoid download timeouts
pip install -e fairseq/ --default-timeout 1000
pip install -e infinibatch/ --default-timeout 1000
pip install -U numpy --default-timeout 1000
```

## Data Pre-processing
We take machine translation as an example.

```bash
# Download and prepare the data
cd examples/translation/
# WMT'17 data:
bash prepare-wmt14en2de.sh
# or to use WMT'14 data:
# bash prepare-wmt14en2de.sh --icml17
cd ../..

# Binarize the dataset
TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20
```


## Training
```bash
python -m torch.distributed.launch --nproc_per_node=8 train.py /path/wmt17_en_de_data/ \
        --save-dir /path/moec64/ckpt \
        --tensorboard-logdir /path/moec64/tb_logs \
        --log-format simple  --log-file /path/moec64/train.log \
        --arch gdmoe_wmt_en_de \
        --encoder-normalize-before \
        --task translation \
        --truncate-source \
        --max-source-positions 256 \
        --max-target-positions 256 \
        --criterion label_smoothed_cross_entropy_moe --label-smoothing 0.1 \
        --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
        --lr-scheduler inverse_sqrt --lr 5e-04 --warmup-init-lr 1e-07 --stop-min-lr 1e-09 --warmup-updates 250 \
        --max-update 32000 \
        --attention-dropout 0.1 --dropout 0.3 \
        --max-tokens 4096 --update-freq 16 \
        --seed 1 \
        --skip-invalid-size-inputs-valid-test --fp16 --fp16-no-flatten-grads \
        --ddp-backend=no_c10d \
        --token-shuffle --moe-gate-loss-wt 0.01  --moe-gate-loss-combine-method sum \
        --no-epoch-checkpoints --clip-norm 0.1 \
        --encoder-moe-layers 3 --decoder-moe-layers 3 \
        --moe-top1-expert \
        --moe-sublayers 3 \
        --moe-expert-count 64 \
        --moe-gating-use-fp32 --tmoe-routing-dim-reduction \
        --tmoe-routing-dim 32 \
        --tmoe-routing-hard-cosine \
        --moe-activation-dropout 0.0 --moe-dropout 0.0 \
        --capacity-factor 2 \
        --sharded-save \
        --group-num 8 --exp-level-drop 0.5  --dropout-interval 250 --var-coef 1.0 --coef-type 1
        
```
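
The `--lr-scheduler inverse_sqrt` settings in the command above can be sketched as a plain function to see how the learning rate evolves. This is a minimal sketch of the usual inverse-square-root schedule (linear warmup, then decay proportional to 1/sqrt(step)), written from its general definition rather than copied from the fairseq source; the function name `inverse_sqrt_lr` is made up for this example.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=250):
    """Learning rate after `step` updates, using the values from the command above."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Afterwards decay with 1/sqrt(step); continuous with the warmup at the peak.
    return peak_lr * (warmup_updates / step) ** 0.5


for step in (0, 125, 250, 1000, 4000, 32000):
    print(f"update {step:>6}: lr = {inverse_sqrt_lr(step):.3e}")
```

With `--warmup-updates 250` the peak of 5e-4 is reached almost immediately relative to the 32000 total updates, and the rate has already halved by update 1000.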

## Inference
```bash
python -m torch.distributed.launch --nproc_per_node=1 generate_moe.py /path/wmt17_en_de_data/  \
    --path /path/moec64/ckpt/checkpoint_best.pt \
    --arch gdmoe_wmt_en_de \
    --task translation \
    --batch-size 128 --beam 5 \
    --model-overrides "{'coef_type':'1','encoder_moe_layers':'3', 'decoder_moe_layers':'3', 'moe_top1_expert':True, 'moe_sublayers':3, 'moe_expert_count':64,  'tmoe_routing_dim':32}"
```
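
The `--model-overrides` value above is a Python dict literal passed as a string. As a small sketch, assuming the string is evaluated as a Python literal on the receiving side, the snippet below parses it with `ast.literal_eval` to show which values arrive as strings and which as integers or booleans. Whether the MoEC code expects `'3'` or `3` for a given key is not stated in the source, so the mixed types are only reported here, not corrected.

```python
import ast

# The exact string from the inference command above.
overrides_str = (
    "{'coef_type':'1','encoder_moe_layers':'3', 'decoder_moe_layers':'3', "
    "'moe_top1_expert':True, 'moe_sublayers':3, 'moe_expert_count':64,  "
    "'tmoe_routing_dim':32}"
)

# literal_eval accepts only Python literals, so it is safe for untrusted input.
overrides = ast.literal_eval(overrides_str)

for key, value in overrides.items():
    print(f"{key:<20} {value!r:<6} {type(value).__name__}")
```

A malformed string (for example an unbalanced quote) raises an exception here, which makes this a cheap way to validate the argument before launching a multi-GPU job.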