# SDP4Bit

This repository is the official implementation of the paper **[SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training](https://arxiv.org/abs/2410.15526)**.

## Overview

SDP4Bit is a communication quantization strategy designed to reduce the communication overhead of large-scale distributed training under Sharded Data Parallelism (ShardedDP). By quantizing weight differences and applying two-level gradient smooth quantization, SDP4Bit reduces the communication of weights and gradients to nearly 4 bits without compromising accuracy.

## Reproducing Paper Results

### Preparing the Data

For data processing we follow the [data preprocessing instructions](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing) in the official Megatron-LM repository. We use the [**Pile deduplicated dataset**](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated) hosted on Hugging Face as our training dataset. For the vocabulary and merges files, we use the same files as the GPT-2 model.

**Download**

```
import os
from datasets import load_dataset
from huggingface_hub import hf_hub_download

save_path = "./data"                 # output directory
dataset_output_name = "pile.jsonl"   # exported JSONL file name

train_data = load_dataset('EleutherAI/the_pile_deduplicated', split='train', num_proc=16)
train_data.to_json(os.path.join(save_path, dataset_output_name), lines=True)
hf_hub_download(repo_id="gpt2", filename="merges.txt", local_dir=save_path)
hf_hub_download(repo_id="gpt2", filename="vocab.json", local_dir=save_path)
```

**Data Processing**

We use the [preprocessing script](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py) from the Megatron-LM repository on the dataset downloaded in the previous step.

```
python preprocess_data.py \
    --input pile.jsonl \
    --split train \
    --columns text \
    --output-prefix pile \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --torch-backend mpi
```

### Reproducing Accuracy Results

![Accuracy test results table](https://github.com/jindajia/SDP4Bit/raw/main/Figures/accuracy_test_table.png)

All models are trained for a total of 80,000 iterations, with the learning rate configured according to the GPT-2 settings. Note: within each experimental group, the same model uses an identical training configuration; only the quantization configuration changes, to ensure a fair comparison. The model configurations and detailed sample training scripts are provided below.
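In addition to the model, optimizer, and training argument groups listed below, each run needs data arguments pointing at the preprocessed dataset from the previous step (Megatron-LM typically names the indexed output after `--output-prefix` and the data column, e.g. `pile_text_document.bin` / `pile_text_document.idx`). Below is a minimal sketch, assuming the standard Megatron-LM data arguments; the paths are placeholders, and `$DATA_ARGS` is reused in the launch sketch further below.

```
DATA_ARGS="
    --data-path pile_text_document \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --tokenizer-type GPT2BPETokenizer \
"
```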
**Model Card**

**GPT 125M**

```
MODEL_ARGS="
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
"

OPTIMIZER_ARGS="
    --lr 0.0006 \
    --lr-decay-iters 70000 \
    --lr-decay-style cosine \
    --min-lr 0.00006 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-08 \
    --weight-decay .1 \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --loss-scale 0 \
    --loss-scale-window 1000 \
    --hysteresis 2 \
    --min-loss-scale 1 \
    --bf16 \
    --use-distributed-optimizer \
"

TRAINING_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --micro-batch-size 8 \
    --global-batch-size 256 \
    --train-iters 80000 \
"
```

**GPT 350M**

```
MODEL_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
"

TRAINING_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --micro-batch-size 8 \
    --global-batch-size 256 \
    --train-iters 80000 \
"

OPTIMIZER_ARGS="
    --lr 0.0003 \
    --lr-decay-iters 70000 \
    --lr-decay-style cosine \
    --min-lr 0.00003 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-08 \
    --weight-decay .1 \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --loss-scale 0 \
    --loss-scale-window 1000 \
    --hysteresis 2 \
    --min-loss-scale 1 \
    --bf16 \
    --use-distributed-optimizer \
"
```

**GPT 1.3B**

```
MODEL_ARGS="
    --num-layers 24 \
    --hidden-size 2048 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
"

TRAINING_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --micro-batch-size 2 \
    --global-batch-size 256 \
    --train-iters 80000 \
"

OPTIMIZER_ARGS="
    --lr 0.0002 \
    --lr-decay-iters 70000 \
    --lr-decay-style cosine \
    --min-lr 0.00002 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-08 \
    --weight-decay .1 \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --loss-scale 0 \
    --loss-scale-window 1000 \
    --hysteresis 2 \
    --min-loss-scale 1 \
    --bf16 \
    --use-distributed-optimizer \
"
```

**GPT 6.7B**

```
MODEL_ARGS="
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
"

OPTIMIZER_ARGS="
    --lr 0.00012 \
    --lr-decay-iters 70000 \
    --lr-decay-style cosine \
    --min-lr 0.000012 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-08 \
    --weight-decay .1 \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --loss-scale 0 \
    --loss-scale-window 1000 \
    --hysteresis 2 \
    --min-loss-scale 1 \
    --bf16 \
    --use-distributed-optimizer \
"

TRAINING_ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --micro-batch-size 2 \
    --global-batch-size 256 \
    --train-iters 80000 \
"
```

**Sample Training Scripts**

| Model | Baseline | qWD | TLq | TLq-HS | SDP4Bit |
|--|--|--|--|--|--|
| 125M | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/accuracy/125M/baseline/train.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/accuracy/125M/quantWeightDiff/train.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/accuracy/125M/quantGradwithoutHT/train.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/accuracy/125M/quantGrad/train.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/accuracy/125M/quantWeightDiff_Grad/train.sh) |
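The linked scripts contain the complete commands. For orientation only, here is a minimal, hypothetical sketch of how the argument groups above (together with `$DATA_ARGS` from the data preparation step) would be combined in a Megatron-LM style launch; the actual entry point, node layout, logging, and checkpointing flags should be taken from the sample scripts.

```
# Single-node example; GPU count, paths, and intervals are placeholders.
torchrun --nproc_per_node 8 pretrain_gpt.py \
    $MODEL_ARGS \
    $TRAINING_ARGS \
    $OPTIMIZER_ARGS \
    $DATA_ARGS \
    --save /path/to/checkpoints \
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000
```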
### Reproducing Speed Test Results

![Speed test results table](https://github.com/jindajia/SDP4Bit/raw/main/Figures/speed_test_table.png)

We provide detailed speed test scripts for H800 below. Please note that an H800 node contains 8 GPUs while an A100 node contains 4 GPUs, so we adjust the tensor-parallel and pipeline-parallel sizes accordingly; this ensures that the tensor-parallel size never exceeds the number of GPUs in a node. The parallel configurations and training scripts are provided below.

| Model Size | TP | PP | Accumulation Step |
|------------|-----------------|-----------------|-------------------|
| 1.3B | 1 | 1 | 1 |
| 2.7B | 1 | 1 | 1 |
| 6.7B | 4 | 1 | 1 |
| 13B | 4 (A100) / 8 (H800) | 2 (A100) / 1 (H800) | 1 |
| 18B | 4 (A100) / 8 (H800) | 2 (A100) / 1 (H800) | 1 |

**Training Scripts**

| Model Size | Baseline | SDP4Bit |
|------------|----------------------|----------------------|
| 1.3B | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/1_3B_Baseline.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/1_3B_QWG.sh) |
| 2.7B | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/2_7B_Baseline.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/2_7B_QWG.sh) |
| 6.7B | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/6_7B_Baseline.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/6_7B_QWG.sh) |
| 13B | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/13B_Baseline.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/13B_QWG.sh) |
| 18B | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/18B_Baseline.sh) | [link](https://github.com/jindajia/Megatron-LM/blob/jinda/final_speed_test/sample_scripts/speed/Exp1/18B_QWG.sh) |

## Arguments Usage

Below are all the quantization arguments you may use for your own runs; a combined example is sketched after the argument lists.

### Weight Quantization Arguments

- `--quantized-weights`
  - Weight communication is quantized when this flag is enabled.
  - Default: not enabled
- `--weight-quantization-bits 4`
  - Specifies the number of bits used for quantizing weights.
  - Default: 4
- `--wq-group-size 2048`
  - Defines the group size for weight quantization.
  - Default: 2048

### Gradient Quantization Arguments

- `--quantized-gradients`
  - Gradient communication is quantized when this flag is enabled.
  - Default: not enabled
- `--gq-group-size-inter 128`
  - Defines the group size for gradient quantization between nodes (inter-node).
  - Default: 128
- `--gradient-quantization-bits-inter 4`
  - Specifies the number of bits used for inter-node gradient quantization.
  - Default: 4
- `--gq-group-size-intra 128`
  - Defines the group size for gradient quantization within nodes (intra-node).
  - Default: 512
- `--gradient-quantization-bits-intra 8`
  - Specifies the number of bits used for intra-node gradient quantization.
  - Default: 8
- `--hadamard-transform`
  - Enable this to reduce gradient quantization error.
  - Default: not enabled
- `--gradient-alltoall-pipeline 8`
  - Chunks gradients to overlap intra-node and inter-node communication.
  - Default: 1
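As an illustration only, a combined quantization setting might look like the sketch below. The group sizes and bit widths are simply the example values from the lists above, and `QUANT_ARGS` is a hypothetical variable name; consult the linked SDP4Bit sample scripts for the exact values used in the paper's experiments.

```
# Hypothetical combined quantization settings.
QUANT_ARGS="
    --quantized-weights \
    --weight-quantization-bits 4 \
    --wq-group-size 2048 \
    --quantized-gradients \
    --gradient-quantization-bits-inter 4 \
    --gq-group-size-inter 128 \
    --gradient-quantization-bits-intra 8 \
    --gq-group-size-intra 512 \
    --hadamard-transform \
    --gradient-alltoall-pipeline 8 \
"

# Appended to the training command alongside the other argument groups, e.g.:
#   torchrun ... pretrain_gpt.py $MODEL_ARGS $TRAINING_ARGS $OPTIMIZER_ARGS $DATA_ARGS $QUANT_ARGS
```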
### Additional Settings

- `--no-async-tensor-model-parallel-allreduce`
  - This should be enabled so that intra-node and inter-node all-to-all communication can be overlapped.

---

**Note:** The implementation of SDP4Bit is built upon the official [NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM/commit/bd6f4ead41dac8aa8d50f46253630b7eba84bcdf), leveraging its optimized framework for large-scale language model training.

## Citation

If you find this work helpful, please kindly cite it as:

```bibtex
@inproceedings{jiasdp4bit,
  title={SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training},
  author={Jia, Jinda and Xie, Cong and Lu, Hanlin and Wang, Daoce and Feng, Hao and Zhang, Chengming and Sun, Baixi and Lin, Haibin and Zhang, Zhi and Liu, Xin and others},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}
```