# llm-serving

**Repository Path**: takaer3/llm-serving

## Basic Information

- **Project Name**: llm-serving
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 14
- **Created**: 2024-04-19
- **Last Updated**: 2024-04-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# MindSpore LLM-Serving

### Serving is a fast and easy-to-use LLM inference framework

---

### Features

- model parallel deployment
- token streaming via Server-Sent Events
- custom model inputs
- static/continuous batching of prompts
- post sampling on the NPU
- PagedAttention
- KBK inference

### Supports the most popular LLMs, including the following architectures:

- LLaMA-2
- InternLM
- Baichuan2
- WizardCoder

### Get Started

#### Environment dependencies

- python 3.9
- [mindspore](https://www.mindspore.cn/install)
- [mindspore-lite](https://www.mindspore.cn/lite/docs/zh-CN/master/use/downloads.html?highlight=%E5%AE%89%E8%A3%85)
- [mindformers-dev](https://gitee.com/mindspore/mindformers#%E4%BA%8Cmindformers%E5%AE%89%E8%A3%85)
- easydict
- transformers==4.35.0

#### One-step installation of the whl package

```shell
bash build.sh
pip install mindspore-serving-xxx.whl
```

Note: post sampling is currently executed as part of a compiled graph. Before using serving, re-export the post-processing models with `post_sampling_model.py` so that their data type matches the output type of the LLM model:

```shell
python tools/post_sampling_model.py --output_dir ./target_dir
# args
# output_dir: directory where the post-processing models are generated
```

#### Modify the model's configuration file

##### Configuration with PagedAttention

###### yaml file

In the model's configuration file `configs/<model name>/xxx.yaml`, users can adapt the model and enable PagedAttention (PA) through the `page_attention` field: set it to `True` to enable PA and add a `pa_config` section with parameters chosen for the specific model.

```
model_path:
    prefill_model: ["/path/to/baichuan2/output_serving/mindir_full_checkpoint/rank_0_graph.mindir"]
    decode_model: ["/path/to/baichuan2/output_serving/mindir_inc_checkpoint/rank_0_graph.mindir"]
    argmax_model: "/path/to/serving/target_dir/argmax.mindir"
    topk_model: "/path/to/target_dir/topk.mindir"
    prefill_ini: ["/path/to/baichuan_ini/910b_default_ctx.cfg"]
    decode_ini: ["/path/to/baichuan_ini/910b_default_inc.cfg"]
    post_model_ini: "/path/to/serving/target_dir/config.ini"

model_config:
    model_name: 'baichuan2'
    max_generate_length: 4096
    end_token: 2
    seq_length: [1024, 2048, 4096]  # multiple seq-length tiers are supported
    vocab_size: 125696
    prefill_batch_size: [1]   # with PA, only a single prefill batch is supported
    decode_batch_size: [16]   # with PA, multi-batch decode is supported, but not multiple tiers
    zactivate_len: [4096]
    model_type: 'dyn'
    page_attention: True      # True enables the model's PA feature
    current_index: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0

pa_config:                    # configure this section when PA is enabled; set the parameters according to the model
    num_blocks: 512
    block_size: 16
    decode_seq_length: 4096

serving_config:
    agent_ports: [61166]
    start_device_id: 0
    server_ip: 'localhost'
    server_port: 61155

tokenizer:
    type: LlamaTokenizer
    vocab_file: '/path/to/llama_pa_models/output/tokenizer_llama2_13b.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
```

###### prefill_ini

```
[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

[graph_kernel_param]
opt_level=2
enable_cce_lib=true
disable_cce_lib_ops=MatMul
disable_cluster_ops=MatMul,Reshape

[ge_graph_options]
ge.inputShape=batch_valid_length:1;slot_mapping:-1;tokens:1,-1
ge.dynamicDims=64,64;128,128;256,256;512,512;1024,1024;2048,2048;4096,4096 # must include the seq_length tiers [1024, 2048, 4096] from model_config in the yaml file
ge.dynamicNodeType=1
```

###### decode_ini

```
[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

[graph_kernel_param]
opt_level=2
enable_cce_lib=true
disable_cce_lib_ops=MatMul
disable_cluster_ops=MatMul,Reshape

[ge_graph_options]
ge.inputShape=batch_valid_length:-1;block_tables:-1,256;slot_mapping:-1;tokens:-1,1 # the 256 in block_tables is decode_seq_length / block_size from pa_config in the yaml
ge.dynamicDims=1,1,1,1;2,2,2,2;8,8,8,8;16,16,16,16;64,64,64,64 # each group has as many values as there are "-1" entries in ge.inputShape (4 here, hence four 1s, four 2s, ...), and the groups must include the batch sizes from model_config.decode_batch_size in the yaml
ge.dynamicNodeType=1
```
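The relationships called out in the comments above can be derived mechanically from the yaml values. The following standalone Python sketch is not part of this repository; it simply reproduces the `ge.inputShape` and `ge.dynamicDims` strings for the PA decode graph from the example configuration above, so you can check your own values the same way.

```python
# Hypothetical helper (not part of this repository): derive the GE shape options
# for the PagedAttention decode graph from the example yaml values above.
decode_seq_length = 4096                 # pa_config.decode_seq_length
block_size = 16                          # pa_config.block_size
decode_batch_sizes = [1, 2, 8, 16, 64]   # must cover model_config.decode_batch_size ([16])

# block_tables second dimension = decode_seq_length / block_size = 4096 / 16 = 256
block_tables_width = decode_seq_length // block_size
input_shape = (
    f"batch_valid_length:-1;block_tables:-1,{block_tables_width};"
    "slot_mapping:-1;tokens:-1,1"
)

# ge.inputShape contains four "-1" entries, so every dynamicDims group has four
# values: the batch size repeated once per dynamic input.
num_dynamic = input_shape.count("-1")
dynamic_dims = ";".join(",".join([str(b)] * num_dynamic) for b in decode_batch_sizes)

print("ge.inputShape =", input_shape)
print("ge.dynamicDims =", dynamic_dims)
# -> 1,1,1,1;2,2,2,2;8,8,8,8;16,16,16,16;64,64,64,64
```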
##### Configuration without PagedAttention

```
model_path:
    prefill_model: ["/path/to/llama_pa_models/no_act/output_no_act_len/output/mindir_full_checkpoint/rank_0_graph.mindir"]
    decode_model: ["/path/to/llama_pa_models/no_act/output_no_act_len/output/mindir_inc_checkpoint/rank_0_graph.mindir"]
    argmax_model: "/path/to/serving_dev/extends_13b/argmax.mindir"
    topk_model: "/path/to/serving_dev/extends_13b/topk.mindir"
    prefill_ini: ["/path/to/llama_pa_models/no_act/ini/910b_default_prefill.cfg"]
    decode_ini: ["/path/to/llama_pa_models/no_act/ini/910_inc.cfg"]
    post_model_ini: "/path/to/baichuan/config/config.ini"

model_config:
    model_name: 'llama_dyn'
    max_generate_length: 8192
    end_token: 2
    seq_length: [512, 1024]
    vocab_size: 32000
    prefill_batch_size: [64]                  # multiple tiers are not supported, only multi-batch
    decode_batch_size: [1, 4, 8, 16, 30, 64]  # multiple tiers are supported
    zactivate_len: [512]
    model_type: "dyn"                         # defaults to "dyn" if omitted; if present, model_type must be specified
    current_index: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0

serving_config:
    agent_ports: [11330]
    start_device_id: 5
    server_ip: 'localhost'
    server_port: 19200

tokenizer:
    type: LlamaTokenizer
    vocab_file: '/path/to/llama_pa_models/output/tokenizer_llama2_13b.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
```

###### prefill_ini

```
[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype
```

###### decode_ini

```
[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

[ge_graph_options]
ge.inputShape=batch_index:-1;batch_valid_length:-1;tokens:-1,1;zactivate_len:-1
ge.dynamicDims=1,1,1,512;4,4,4,512;8,8,8,512;16,16,16,512;30,30,30,512;64,64,64,512 # the decode tiers come from model_config.decode_batch_size in the yaml; the "512" comes from zactivate_len
ge.dynamicNodeType=1
```
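As with the PA case, the decode tiers here follow directly from `decode_batch_size` and `zactivate_len`. A minimal sketch (again not part of the repository, using the example values above) that rebuilds the `ge.dynamicDims` string for the non-PA decode graph:

```python
# Hypothetical helper (not part of this repository): build ge.dynamicDims for the
# non-PagedAttention decode graph from the example yaml values above.
decode_batch_sizes = [1, 4, 8, 16, 30, 64]  # model_config.decode_batch_size
zactivate_len = 512                         # model_config.zactivate_len[0]

# ge.inputShape = batch_index:-1;batch_valid_length:-1;tokens:-1,1;zactivate_len:-1
# The first three dynamic axes follow the batch size; the last one is the activation length.
groups = [f"{b},{b},{b},{zactivate_len}" for b in decode_batch_sizes]
print("ge.dynamicDims =", ";".join(groups))
# -> 1,1,1,512;4,4,4,512;8,8,8,512;16,16,16,512;30,30,30,512;64,64,64,512
```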
##### WizardCoder configuration (static shape)

```
model_path:
    prefill_model: ["/path/to/prefill_model.mindir"]
    decode_model: ["/path/to/decode_model.mindir"]
    argmax_model: "/path/to/argmax.mindir"
    topk_model: "/path/to/topk.mindir"
    prefill_ini: ['/path/to/lite.ini']
    decode_ini: ['/path/to/lite.ini']
    post_model_ini: '/path/to/lite.ini'

model_config:
    model_name: 'llama_dyn'
    max_generate_length: 4096
    end_token: 0
    seq_length: [2048]
    vocab_size: 49153
    prefill_batch_size: [1]
    decode_batch_size: [1]
    zactivate_len: [2048]
    model_type: 'static'
    seq_type: 'static'
    batch_waiting_time: 0.0
    decode_batch_waiting_time: 0.0
    batching_strategy: 'continuous'
    current_index: False
    page_attention: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 49152

serving_config:
    agent_ports: [9980]
    start_device_id: 0
    server_ip: 'localhost'
    server_port: 12359

tokenizer:
    type: WizardCoderTokenizer
    vocab_file: '/path/to/transformers_config'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
```

###### lite_ini

```
[ascend_context]
plugin_custom_ops=All
provider=ge

[ge_session_options]
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
```

##### yaml configuration for MindSpore KBK unified training and inference

###### yaml configuration with PA

```
model_path:
    prefill_model: ["/path/llama2-13b-mindir/full_graph.mindir"]
    decode_model: ["/path/llama2-13b-mindir/inc_graph.mindir"]
    argmax_model: "/path/post_process/argmax.mindir"
    topk_model: "/path/post_process/topk.mindir"
    prefill_ini: ['/path/llma2_13b_pa_dyn_prefill.ini']
    decode_ini: ['/path/llma2_13b_pa_dyn_decode.ini']
    post_model_ini: '/path/post_config.ini'

model_config:
    model_name: 'llama_7b'
    max_generate_length: 4096
    end_token: 2
    seq_length: [4096]
    vocab_size: 32000
    prefill_batch_size: [1]
    decode_batch_size: [1]
    zactivate_len: [512, 1024, 2048, 4096]
    model_type: 'dyn'
    seq_type: 'static'
    batch_waiting_time: 0.0
    decode_batch_waiting_time: 0.0
    batching_strategy: 'continuous'
    current_index: False
    page_attention: True
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0
    backend: 'kbk'  # 'ge'
    model_cfg_path: 'checkpoint_download/llama/llama_7b.yaml'

serving_config:
    agent_ports: [14002]
    start_device_id: 7
    server_ip: 'localhost'
    server_port: 19291

pa_config:
    num_blocks: 2048
    block_size: 16
    decode_seq_length: 4096

tokenizer:
    type: LlamaTokenizer
    vocab_file: '/home/wsc/llama/tokenizer/tokenizer.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
```

#### Set the environment variables as follows

###### Option 1: start with an existing script

```
source /path/to/xxx-serving.sh
```

```
export PYTHONPATH=/path/to/mindformers-ft-2:/path/to/serving/:$PYTHONPATH
```

###### Option 2: Docker image

After downloading the docker image, create a container:

```
# --device controls which NPU cards the container can use
# -v maps directories from outside the container
# --name sets a custom container name
# the value before /bin/bash is the image ID, which can be listed with `docker images`
docker run -it -u root \
    --ipc=host \
    --network host \
    --device=/dev/davinci0 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /etc/localtime:/etc/localtime \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /var/log/npu/:/usr/slog \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    --name {enter a container name} \
    XXX /bin/bash
```

```
export PYTHONPATH=/path/to/mindformers-ft-2:/path/to/serving/:$PYTHONPATH
```
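Before launching, it can help to confirm that the yaml actually contains the fields the agents read. The sketch below is hypothetical and not part of this repository; it assumes `pyyaml` is available in addition to the listed `easydict` dependency, and uses the same placeholder path that is passed to `start.py` in the next step.

```python
# Hypothetical pre-launch sanity check (not part of this repository).
# Assumes pyyaml is installed and the path matches the --config argument below.
import yaml
from easydict import EasyDict

with open("configs/xxx.yaml", "r", encoding="utf-8") as f:
    cfg = EasyDict(yaml.safe_load(f))

print("model:      ", cfg.model_config.model_name)
print("agent ports:", cfg.serving_config.agent_ports)
print("server:     ", f"{cfg.serving_config.server_ip}:{cfg.serving_config.server_port}")

# If PagedAttention is enabled, the pa_config section should be present as well.
if cfg.model_config.get("page_attention", False):
    pa = cfg.get("pa_config", {})
    print("PA blocks:  ", pa.get("num_blocks"), "block size:", pa.get("block_size"))
```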
#### Start

```shell
python examples/start.py --config configs/xxx.yaml  # starts the model agents and then the serving process
```

Startup argument: `config`: the yaml file for the model, refer to [model.yaml](configs/internLM_dyn.yaml)

#### Send requests

Requests are made through `/models/model_name/generate` and `/models/model_name/generate_stream`:

```shell
curl 127.0.0.1:9800/models/llama2/generate \
    -X POST \
    -d '{"inputs":"Hello?","parameters":{"max_new_tokens":55, "do_sample":"False", "return_full_text":"True"}, "stream":"True"}' \
    -H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:9800/models/llama2/generate_stream \
    -X POST \
    -d '{"inputs":"Hello?","parameters":{"max_new_tokens":55, "do_sample":"False", "return_full_text":"True"}, "stream":"True"}' \
    -H 'Content-Type: application/json'
```

Or through the Python API:

```python
from mindspore_serving.client import MindsporeInferenceClient

client = MindsporeInferenceClient(model_type="llama2", server_url="http://127.0.0.1:8080")

# 1. test generate
text = client.generate("what is Monetary Policy?").generated_text
print('text: ', text)

# 2. test generate_stream
text = ""
for response in client.generate_stream("what is Monetary Policy?", do_sample=False, max_new_tokens=200):
    print("response 0", response)
    if response.token:
        text += response.token.text
    else:
        text = response.generated_text
print(text)
```
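The streaming endpoint emits token events via Server-Sent Events (see the features list). If you prefer not to use the client library, the sketch below consumes the raw stream with the `requests` package (not a listed dependency); the exact shape of each event payload is an assumption here, so treat this as a starting point rather than a reference implementation.

```python
# Hypothetical raw SSE consumption of generate_stream (not part of this repository).
# The event payload format is an assumption; prefer MindsporeInferenceClient above.
import requests

url = "http://127.0.0.1:9800/models/llama2/generate_stream"
payload = {
    "inputs": "Hello?",
    "parameters": {"max_new_tokens": 55, "do_sample": "False", "return_full_text": "True"},
    "stream": "True",
}

with requests.post(url, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        # SSE events arrive as lines of the form "data: {...}"; print them as they come in.
        if line.startswith("data:"):
            print(line[len("data:"):].strip())
```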