# ChatGLM3 ManualReset

**Repository Path**: wxjch2002/chatglm3-manual-reset

## Basic Information

- **Project Name**: ChatGLM3 ManualReset
- **Description**: ChatGLM3 deployed on the Orange Pi AIpro
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 5
- **Created**: 2024-12-07
- **Last Updated**: 2024-12-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ChatGLM3 ManualReset 

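## Introduction

This project deploys the ChatGLM3-6B large language model on an Orange Pi board.

It is based on [ascend-llm](https://gitee.com/yinghuo302/ascend-llm). The original project was led by Du Cheng (杜骋), a student in the Department of Computer Science and Technology at Nanjing University, under the guidance of Zhu Guanghui (朱光辉), with technical support from the Ascend CANN team.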

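## Key Techniques
- Static graph scheme

    Because Transformer inference is autoregressive, a KV cache is widely used to speed it up. The KV cache stores the key and value matrices computed in previous steps so that the current step can reuse them, which greatly reduces the amount of computation.
    
    The cached KV matrices must be concatenated with the KV matrices computed from the current input token, so the full KV matrix grows with every step. The model's input shapes are therefore not fixed, inference falls back to the dynamic-shape path, and the time spent compiling operators severely degrades performance.
    
    Building on the original dynamic-graph scheme, this project fixes the KV matrices to a maximum length and uses attention_mask to mask out the unused positions of the input sequence, which yields a static graph. When the KV cache reaches its limit, it is cleared explicitly so that the model can keep running inference.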

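- Quantization scheme

    Large-model weights are too big to fit within the memory limits of most edge devices, so they are commonly quantized from fp16 to int8, or even int4, to reduce memory consumption.

    This project uses the SmoothQuant scheme, which quantizes both weights and activations to int8, significantly reducing memory use and improving inference speed.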
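
The idea behind SmoothQuant can be illustrated with a minimal NumPy sketch. It is not this project's implementation: the function names (`smooth`, `quantize_int8`, `w8a8_matmul`), the per-tensor symmetric quantization, and the default `alpha=0.5` are assumptions chosen for brevity. The sketch rescales each input channel so that activation outliers are moved into the weights, then quantizes both operands to int8 and accumulates in int32.

```python
import numpy as np


def smooth(x, w, alpha=0.5, eps=1e-5):
    """Rescale each input channel so activation outliers move into the weights.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    The product is unchanged: (x / s) @ (s[:, None] * w) == x @ w.
    """
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(w).max(axis=1), eps)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None]


def quantize_int8(t, eps=1e-8):
    """Symmetric per-tensor int8 quantization (an assumption of this sketch)."""
    scale = max(float(np.abs(t).max()), eps) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_matmul(x, w, alpha=0.5):
    """Approximate x @ w with int8 weights and int8 activations."""
    xs, ws = smooth(x, w, alpha)
    xq, sx = quantize_int8(xs)
    wq, sw = quantize_int8(ws)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)  # integer accumulation
    return acc.astype(np.float64) * (sx * sw)        # dequantize the result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 64))
    x[:, 3] *= 50.0                      # one outlier activation channel
    w = rng.normal(size=(64, 32)) * 0.1
    ref = x @ w
    err = np.linalg.norm(w8a8_matmul(x, w) - ref) / np.linalg.norm(ref)
    print(f"relative error of the int8 product: {err:.4f}")
```

Here `alpha` controls how much of the quantization difficulty is shifted from activations to weights; real deployments usually calibrate the per-channel maxima offline on sample data rather than on the live input as done above.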

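## Quick Start
### Steps
1. Download the [Orange Pi 0318 image](https://www.hiascend.com/forum/thread-0231149828762292018-1-1.html), flash it to an SD card, and boot the board. See the [Orange Pi AIpro quick start guide](https://www.hiascend.com/forum/thread-0260140249549075069-1-1.html) for details.
2. Log in as root and clone this repository into a directory of your choice: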
    ```
    git clone https://gitee.com/wan-zutao/chatglm3-manual-reset.git chatglm
    cd chatglm/inference
    ```
3. Run download.sh to download the model and tokenizer files:
    ```
    bash download.sh
    ```
4. Start the program:
    ```
    python3 main.py
    ```
5. Open a browser and enter the URL printed in the terminal to start the conversation.
    ![Screenshot 1](inference/picture/%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_1.png)
    ![Screenshot 2](inference/picture/%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_2.png)

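## Model Inference
### Environment Setup
1. CANN installation: flash the latest Orange Pi image. An SD card of at least 64 GB is recommended.
2. Environment variables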
    ```
    export TE_PARALLEL_COMPILER=1
    export MAX_COMPILE_CORE_NUMBER=1
    ```
3. Custom operator deployment
    - Install protoc
    ```
    # Install protoc 3.13.0; download it into any free directory
    wget  https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/wanzutao/tiny-llama/protobuf-all-3.13.0.tar.gz
    tar -zxvf protobuf-all-3.13.0.tar.gz
    cd protobuf-3.13.0
    apt-get update
    apt-get install autoconf automake libtool
    ./autogen.sh 
    ./configure
    make -j4
    make install
    sudo ldconfig
    protoc --version # check the installed version
    ```

    - Build and deploy the operator
    ```
    # Copy ./custom_op/matmul_integer_plugin.cc to the plugin directory
    cp custom_op/matmul_integer_plugin.cc /usr/local/Ascend/ascend-toolkit/latest/tools/msopgen/template/custom_operator_sample/DSL/Onnx/framework/onnx_plugin/
    cd /usr/local/Ascend/ascend-toolkit/latest/tools/msopgen/template/custom_operator_sample/DSL/Onnx 
    ```
    Open build.sh, find the following four environment variables, uncomment them, and set them as follows:
    ```
    export ASCEND_TENSOR_COMPILER_INCLUDE=/usr/local/Ascend/ascend-toolkit/latest/include
    export TOOLCHAIN_DIR=/usr
    export AICPU_KERNEL_TARGET=cust_aicpu_kernels
    export AICPU_SOC_VERSION=Ascend310B4
    ```
    Build and install:
    ```
    ./build.sh 
    cd build_out/
    ./custom_opp_ubuntu_aarch64.run
    # The files are generated under customize in the default directory $ASCEND_PATH/opp/vendors/; remove the redundant ones
    cd $ASCEND_PATH/opp/vendors/customize
    rm -rf op_impl/ op_proto/
    ```


4. Install dependencies

    From the project root directory:
   
    ```
    cd inference
    pip install -r requirements.txt -i https://mirrors.huaweicloud.com/repository/pypi/simple
    ```
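
Before moving on, the setup above can be sanity-checked with a small script. This is only a sketch: the helper name `check_environment` is hypothetical, and it assumes that `protoc` and `atc` are expected to be on `PATH` once the steps above are complete. The variable values are the ones given in step 2.

```python
import os
import shutil

# Values taken from the environment-variable step above.
REQUIRED_ENV = {"TE_PARALLEL_COMPILER": "1", "MAX_COMPILE_CORE_NUMBER": "1"}
# Command-line tools used in this guide (assumed to be on PATH).
REQUIRED_TOOLS = ("protoc", "atc")


def check_environment(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "Environment looks ready.")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.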

### Running Inference
1.  Download the tokenizer files
    ```
    cd tokenizer
    wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/wanzutao/chatglm3/tokenizer.zip
    unzip tokenizer.zip   
    ```
2.  Get the ONNX model files
    ```
    cd ../model
    wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/wanzutao/chatglm3/chatglm.onnx
    wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/wanzutao/chatglm3/f14e3fa0-2c77-11ef-ad6e-cc64a6d0938f
    ```
3.  Convert the model with atc (at least 4 GB of swap must be enabled; see [enabling swap](https://www.hiascend.com/forum/thread-0238118740068545008-1-1.html))
    ```
    atc --framework=5 --model="chatglm.onnx" --output="chatglm" --soc_version=Ascend310B4 --precision_mode=must_keep_origin_dtype --input_format=ND --input_shape="input_ids:1,1;position_ids:1,1;attention_mask:1,1025;past_key_values:28,2,1024,1,2,128"
    ```
4.  Start the script
    ```
    cd ..
    python3 main.py
    ```
5.  Open a browser and enter the URL printed in the terminal into the address bar.
    ![Screenshot 1](inference/picture/%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_1.png)
6.  Type in the chat box to start talking to the model.
    ![Screenshot 2](inference/picture/%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_2.png)
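
The fixed shapes passed to `atc` in step 3 (`attention_mask:1,1025` and `past_key_values:28,2,1024,1,2,128`) reflect the static-graph scheme: 1024 cache slots plus one position for the current token. The sketch below shows how such a fixed-size buffer could be maintained and cleared on the host side. It is an illustration, not the code in `main.py`: the class `StaticKVCache`, the axis names, the left-aligned slot layout, and the mask convention (1 = attend, 0 = masked) are all assumptions.

```python
import numpy as np

# Axis sizes copied from the atc --input_shape argument; the axis names are assumed.
NUM_LAYERS, KV, MAX_LEN, BATCH, GROUPS, HEAD_DIM = 28, 2, 1024, 1, 2, 128


class StaticKVCache:
    """Fixed-shape KV buffer that is cleared explicitly once it is full."""

    def __init__(self, dtype=np.float16):
        self.kv = np.zeros(
            (NUM_LAYERS, KV, MAX_LEN, BATCH, GROUPS, HEAD_DIM), dtype=dtype
        )
        self.length = 0  # number of valid cached positions

    def is_full(self):
        return self.length >= MAX_LEN

    def reset(self):
        """Manual reset: drop all cached positions so inference can continue."""
        self.kv.fill(0)
        self.length = 0

    def append(self, new_kv):
        """Store the KV entries of one token; new_kv has a sequence axis of size 1."""
        if self.is_full():
            self.reset()
        self.kv[:, :, self.length:self.length + 1] = new_kv
        self.length += 1

    def attention_mask(self):
        """Mask of shape (1, MAX_LEN + 1): valid cache slots plus the current token."""
        mask = np.zeros((BATCH, MAX_LEN + 1), dtype=np.int64)
        mask[:, :self.length] = 1  # cached positions that hold real data
        mask[:, -1] = 1            # the token being processed in this step
        return mask


if __name__ == "__main__":
    cache = StaticKVCache()
    print("cache size in MiB:", cache.kv.nbytes / 2**20)
    cache.append(np.ones((NUM_LAYERS, KV, 1, BATCH, GROUPS, HEAD_DIM), dtype=np.float16))
    print("attended positions:", int(cache.attention_mask().sum()))
```

Because the buffer and the mask always have the same shape, the compiled model never sees a new input shape; only the mask contents change from step to step.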