# AX-LLM

![GitHub License](https://img.shields.io/github/license/AXERA-TECH/ax-llm)

| Platform | Build Status |
| -------- | ------------ |
| AX650    | ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/AXERA-TECH/ax-llm/build_650.yml) |

## Introduction

**AX-LLM** is developed and maintained by **[Axera](https://www.axera-tech.com/)**. The project explores the feasibility and capability boundaries of deploying popular **LLMs (Large Language Models)** on Axera's existing chip platforms, so that community developers can quickly evaluate the chips and build their own **LLM applications** on top of them.

### Branches

- [ax-context (default)](https://github.com/AXERA-TECH/ax-llm/tree/ax-context): run LLMs on an AX650A/AX650N/AX8850/AX630C host
- [ax-internvl](https://github.com/AXERA-TECH/ax-llm/tree/ax-internvl): run the InternVL series on an AX650A/AX650N/AX8850/AX630C host
- [axcl-context](https://github.com/AXERA-TECH/ax-llm/tree/axcl-context): run LLMs from the host machine driving an AX650N/AX8850 EP card
- [axcl-internvl](https://github.com/AXERA-TECH/ax-llm/tree/axcl-internvl): run the InternVL series from the host machine driving an AX650N/AX8850 EP card

### Supported Chips

- AX650A/AX650N
  - SDK ≥ v3.0.0
- AX630C
  - SDK ≥ v3.0.0

### Supported Models

- Qwen2.5
- Qwen3
- MiniCPM
- SmolLM2
- Llama3

### Model Downloads

Our ModelZoo has moved to [Huggingface](https://huggingface.co/AXERA-TECH), for example:

- [Qwen2.5-7B-Instruct](https://huggingface.co/AXERA-TECH/Qwen2.5-7B-Instruct)
- [Qwen2.5-1.5B-Instruct](https://huggingface.co/AXERA-TECH/Qwen2.5-1.5B-Instruct)

## Building from Source

- Clone this repository:

```shell
git clone --recursive https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
```

- Read `build.sh` carefully, set the `BSP_MSP_DIR` variable in it correctly, then run the build script:

```shell
./build.sh
```

- After a successful build, the `build/install/` directory contains the following (`tree install`):
```
install
└── bin
    ├── gradio_demo.py
    ├── main
    ├── main_api
    └── qwen2.5_tokenizer_uid.py
```

Here `main` is the same binary that the Huggingface repositories ship as `main_ax650`.

## Running the Examples

### Qwen2.5-1.5B-Instruct

#### Start the context-aware tokenizer server

```shell
python qwen2.5_tokenizer_uid.py
Server running at http://127.0.0.1:12345
```

#### Run the command-line LLM

```shell
./run_qwen2.5_1.5b_ctx_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][  34]: connect http://127.0.0.1:12345 ok
[I][ Init][  57]: uid: 4bba0928-fada-4329-903e-3b6e52d68791
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ | 31 / 31 [18.94s<18.94s, 1.64 count/s]
init post axmodel ok,remain_cmm(1464 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 193]: kv_cache_size : 256, kv_cache_num: 2559
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 209]: prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 271]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 308]: input_num_token:21
[I][ main][ 230]: precompute_len: 21
[I][ main][ 231]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
```
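The `load_config` output above shows the decoding settings this build runs with: top-k sampling and temperature are enabled (`top_k: 10`, `temperature: 0.9`), while top-p and the repetition penalty are disabled. As a generic illustration of what those two knobs do at decode time, here is a minimal Python sketch; it is not the project's actual C++ sampler:

```python
import math
import random

def top_k_sample(logits, top_k=10, temperature=0.9, rng=random):
    """Pick one token id from `logits` using top-k sampling with temperature.

    Generic illustration of the "enable_top_k_sampling" + "enable_temperature"
    settings above; not the sampler ax-llm actually ships.
    """
    # Keep only the top_k candidate token ids with the highest logits.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax over the surviving candidates
    # (subtract the max for numerical stability).
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Draw one token id according to the renormalised distribution.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `top_k=1` this degenerates to greedy decoding; raising `temperature` flattens the distribution over the surviving candidates, making low-probability tokens more likely.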
An interactive session then starts:

```shell
prompt >> hello,my name is allen,who are you
[I][ SetKVCache][ 531]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:18
[I][ SetKVCache][ 534]: current prefill_max_token_num:1920
[I][ Run][ 660]: input token num : 18, prefill_split_num : 1
[I][ Run][ 686]: input_num_token:18
[I][ Run][ 829]: ttft: 539.49 ms
Hello Allen! I'm sorry, but I'm an AI language model and I don't have a name. I'm just here to help you with any questions or information you need. How can I assist you today?

[N][ Run][ 943]: hit eos,avg 10.80 token/s

[I][ GetKVCache][ 500]: precompute_len:83, remaining:1965

prompt >> 我叫什么名字
[I][ SetKVCache][ 531]: prefill_grpid:2 kv_cache_num:512 precompute_len:83 input_num_token:12
[I][ SetKVCache][ 534]: current prefill_max_token_num:1920
[I][ Run][ 660]: input token num : 12, prefill_split_num : 1
[I][ Run][ 686]: input_num_token:12
[I][ Run][ 829]: ttft: 538.67 ms
你的名字是Allen。

[N][ Run][ 943]: hit eos,avg 10.57 token/s

[I][ GetKVCache][ 500]: precompute_len:100, remaining:1948
```

#### Run the API server and the Gradio demo

##### Start the server

```shell
./run_qwen2.5_1.5b_ctx_ax650_api.sh
[I][ Init][ 110]: LLM init start
[I][ Init][  34]: connect http://10.126.33.124:12345 ok
[I][ Init][  57]: uid: 13c64c2a-9b4e-4875-91f4-fa9f426e3726
bos_id: -1, eos_id: 151645
  3% | ██ | 1 / 31 [0.15s<4.77s, 6.49 count/s]
tokenizer init ok
[I][ Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [2.97s<2.97s, 10.44 count/s]
init post axmodel ok,remain_cmm(1464 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 193]: kv_cache_size : 256, kv_cache_num: 2559
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 209]: prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
```
```json
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}
```

```shell
[I][ Init][ 218]: LLM init ok
Server running on port 8000...
```

Get the board's IP address and update it in the Gradio script:

```python
import time
import gradio as gr
import requests
import json

# Base URL of your API server; adjust host and port as needed
API_URL = "http://x.x.x.x:8000"
...
```

Run `gradio_demo.py`:

```shell
python gradio_demo.py
/home/axera/ax-llm/scripts/gradio_demo.py:102: UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
  chatbot = gr.Chatbot(elem_id="chatbox", label="Axera Chat",height=500)
* Running on local URL: http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
```

![](scripts/gradio_demo.png)

## Reference

- [Qwen](https://huggingface.co/Qwen)

## Technical Discussion

- Github issues
- QQ group: 139953715
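The API server started above can also be scripted directly, without Gradio. The sketch below shows the general shape of such a client; note that the `/generate` route and the `{"prompt": ...}` payload here are assumptions for illustration only, so check `main_api` and `gradio_demo.py` for the actual routes and request format:

```python
import json
import os
import urllib.request

def build_payload(prompt: str) -> dict:
    # Hypothetical payload shape; check main_api for the real request format.
    return {"prompt": prompt}

def ask(api_url: str, prompt: str, timeout: float = 60.0) -> str:
    """POST `prompt` to the API server and return the raw response body."""
    req = urllib.request.Request(
        api_url.rstrip("/") + "/generate",  # hypothetical route
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # Only send a request when AX_LLM_API_URL points at a live board,
    # e.g. AX_LLM_API_URL=http://x.x.x.x:8000
    url = os.environ.get("AX_LLM_API_URL")
    if url:
        print(ask(url, "hello, who are you?"))
```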