# ai-llm

**Repository Path**: ysite/ai-llm

## Basic Information

- **Project Name**: ai-llm
- **Description**: test
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: disaggregation
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-16
- **Last Updated**: 2025-07-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

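## Enabling RDMA zero-copy transfer
The nvidia-peermem module is not loaded automatically by default; load it manually with `sudo modprobe nvidia_peermem`.
Run `lsmod | grep nvidia_peermem` to check that the module loaded correctly.
The nvidia-peermem module is compatible with NVIDIA GPUs and supported third-party peer devices (such as Mellanox RDMA NICs).
nvidia-peermem is a key component of NVIDIA GPUDirect RDMA: it lets peer devices such as RDMA NICs access GPU memory directly, which makes efficient direct transfers between GPUs possible.
Mooncake uses (GPUDirect) RDMA to provide the following features:
+ Data is transferred directly from the initiator's DRAM/VRAM to the target's DRAM/SSD
+ A zero-copy mechanism that bypasses the CPU entirely
+ Aggregated use of multiple NICs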
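
The `lsmod` check can also be scripted, for example in a startup health check. The sketch below reads `/proc/modules` (the file that `lsmod` formats) and looks for `nvidia_peermem`. The function names are illustrative and not part of any NVIDIA or Mooncake tooling.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return the module names listed in /proc/modules-style text."""
    # Each non-empty line starts with the module name.
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def peermem_loaded(proc_modules: str = "/proc/modules") -> bool:
    """True if nvidia_peermem appears in the kernel's loaded-module list."""
    path = Path(proc_modules)
    if not path.exists():  # not Linux, or /proc is not mounted
        return False
    return "nvidia_peermem" in loaded_modules(path.read_text())
```

If `peermem_loaded()` returns `False` on the host, run the `modprobe` command above and check again.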

## Installing mooncake-transfer-engine
> pip install --break-system-packages mooncake-transfer-engine -i https://pypi.tuna.tsinghua.edu.cn/simple
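
To confirm that the install landed in the interpreter that will run the services, the package metadata can be queried from Python. This is a minimal sketch; it assumes the distribution name is the one used in the pip command above, and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "mooncake-transfer-engine") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A `None` result usually means pip installed into a different Python environment than the one being checked.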

## Running prefill/decode (PD) disaggregation
### Start the container in privileged mode
> docker run --rm -itd  --privileged -v /data/models/:/home/models/ --tmpfs /dev/shm:size=20G --gpus all --cpus=4 --memory=20GB ai-runtime-infer:torch27-pd-0619 /bin/bash
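
Inside the container, it may be worth checking that the `--tmpfs /dev/shm:size=20G` option took effect. The sketch below reports the size of a mount point with `shutil.disk_usage`; the helper names and the 1% tolerance are assumptions made for illustration.

```python
import shutil

GIB = 1024 ** 3


def total_gib(path: str = "/dev/shm") -> float:
    """Total size in GiB of the filesystem that holds `path`."""
    return shutil.disk_usage(path).total / GIB


def shm_large_enough(required_gib: float = 20.0, path: str = "/dev/shm") -> bool:
    """True if the filesystem at `path` is at least `required_gib` GiB (1% tolerance)."""
    return total_gib(path) >= required_gib * 0.99
```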

### Run the prefill service
> CUDA_VISIBLE_DEVICES=0 python3 -m ace.launch_server \
  --model-path /home/models/Qwen2.5-14B-Instruct \
  --port 7000 \
  --host 0.0.0.0 \
  --context-length 16384 \
  --tp-size 1 \
  --gpu-mem-use 0.8 \
  --enable-mixed-chunk \
  --chunked-prefill-size 4096 \
  --pd-mode prefill \
  --pd-transfer-backend mooncake \
  --pd-ib-device mlx5_bond_0 \
  --trust-remote-code \
  --warmups 3 
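
The value passed to `--pd-ib-device` has to name an RDMA device that is visible inside the container. On Linux these devices are listed under `/sys/class/infiniband`, so a quick check can be sketched as follows. `rdma_devices` is a hypothetical helper, and `mlx5_bond_0` is simply the device name used in the command above; yours may differ.

```python
from pathlib import Path


def rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the RDMA device names found under the sysfs root, sorted."""
    root = Path(sysfs_root)
    if not root.is_dir():  # no RDMA stack, or sysfs is not mounted
        return []
    return sorted(entry.name for entry in root.iterdir())


def has_rdma_device(name: str, sysfs_root: str = "/sys/class/infiniband") -> bool:
    """True if `name` (for example "mlx5_bond_0") is a visible RDMA device."""
    return name in rdma_devices(sysfs_root)
```

An empty list from `rdma_devices()` suggests the container cannot see the host's RDMA NICs.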


### Run the decode services; when running several, give each a distinct port to avoid conflicts
> CUDA_VISIBLE_DEVICES=5 python3 -m ace.launch_server \
  --model-path /home/models/Qwen2.5-14B-Instruct \
  --port 7001 \
  --host 0.0.0.0 \
  --context-length 16384 \
  --tp-size 1 \
  --gpu-mem-use 0.85 \
  --enable-mixed-chunk \
  --pd-mode decode \
  --pd-transfer-backend mooncake \
  --pd-ib-device mlx5_bond_0 \
  --trust-remote-code \
  --warmups 3 

> CUDA_VISIBLE_DEVICES=6 python3 -m ace.launch_server \
  --model-path /home/models/Qwen2.5-14B-Instruct \
  --port 7002 \
  --host 0.0.0.0 \
  --context-length 16384 \
  --tp-size 1 \
  --gpu-mem-use 0.85 \
  --enable-mixed-chunk \
  --pd-mode decode \
  --pd-transfer-backend mooncake \
  --pd-ib-device mlx5_bond_0 \
  --trust-remote-code \
  --warmups 3 
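
Port conflicts can be caught before launching by trying to bind each port. The sketch below is one way to do that with the standard `socket` module; the helper names are illustrative, and the port list in the usage note just mirrors the commands in this README.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # address already in use, or not permitted
            return False
    return True


def busy_ports(ports: list[int], host: str = "0.0.0.0") -> list[int]:
    """Return the subset of `ports` that cannot be bound right now."""
    return [p for p in ports if not port_is_free(p, host)]
```

For example, `busy_ports([7000, 7001, 7002, 8000, 9898])` lists which of the ports used in this README are already taken. The check is only a snapshot: another process can still grab a port between the check and the launch.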


## Run the load balancer (sglang mini_lb); separate multiple addresses with commas
> python3 -m sglang.srt.disaggregation.mini_lb \
  --prefill http://127.0.0.1:7000 \
  --prefill-bootstrap-ports 9898 \
  --decode http://127.0.0.1:7001,http://127.0.0.1:7002 \
  --host 0.0.0.0 \
  --port 8000

## Run the load balancer (ace launch_lb); multiple addresses are separated by spaces, as in the command below
> python3 -m ace.llm.decoupled.launch_lb \
  --rust-lb \
  --prefill http://127.0.0.1:7000 \
  --prefill-bootstrap-ports 9898 \
  --decode http://127.0.0.1:7001 http://127.0.0.1:7002 \
  --host 0.0.0.0 \
  --port 8000
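
The two load balancer commands above write the decode addresses differently: the `mini_lb` command joins them with commas, while the `launch_lb` command passes them as separate arguments. When generating these commands from a script, the difference can be captured as in this sketch. `decode_args` and its `style` values are hypothetical names, and the behavior is inferred only from the commands as written in this README.

```python
import shlex


def decode_args(urls: list[str], style: str) -> list[str]:
    """Build the --decode arguments in the form each launcher command uses."""
    if style == "comma":  # form used in the mini_lb command above
        return ["--decode", ",".join(urls)]
    if style == "space":  # form used in the launch_lb command above
        return ["--decode", *urls]
    raise ValueError(f"unknown style: {style!r}")


def render(args: list[str]) -> str:
    """Render an argument list as a shell-safe string."""
    return shlex.join(args)
```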

## Test script
> curl -X POST 'http://localhost:8000/v1/chat/completions' -H 'Content-Type: application/json' -d '{"messages":[{"content":"介绍下西湖十景","role":"user"}],"model":"qwen","stream":false,"temperature":0.7}'
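
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl payload above. It assumes the response follows the OpenAI-style shape (`choices[0].message.content`), which this README does not confirm; `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000",
                       model: str = "qwen", temperature: float = 0.7) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "messages": [{"content": prompt, "role": "user"}],
        "model": model,
        "stream": False,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0, **kwargs) -> str:
    """Send the request to the load balancer and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the prefill, decode, and load balancer services running, `print(chat("介绍下西湖十景"))` sends the same prompt as the curl command.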