# llama-intel-arc

**Repository Path**: techwolf/llama-intel-arc

## Basic Information

- **Project Name**: llama-intel-arc
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-15
- **Last Updated**: 2023-12-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# 🦙 llama.cpp for Intel ARC

Run Llama 2 models with llama.cpp on Intel ARC series GPUs (A40/A50/A380/A770).

## 1. Install ARC driver

Intel driver install guide: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html

- BIOS settings
  - Above 4G Decoding -> Enabled
  - Re-Size BAR Support -> Enabled
- Add the package repository

```shell
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/graphics/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo 'deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc' | \
  sudo tee /etc/apt/sources.list.d/intel.gpu.jammy.list
```

- Install DKMS kernel modules

```shell
sudo apt-get install -y intel-platform-vsec-dkms intel-platform-cse-dkms
sudo apt-get install -y intel-i915-dkms intel-fw-gpu
```

- Install run-time packages

```shell
sudo apt-get install -y gawk libc6-dev udev \
  intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo
```

## 2. Install oneAPI

- Install the Intel oneAPI Base Toolkit

```shell
sudo apt install intel-basekit
```

- Source the environment

```shell
export ONEAPI_ROOT=/opt/intel/oneapi
export DPCPPROOT=${ONEAPI_ROOT}/compiler/latest
export MKLROOT=${ONEAPI_ROOT}/mkl/latest
export IPEX_XPU_ONEDNN_LAYOUT=1
source ${ONEAPI_ROOT}/setvars.sh > /dev/null
```

## 3. Install CLBlast

Build the library:

```
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
mkdir build
cd build
cmake .. -DOPENCL_INCLUDE_DIRS="/opt/intel/oneapi/compiler/latest/linux/include/sycl" -DOPENCL_LIBRARIES="/opt/intel/oneapi/compiler/latest/linux/lib/libOpenCL.so"
cmake --build . --config Release
```

Copy the oneAPI resources and set up a CLBlast environment directory. The llama.cpp build in the next step will use this directory, so note the paths to its lib/ and include/ folders.

```
mkdir env
cd env
mkdir lib
# Copy the oneAPI OpenCL libraries and headers (cp needs -r for directories)
cp -r /opt/intel/oneapi/compiler/latest/linux/lib/* ./lib/
mkdir include
cp -r /opt/intel/oneapi/compiler/latest/linux/include/* ./include/
# CLBlastConfig.cmake below expects the built libclblast.so under lib/
# (and the CLBlast headers under include/), so copy those build outputs here too
mkdir cmake
cd cmake
gedit CLBlastConfig.cmake
```

Copy and paste the following content into CLBlastConfig.cmake:

```shell
set(CLBLast_INCLUDE_DIRS "${CMAKE_CURRENT_LIST_DIR}/../include")
set(CLBLast_LIBRARIES "${CMAKE_CURRENT_LIST_DIR}/../lib/libclblast.so")
```

## 4. Install llama.cpp

Build the binary and configure the environment. Point CMAKE_PREFIX_PATH at the env directory you created in step 3 (the path below is just an example from our setup):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH="/home/taicun/Desktop/dev/llama/CLBlast/env"
cmake --build . --config Release
```
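As an optional sanity check before downloading a model, you can confirm that the freshly built binary actually picked up CLBlast and that the OpenCL runtime sees the Arc GPU. This is only a rough sketch: it assumes you are still inside llama.cpp/build, that CLBlast was built as a shared library, and that the `clinfo` utility is installed (`sudo apt install clinfo`).

```shell
# Check that the main binary is dynamically linked against CLBlast
ldd bin/main | grep -i clblast

# List OpenCL platforms and devices; the Arc card should appear here
clinfo -l
```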
## 5. Download Model

Download a Llama 2 GGUF model from Hugging Face:

```shell
mkdir model
cd model
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
```

Note: make sure your Intel ARC card has more VRAM than the chosen quantization requires. In this example we use llama-2-7b-chat.Q4_0.gguf on an ARC A40 (6 GB VRAM).

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ------------ | ---- | ---- | ---------------- | -------- |
| [llama-2-7b-chat.Q2_K.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q2_K.gguf) | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| [llama-2-7b-chat.Q3_K_S.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q3_K_S.gguf) | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| [llama-2-7b-chat.Q3_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q3_K_M.gguf) | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| [llama-2-7b-chat.Q3_K_L.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q3_K_L.gguf) | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| [llama-2-7b-chat.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_0.gguf) | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| [llama-2-7b-chat.Q4_K_S.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_S.gguf) | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf) | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| [llama-2-7b-chat.Q5_0.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q5_0.gguf) | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| [llama-2-7b-chat.Q5_K_S.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q5_K_S.gguf) | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| [llama-2-7b-chat.Q5_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q5_K_M.gguf) | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| [llama-2-7b-chat.Q6_K.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q6_K.gguf) | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| [llama-2-7b-chat.Q8_0.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q8_0.gguf) | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |
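The clone above fetches every quantization in the repository, which is a large download. If you only need the single file used in this guide, one possible shortcut (a sketch, not part of the original procedure) is to skip the LFS smudge on clone and then pull just that file:

```shell
# Clone the repository metadata only, without downloading the large model files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
cd Llama-2-7b-Chat-GGUF

# Fetch only the Q4_0 quantization from the table above
git lfs pull --include "llama-2-7b-chat.Q4_0.gguf"
```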
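The VRAM figures above only matter once model layers are actually offloaded to the card; llama.cpp's `main` controls this with `--n-gpu-layers` (`-ngl`), and with the default of 0 the layers stay in system RAM, as the sample log in section 7 shows ("offloaded 0/33 layers to GPU"). Below is a minimal sketch of the interactive command from the next section with offloading enabled; the model path and the `-ngl` value are illustrative, so adjust them to your download location and VRAM size.

```shell
# Hypothetical variant of the section 6 command with 33 layers offloaded to the GPU.
# Lower -ngl if you hit out-of-memory errors on smaller cards.
./main -m ./model/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_0.gguf \
  -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
  -f ./prompt/chat-with-iei.txt -ngl 33
```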
## 6. Run llama.cpp

Set up the system prompt:

```shell
mkdir prompt
cd prompt
echo "Transcript of a dialog, where the User interacts with an Assistant named iEi. \
iEi is helpful, kind, honest, good at writing, \
and never fails to answer the User's requests immediately and with precision." > chat-with-iei.txt
```

Run llama.cpp in interactive mode (adjust the model and prompt paths to your own layout):

```
cd bin
./main -m ./model/test-llama2-7b-q4/llama-2-7b-32k-instruct.Q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompt/chat-with-iei.txt
```

## 7. Result

Our test results:

- HW: run on Intel ARC Pro A50

![image](./image/1.png)

- HW: run on Intel ARC A380

```shell
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Arc(TM) A380 Graphics'
ggml_opencl: device FP16 support: true
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from models/Llama-2-7B-32K-Instruct-GGUF/llama-2-7b-32k-instruct.Q4_0.gguf (version GGUF V2 (latest))
....
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 3647.96 MB (+ 256.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: VRAM used: 0 MB
..................................................................................................
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.97 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
```

- Q&A correlation test

![image](./image/2.png)

- llama2-7B-Q4 model on Intel Pro A50 speed test

![video](./image/3.mp4)

- token log: [token.log](./image/4.txt)

- validation on llama2-chinese-7B-Q4

![image](./image/5.png)

- performance

```
llama_print_timings:        load time =     354.90 ms
llama_print_timings:      sample time =      16.90 ms /   268 runs   (    0.06 ms per token, 15854.24 tokens per second)
llama_print_timings: prompt eval time =   18657.43 ms /   151 tokens (  123.56 ms per token,     8.09 tokens per second)
llama_print_timings:        eval time =   52514.95 ms /   267 runs   (  196.69 ms per token,     5.08 tokens per second)
llama_print_timings:       total time = 5815420.87 ms
```

## Disclaimer

Read the [disclaimer statement](disclaimer.md).