# ComfyUI-ExLlama-Nodes

**Repository Path**: comfyui_custom_nodes/ComfyUI-ExLlama-Nodes

## Basic Information

- **Project Name**: ComfyUI-ExLlama-Nodes
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-19
- **Last Updated**: 2024-07-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# ComfyUI ExLlamaV2 Nodes

A simple local text generator for [ComfyUI](https://github.com/comfyanonymous/ComfyUI) using [ExLlamaV2](https://github.com/turboderp/exllamav2).

## Installation

Clone the repository to `custom_nodes` and install the requirements:

```
git clone https://github.com/Zuellni/ComfyUI-ExLlama-Nodes custom_nodes/ComfyUI-ExLlamaV2-Nodes
pip install -r custom_nodes/ComfyUI-ExLlamaV2-Nodes/requirements.txt
```

On Windows, use the prebuilt wheels for [ExLlamaV2](https://github.com/turboderp/exllamav2/releases/latest) and [FlashAttention](https://github.com/bdashore3/flash-attention/releases/latest):

```
pip install exllamav2-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl
pip install flash_attn-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl
```

## Usage

Only EXL2, 4-bit GPTQ and unquantized models are supported. You can find them on [Hugging Face](https://huggingface.co). To use a model with the nodes, clone its repository with `git` or manually download all of its files and place them in `models/llm`.

For example, to download the 6-bit [Llama-3-8B-Instruct](https://huggingface.co/turboderp/Llama-3-8B-Instruct-exl2), use the following commands:

```
git lfs install
git clone https://huggingface.co/turboderp/Llama-3-8B-Instruct-exl2 -b 6.0bpw models/llm/Llama-3-8B-Instruct-exl2-6.0bpw
```

> [!TIP]
> You can add your own `llm` path to the [extra_model_paths.yaml](https://github.com/comfyanonymous/ComfyUI/blob/master/extra_model_paths.yaml.example) file and put the models there instead, as in the sketch below.
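For reference, a minimal `extra_model_paths.yaml` entry could look like the following sketch. The section name `my_models` and the base path are placeholders, not values required by this repository:

```
# extra_model_paths.yaml, placed in the ComfyUI root directory
# "my_models" and the paths below are placeholders
my_models:
    base_path: /path/to/my_models
    llm: llm    # models would then go in /path/to/my_models/llm
```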
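As an alternative to cloning with `git lfs` as shown above, the same files can be fetched with the `huggingface_hub` Python package. This is only an optional sketch: the package is not listed in this repository's requirements, and the target directory is just an example.

```
# Optional alternative to git clone: download one revision with huggingface_hub.
# Illustrative sketch only, not part of this repository; adjust paths as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="turboderp/Llama-3-8B-Instruct-exl2",
    revision="6.0bpw",  # branch containing the 6-bit weights
    local_dir="models/llm/Llama-3-8B-Instruct-exl2-6.0bpw",
)
```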
## Nodes

| Name | Description |
| :--- | :--- |
| Loader | Loads models from the `llm` directory. |
| &emsp;`cache_bits` | A lower value reduces VRAM usage, but also affects generation speed and quality. |
| &emsp;`fast_tensors` | Enabling reduces RAM usage and speeds up model loading. |
| &emsp;`flash_attention` | Enabling reduces VRAM usage, not supported on cards with compute capability below `8.0`. |
| &emsp;`max_seq_len` | Max context, higher value equals higher VRAM usage. `0` will default to model config. |
| Generator | Generates text based on the given prompt. Refer to SillyTavern for sampler parameters. |
| &emsp;`unload` | Unloads the model after each generation to reduce VRAM usage. |
| &emsp;`stop_conditions` | List of strings to stop generation on, e.g. `["\n"]` to stop on newline. Leave empty to only stop on the `eos` token. |
| &emsp;`max_tokens` | Max new tokens, `0` will use the available context. |
| Previewer | Displays generated text in the UI. |
| Replacer | Replaces variable names in brackets, e.g. `[a]`, with their values, as in the sketch below the table. |
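To illustrate the Replacer behavior described above, here is a small standalone sketch of the bracket-substitution idea. It is not the node's actual implementation, only an approximation of the documented behavior:

```
# Illustration of the documented Replacer behavior, not the node's actual code.
import re

def replace_vars(text: str, values: dict[str, str]) -> str:
    """Replace [name] placeholders with their values; unknown names are left untouched."""
    return re.sub(r"\[(\w+)\]", lambda m: values.get(m.group(1), m.group(0)), text)

print(replace_vars("a portrait of [a] wearing [b]", {"a": "a cat", "b": "a top hat"}))
# -> a portrait of a cat wearing a top hat
```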