# lora-fast

**Repository Path**: mirrors_huggingface/lora-fast

## Basic Information

- **Project Name**: lora-fast
- **Description**: Minimal repository to demonstrate fast LoRA inference with Flux family of models.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-25
- **Last Updated**: 2026-06-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# lora-fast

Minimal repository to demonstrate fast LoRA inference with [Flux.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) using different settings that can help with speed or memory efficiency. Please check the accompanying blog post at [this URL](https://huggingface.co/blog/lora-fast).

The included benchmark script allows to experiment with:

- FlashAttention3
- `torch.compile`
- Quantization
- LoRA hot-swapping
- CPU offloading

## Key results

| Option              | Time (s) ⬇️ | Speedup (vs baseline) ⬆️ | Notes                                                                        |
|---------------------|-------------|--------------------------|------------------------------------------------------------------------------|
| baseline            | 7.8910      | –                        | Baseline                                                                     |
| optimized           | 3.5464      | 2.23×                    | Hotswapping \+ compilation without recompilation hiccups (FP8 on by default) |
| no\_fp8             | 4.3520      | 1.81×                    | Same as optimized, but with FP8 quantization disabled                        |
| no\_fa3             | 4.3020      | 1.84×                    | Disable FA3 (flash-attention v3)                                             |
| baseline \+ compile | 5.0920      | 1.55×                    | Compilation on, but suffers from intermittent recompilation stalls           |
| no\_fa3\_fp8        | 5.0850      | 1.55×                    | Disable FA3 and FP8                                                          |
| no\_compile\_fp8    | 7.5190      | 1.05×                    | Disable FP8 quantization and compilation                                     |
| no\_compile         | 10.4340     | 0.76×                    | Disable compilation: the slowest setting                                     |

## Installation

The requirements for this repository are listed in the `requirements.txt`, please ensure they are installed in your Python environment, e.g. by running:

`python -m pip install -r requirements.txt`.

### FlashAttention3

Optionally, use FlashAttention3 for even better performance. This requires a Hopper GPU (e.g. H100). Follow the [install instructions here](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release).

## Running the benchmarks

Run the benchmarks using the provided `run_benchmark.py` script. To check the available arguments, run:

```sh
python run_benchmark.py --help
```

If you want to run a battery of different settings, some shell scripts are provided to achieve that. Use `run_experiments.sh` if you have a server GPU like an H100. Use `run_exps_rtx_4090.sh` if you have a consumer GPU with 24 GB of memory, like an RTX 4090. The benchmark data and sample images are stored by default in the `results/` directory.

## Standalone script

The `inference_lora.py` script implements the optimizations in sequence and it is geared towards an H100. Refer to it for a simpler reference than `run_benchmark.py`. Users should
only refer to this script in case they are not interested in conducting without running
benchmarking.