TMAC: Look Up Table Based Mixed Precision Computing
====================================================
Author: zhangzhangJ

:::{warning}
This document describes the **TMAC (Look-Up Table Based Mixed Precision Computing)** operator implementation on MACA GPUs.

**Architecture Note:**
* **Theoretical TMAC:** Replaces multiplication entirely with LUT lookups (ideal for CPU AVX-512 / FPGA).
* **GPU TMAC (This Operator):** Implemented as **W4A16 dequantization**. It uses LUTs to decode 4-bit weights on the fly to save 4x bandwidth, while leveraging **Tensor Cores** for high-throughput matrix multiplication.
:::

**TMAC** is a high-performance operator designed to accelerate quantized neural networks. By storing weights in a compressed 4-bit format and decoding them through a **Look-Up Table (LUT)** during computation, it significantly reduces the "memory wall" bottleneck in Large Language Model (LLM) inference.

---

## 1. Operator Function

### Application Scenarios
* **LLM Quantization (W4A16):** The industry-standard approach for deploying 4-bit quantized models (e.g., LLaMA-3, Mixtral) on GPUs. It enables running larger models on devices with limited video RAM (VRAM) and bandwidth.
* **Memory-Bound Linear Layers:** Accelerates fully connected (FC) layers in Transformers, where arithmetic intensity is low and performance is strictly limited by weight-loading speed.

### Core Calculation Logic
The implementation of TMAC varies significantly depending on the underlying hardware architecture.

#### 1. Theoretical Foundation ("True TMAC")

* **Target Hardware:** CPUs (AVX-512), FPGAs, PIM (Processing-in-Memory).
* **Concept:** Eliminate expensive multiplication entirely.
* **Process:**
  1. **Pre-computation:** For an activation $A$, compute a temporary vector containing the products of $A$ with all possible 4-bit weight values ($0..15$).
  2. **Lookup-Accumulate:** Use the 4-bit weight index $B$ to *select* (shuffle) the result from the temporary vector and accumulate it.
  3. **Formula:** $C += \text{LUT}_A[B]$ (no multiplication).

#### 2. GPU Adaptation ("TMAC Variant")

* **Target Hardware:** MACA C500, NVIDIA GPUs.
* **Constraint:** GPUs derive 90%+ of their peak throughput from **Tensor Cores**, which are fixed-function **Matrix-Multiply-Accumulate (MMA)** units; they do not support pure lookup-based computation efficiently.
* **Implementation Strategy (W4A16 Dequantization):** We adopt a hybrid approach: use LUTs to solve the **bandwidth bottleneck**, and use Tensor Cores to solve the **compute bottleneck**.
  1. **Compressed Loading:** Load packed int8 weights (saving 75% of weight bandwidth).
  2. **On-Chip Dequantization:** Use LUTs in shared memory to map 4-bit indices back to FP16 values ($W = \text{LUT}[B]$).
  3. **Tensor Core Compute:** Perform standard matrix multiplication ($C += A \times W$) using the reconstructed weights.

### 💡 Design Rationale: Why a "Variant" on GPU?
Implementing "true TMAC" (pure lookup) on GPUs would force the work onto the slower CUDA cores and bypass the Tensor Cores, resulting in a 10x-50x performance drop. The optimal strategy is therefore hybrid: use LUTs to solve the **bandwidth bottleneck** (W4A16 decoding) while retaining the **compute power** of the Tensor Cores.

---

## 2. Interface Parameters

### Input Parameters

* **`A` (Activations)**
  * **Shape:** `(M, K)`
  * **Data Type:** `float16`
  * **Description:** The input feature matrix.
* **`B_packed` (Compressed Weights)**
  * **Shape:** `(K, N // 2)`
  * **Data Type:** `int8`
  * **Description:** 4-bit quantized weight matrix packed into 8-bit integers. **The physical N is halved** because each byte stores two 4-bit weights.
* **`LUT` (Quantization Codebook)**
  * **Shape:** `(16, )`
  * **Data Type:** `float16`
  * **Description:** A look-up table containing the 16 representative FP16 values for the 4-bit indices ($0..15$).

### Output Parameters

* **`C` (Result)**
  * **Shape:** `(M, N)`
  * **Data Type:** `float16`
  * **Description:** The result of the matrix multiplication $C = A \times \text{Dequant}(B)$.

### Optional Parameters (Configuration)

* **`block_M` / `block_N`**: Tile sizes for the output dimensions (default: 64).
* **`block_K`**: Tile size for the reduction dimension (default: 32).
* **`num_stages`**: Software pipeline depth to hide memory latency (default: 2).
* **`threads`**: Number of threads per block (default: 128).

---

## 3. Usage Example & Verification

The following Python script uses `mc_tilelang` to implement the TMAC kernel.

### Verification Methodology
To ensure the mathematical correctness of the GPU W4A16 kernel, we establish a **ground-truth baseline** using CPU simulation:
1. **CPU Baseline ("True TMAC"):** A pure Python implementation that strictly follows the theoretical $C += A \times \text{LUT}[B]$ logic (lookup-accumulate) in FP32 precision. This serves as the mathematical golden reference.
2. **GPU Target ("TMAC Variant"):** The high-performance W4A16 TileLang kernel.
3. **Comparison:** We verify that the GPU kernel produces results identical to the theoretical pure-LUT logic (within FP16 tolerance).

### Complete Code

```python
import torch
import tilelang
import tilelang.language as T

# =================================================================
# 1. GPU Kernel: TMAC W4A16 Implementation (MACA Optimized)
# =================================================================
def tmac_gpu_w4a16(M, N, K, block_M=64, block_N=64, block_K=32, num_stages=2, threads=128):
    """
    TMAC GPU Kernel.
    Logic: Packed Load -> Bitwise Unpack -> LUT Dequant -> Tensor Core GEMM.
+ """ + # Physical width is half of logical width (4-bit packing) + N_packed = N // 2 + block_N_packed = block_N // 2 + total_steps = (K + block_K - 1) // block_K + + @T.prim_func + def main( + A: T.Tensor((M, K), "float16"), + B_packed: T.Tensor((K, N_packed), "int8"), + LUT: T.Tensor((16, ), "float16"), + C: T.Tensor((M, N), "float16"), + ): + with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by): + # Shared Memory Allocations + A_shared = T.alloc_shared((block_M, block_K), "float16") + B_shared_packed = T.alloc_shared((block_K, block_N_packed), "int8") + B_shared_fp16 = T.alloc_shared((block_K, block_N), "float16") + C_local = T.alloc_fragment((block_M, block_N), "float32") + + T.clear(C_local) + + # Pipelined Loop + for k in T.Pipelined(0, total_steps, num_stages=num_stages): + # 1. Global -> Shared (High Bandwidth Efficiency) + T.copy(A[by * block_M, k * block_K], A_shared) + T.copy(B_packed[k * block_K, bx * block_N_packed], B_shared_packed) + + # 2. TMAC Logic: Unpack & Lookup in Shared Memory + # This prepares valid FP16 data for Tensor Cores + for i, j in T.Parallel(block_K, block_N_packed): + packed_byte = B_shared_packed[i, j] + + # Unpack Lower 4 bits: val & 0x0F + idx_low = packed_byte & 15 + B_shared_fp16[i, 2*j] = LUT[idx_low] + + # Unpack Upper 4 bits: (val >> 4) & 0x0F + # Masking is required to handle sign extension + idx_high = (packed_byte >> 4) & 15 + B_shared_fp16[i, 2*j + 1] = LUT[idx_high] + + # 3. Compute (Tensor Core MMA) + T.gemm(A_shared, B_shared_fp16, C_local) + + # Store Result + T.copy(C_local, C[by * block_M, bx * block_N]) + + return main + +# ================================================================= +# 2. CPU Logic: True TMAC Simulation (For Verification) +# ================================================================= +def tmac_true_logic_cpu(A, B_indices, LUT): + """ + Simulates 'True TMAC' math: Output += A * LUT[Index]. 
    This verifies the mathematical correctness of the Dequant+GEMM approach.
    """
    # 1. Lookup / Shuffle (Vectorized on CPU)
    # B_indices are integers in 0..15
    decoded_W = LUT[B_indices.long()]

    # 2. Accumulate
    # In hardware, this is done without forming the full W matrix
    return torch.matmul(A, decoded_W)

# =================================================================
# 3. Benchmark & Verification
# =================================================================
def run_benchmark():
    M, N, K = 2048, 2048, 2048
    print(f"=== TMAC Benchmark (W4A16): M={M}, N={N}, K={K} ===")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cpu":
        print("Error: No GPU found.")
        return

    torch.manual_seed(0)

    # --- Data Preparation ---
    # 1. Activations
    a_fp16 = torch.randn(M, K, device=device, dtype=torch.float16)

    # 2. Weights (4-bit Indices)
    indices = torch.randint(0, 16, (K, N), device=device, dtype=torch.uint8)

    # 3. LUT (Codebook)
    lut = torch.randn(16, device=device, dtype=torch.float16)

    # 4. Pack Indices for GPU (int8)
    indices_even = indices[:, 0::2]
    indices_odd = indices[:, 1::2]
    b_packed = (indices_even | (indices_odd << 4)).to(torch.int8)

    # --- Compilation ---
    print("Compiling TileLang Kernel...")
    func = tmac_gpu_w4a16(M, N, K)
    # Automatically detects the MACA/CUDA target
    kernel = tilelang.compile(func, out_idx=[3])

    # --- Performance Test ---
    print("Benchmarking GPU Performance...")
    # Warmup
    kernel(a_fp16, b_packed, lut)
    torch.cuda.synchronize()

    profiler = kernel.get_profiler()
    latency = profiler.do_bench()
    tflops = (2 * M * N * K) / (latency * 1e-3) / 1e12
    print(f"Latency: {latency:.3f} ms")
    print(f"Throughput: {tflops:.2f} TFLOPS")

    # --- Correctness Verification ---
    print("\nVerifying against CPU True TMAC Logic...")
    # Run GPU Kernel
    c_gpu = kernel(a_fp16, b_packed, lut)

    # Run CPU Simulation
    c_ref = tmac_true_logic_cpu(a_fp16.float().cpu(), indices.cpu(), lut.float().cpu())

    # Compare
    diff = (c_gpu.float().cpu() - c_ref).abs().max().item()
    print(f"Max Absolute Difference: {diff:.4f}")

    if diff < 1.0:  # FP16 accumulation tolerance
        print("✅ PASS: GPU W4A16 implementation matches True TMAC logic.")
    else:
        print("❌ FAIL: Logic mismatch.")

if __name__ == "__main__":
    run_benchmark()
```

## 4. Performance Analysis

### Recommended Configuration

- **Tile Size:** `block_M=64`, `block_N=64` provides optimal occupancy for mixed-precision workloads on the C500.
- **Pipeline:** `num_stages=2` efficiently hides the latency of the unpacking/lookup operations.
- **Threads:** 128 threads (4 warps) aligns with the shared memory swizzling requirements.

### Optimization Insights

1. **Bandwidth Multiplier:**
   - By loading int8 (packed 4-bit) weights instead of float16, this operator increases the effective weight-loading bandwidth by **4x**. This is the primary speedup factor for memory-bound layers.
2. **L1 Cache Utilization:**
   - The LUT tensor is extremely small (16 FP16 values = 32 bytes).
     On MACA GPUs, it resides permanently in the **L1 constant cache**, making the lookup operation (`LUT[idx]`) virtually zero-cost compared to a global memory access.
3. **Hardware Synergy:**
   - While the kernel performs LUT-based dequantization (soft logic), it feeds the result into the **Tensor Cores** (hard logic). This hybrid approach combines the flexibility of low-bit quantization with the raw throughput of specialized hardware.
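
## Appendix A: Verifying the Packing Convention on CPU

The nibble-packing convention used in `run_benchmark` (even columns in the low nibble, odd columns in the high nibble) and the kernel's sign-extension-safe unpacking can be sanity-checked on CPU, with no GPU or TileLang required. The round-trip below is an illustrative sketch (tensor names are our own, not part of the operator interface); note the `& 15` mask, which is what keeps arithmetic right-shifts on negative `int8` bytes from corrupting the index:

```python
import torch

torch.manual_seed(0)
K, N = 32, 16
indices = torch.randint(0, 16, (K, N), dtype=torch.uint8)  # 4-bit weight indices
lut = torch.randn(16, dtype=torch.float16)                 # quantization codebook

# Pack: even columns -> low nibble, odd columns -> high nibble (as in run_benchmark).
# The .to(torch.int8) reinterprets bytes >= 128 as negative values.
b_packed = (indices[:, 0::2] | (indices[:, 1::2] << 4)).to(torch.int8)

# Unpack exactly as the kernel does; masking with 15 defeats int8 sign extension.
idx_low = b_packed & 15
idx_high = (b_packed >> 4) & 15

recovered = torch.empty(K, N, dtype=torch.uint8)
recovered[:, 0::2] = idx_low.to(torch.uint8)
recovered[:, 1::2] = idx_high.to(torch.uint8)

# Round-trip is lossless, so dequantized weights match a direct LUT gather.
assert torch.equal(recovered, indices)
assert torch.equal(lut[recovered.long()], lut[indices.long()])
```

If the `& 15` after the right shift is dropped, any byte whose high nibble is 8 or larger yields a negative index and the gather fails, which is exactly the sign-extension hazard the kernel comment warns about.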
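
## Appendix B: "True TMAC" Lookup-Accumulate in Miniature

The Section 1 formulation $C += \text{LUT}_A[B]$ can also be made concrete on CPU. In the sketch below (the helper name `true_tmac_matvec` and the sizes are our own, for illustration only), all multiplications happen in a single $K \times 16$ pre-computation; the main loop then performs only gathers and additions, mirroring the CPU/FPGA shuffle-and-accumulate scheme. The result is checked against an ordinary dequantize-then-multiply product:

```python
import torch

def true_tmac_matvec(a, B_idx, lut):
    """Matrix-vector product y[n] = sum_k a[k] * lut[B_idx[k, n]] via lookup-accumulate."""
    # Pre-computation: products of each activation with all 16 codebook entries.
    # These K x 16 products are the only multiplications in the routine.
    precomp = a[:, None] * lut[None, :]            # shape (K, 16)
    # Lookup-accumulate: the 4-bit index selects a precomputed product.
    y = torch.zeros(B_idx.shape[1], dtype=a.dtype)
    for k in range(B_idx.shape[0]):
        y += precomp[k][B_idx[k]]                  # gather, then accumulate
    return y

torch.manual_seed(0)
K, N = 64, 8
a = torch.randn(K, dtype=torch.float64)            # activations
lut = torch.randn(16, dtype=torch.float64)         # codebook
B_idx = torch.randint(0, 16, (K, N))               # 4-bit weight indices

y_tmac = true_tmac_matvec(a, B_idx, lut)
y_ref = a @ lut[B_idx]                             # dequantize-then-multiply baseline
assert torch.allclose(y_tmac, y_ref)
```

This equivalence is precisely what the verification methodology in Section 3 relies on: the GPU W4A16 kernel computes the dequantize-then-multiply form, which is mathematically identical to the pure lookup-accumulate form up to floating-point accumulation order.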