diff --git a/docs/deeplearning_operators/tmac_gpu.md b/docs/deeplearning_operators/tmac_gpu.md
index 18d73fd5ac815a013e1b4341fe87c450abd2fd22..7c285db47f235c3eac446ca3e6fdc9fcf768a36a 100644
--- a/docs/deeplearning_operators/tmac_gpu.md
+++ b/docs/deeplearning_operators/tmac_gpu.md
@@ -1,2 +1,265 @@
TMAC: Look Up Table Based Mixed Precision Computing
====================================================
+
+
+ Author: zhangzhangJ
+
+
+:::{warning}
+
+ This document describes the **TMAC (Look-Up Table Based Mixed Precision Computing)** operator implementation on MACA GPUs.
+
+ **Architecture Note:**
+ * **Theoretical TMAC:** Replaces multiplication entirely with LUT lookups (Ideal for CPU AVX-512 / FPGA).
+ * **GPU TMAC (This Operator):** Implemented as **W4A16 Dequantization**. It uses LUTs to decode 4-bit weights on-the-fly to save 4x bandwidth, while leveraging **Tensor Cores** for high-throughput matrix multiplication.
+:::
+
+**TMAC** is a high-performance operator designed to accelerate quantized neural networks. By storing weights in a compressed 4-bit format and decoding them using a **Look-Up Table (LUT)** during computation, it significantly reduces the "Memory Wall" bottleneck in Large Language Model (LLM) inference.
+
+---
+
+## 1. Operator Function
+
+### Application Scenarios
+* **LLM Quantization (W4A16):** The industry standard for deploying 4-bit quantized models (e.g., LLaMA-3, Mixtral) on GPUs. It enables running larger models on devices with limited Video RAM (VRAM) and bandwidth.
+* **Memory-Bound Linear Layers:** Accelerates Fully Connected (FC) layers in Transformers where arithmetic intensity is low, and performance is strictly limited by weight loading speed.
+
+### Core Calculation Logic
+The implementation of TMAC varies significantly depending on the underlying hardware architecture.
+
+#### 1. Theoretical Foundation ("True TMAC")
+
+* **Target Hardware:** CPUs (AVX-512), FPGAs, PIM (Processing-in-Memory).
+* **Concept:** Eliminate expensive multiplication entirely.
+* **Process:**
+ 1. **Pre-computation:** For an activation $A$, compute a temporary vector containing the product of $A$ with all possible 4-bit weights ($0..15$).
+ 2. **Lookup-Accumulate:** Use the 4-bit weight index $B$ to *select* (shuffle) the result from the temporary vector and accumulate it.
+ 3. **Formula:** $C += \text{LUT}_A[B]$ (No Multiplication).
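+
+As an illustration of the lookup-accumulate process above, here is a minimal Python/NumPy sketch (the function name and toy values are illustrative only; real CPU implementations perform step 2 with SIMD shuffle instructions rather than a scalar loop):
+
+```python
+import numpy as np
+
+def true_tmac_dot(a_scalar, b_indices):
+    """Multiplication-free accumulation for one activation element:
+    precompute a_scalar * w for all 16 possible 4-bit weights once,
+    then the inner loop only looks up and adds."""
+    # 1. Pre-computation: products with every possible 4-bit weight 0..15
+    lut_a = np.arange(16, dtype=np.float32) * np.float32(a_scalar)
+    # 2. Lookup-Accumulate: select by index; no multiply in this loop
+    acc = np.float32(0.0)
+    for idx in b_indices:
+        acc += lut_a[idx]   # C += LUT_A[B]
+    return float(acc)
+```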
+
+#### 2. GPU Adaptation ("TMAC Variant")
+
+* **Target Hardware:** MACA C500, NVIDIA GPUs.
+* **Constraint:** GPUs derive 90%+ of their peak throughput from **Tensor Cores**, which are fixed-function **Matrix-Multiply-Accumulate (MMA)** units. They do not support pure lookup-based computation efficiently.
+* **Implementation Strategy (W4A16 Dequantization):**
+ We adopt a hybrid approach: use LUTs to solve the **Bandwidth Bottleneck**, but use Tensor Cores to solve the **Compute Bottleneck**.
+    1. **Compressed Loading:** Load packed int8 weights, two 4-bit values per byte (75% less weight traffic than FP16).
+ 2. **On-Chip Dequantization:** Use LUTs in Shared Memory to map 4-bit indices back to FP16 values ($W = \text{LUT}[B]$).
+ 3. **Tensor Core Compute:** Perform standard matrix multiplication ($C += A \times W$) using the reconstructed weights.
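+
+A minimal plain-PyTorch sketch of the three steps above (a reference model of the math, not the TileLang kernel; sizes are toy values chosen for illustration):
+
+```python
+import torch
+
+torch.manual_seed(0)
+M, N, K = 4, 8, 16                                     # toy sizes
+A = torch.randn(M, K, dtype=torch.float16)
+lut = torch.randn(16, dtype=torch.float16)             # 4-bit codebook
+idx = torch.randint(0, 16, (K, N), dtype=torch.uint8)  # logical weight indices
+b_packed = (idx[:, 0::2] | (idx[:, 1::2] << 4)).to(torch.int8)  # (K, N // 2)
+
+# 1. Compressed "load", 2. unpack + LUT-decode back to FP16 ...
+unpacked = torch.empty(K, N, dtype=torch.long)
+unpacked[:, 0::2] = (b_packed & 15).long()             # low nibble
+unpacked[:, 1::2] = ((b_packed >> 4) & 15).long()      # high nibble (masked)
+W = lut[unpacked]                                      # (K, N) FP16 weights
+# 3. ... and multiply (done on Tensor Cores in the real kernel;
+# cast to FP32 here so the matmul also runs on CPU-only machines)
+C = A.float() @ W.float()                              # (M, N)
+```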
+
+### 💡 Design Rationale: Why a "Variant" on GPU?
+* **True TMAC (CPU/FPGA):** Eliminates multiplication entirely by using shuffle/lookup instructions.
+* **GPU Reality:** Tensor Cores are fixed-function MMA units; a pure lookup pipeline cannot use them and would fall back to the much slower general-purpose CUDA Cores, typically costing 10x-50x in throughput.
+* **The Optimal Strategy:** Use LUTs to solve the **Bandwidth Bottleneck** (W4A16 decoding) while retaining the **Compute Power** of Tensor Cores for the matrix multiplication itself.
+
+---
+
+## 2. Interface Parameters
+
+### Input Parameters
+
+* **`A` (Activations)**
+ * **Shape:** `(M, K)`
+ * **Data Type:** `float16`
+ * **Description:** The input feature matrix.
+* **`B_packed` (Compressed Weights)**
+ * **Shape:** `(K, N // 2)`
+ * **Data Type:** `int8`
+ * **Description:** 4-bit quantized weight matrix packed into 8-bit integers. **Physical N is halved** because each byte stores two 4-bit weights.
+* **`LUT` (Quantization Codebook)**
+ * **Shape:** `(16, )`
+ * **Data Type:** `float16`
+ * **Description:** A look-up table containing the 16 representative FP16 values for the 4-bit indices ($0..15$).
+
+### Output Parameters
+
+* **`C` (Result)**
+ * **Shape:** `(M, N)`
+ * **Data Type:** `float16`
+ * **Description:** The result of the matrix multiplication $C = A \times \text{Dequant}(B)$.
+
+### Optional Parameters (Configuration)
+
+* **`block_M` / `block_N`**: Tile sizes for the output rows (M) and columns (N) (default: 64 each).
+* **`block_K`**: Tile size for the reduction dimension (default: 32).
+* **`num_stages`**: Software pipeline depth to hide memory latency (default: 2).
+* **`threads`**: Number of threads per block (default: 128).
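+
+To make the `B_packed` layout concrete, here is a tiny worked example of the nibble packing (the index values are chosen arbitrarily):
+
+```python
+import torch
+
+# Logical 4-bit weight indices, shape (K, N) with K = 1, N = 4
+idx = torch.tensor([[1, 14, 3, 9]], dtype=torch.uint8)
+# Even columns -> low nibble, odd columns -> high nibble: shape (K, N // 2)
+b_packed = (idx[:, 0::2] | (idx[:, 1::2] << 4)).to(torch.int8)
+# byte 0 = 1 | (14 << 4) = 0xE1, byte 1 = 3 | (9 << 4) = 0x93
+
+# Decoding: mask after the shift to defeat int8 sign extension
+low = b_packed & 15
+high = (b_packed >> 4) & 15
+```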
+
+---
+
+## 3. Usage Example & Verification
+
+The following Python script uses TileLang (`tilelang`) to implement the TMAC kernel.
+
+### Verification Methodology
+To ensure the mathematical correctness of the GPU W4A16 kernel, we establish a **Ground Truth Baseline** using CPU simulation:
+1. **CPU Baseline ("True TMAC"):** A pure Python implementation that strictly follows the theoretical $C += A \times LUT[B]$ logic (Lookup-Accumulate) using FP32 precision. This serves as the mathematical gold standard.
+2. **GPU Target ("TMAC Variant"):** The high-performance W4A16 TileLang kernel.
+3. **Comparison:** We verify that the GPU kernel produces results identical to the theoretical pure-LUT logic (within FP16 tolerance).
+
+### Complete Code
+
+```python
+import torch
+import tilelang
+import tilelang.language as T
+import time
+
+# =================================================================
+# 1. GPU Kernel: TMAC W4A16 Implementation (MACA Optimized)
+# =================================================================
+def tmac_gpu_w4a16(M, N, K, block_M=64, block_N=64, block_K=32, num_stages=2, threads=128):
+ """
+ TMAC GPU Kernel.
+ Logic: Packed Load -> Bitwise Unpack -> LUT Dequant -> Tensor Core GEMM.
+ """
+ # Physical width is half of logical width (4-bit packing)
+ N_packed = N // 2
+ block_N_packed = block_N // 2
+ total_steps = (K + block_K - 1) // block_K
+
+ @T.prim_func
+ def main(
+ A: T.Tensor((M, K), "float16"),
+ B_packed: T.Tensor((K, N_packed), "int8"),
+ LUT: T.Tensor((16, ), "float16"),
+ C: T.Tensor((M, N), "float16"),
+ ):
+ with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
+ # Shared Memory Allocations
+ A_shared = T.alloc_shared((block_M, block_K), "float16")
+ B_shared_packed = T.alloc_shared((block_K, block_N_packed), "int8")
+ B_shared_fp16 = T.alloc_shared((block_K, block_N), "float16")
+ C_local = T.alloc_fragment((block_M, block_N), "float32")
+
+ T.clear(C_local)
+
+ # Pipelined Loop
+ for k in T.Pipelined(0, total_steps, num_stages=num_stages):
+ # 1. Global -> Shared (High Bandwidth Efficiency)
+ T.copy(A[by * block_M, k * block_K], A_shared)
+ T.copy(B_packed[k * block_K, bx * block_N_packed], B_shared_packed)
+
+ # 2. TMAC Logic: Unpack & Lookup in Shared Memory
+ # This prepares valid FP16 data for Tensor Cores
+ for i, j in T.Parallel(block_K, block_N_packed):
+ packed_byte = B_shared_packed[i, j]
+
+ # Unpack Lower 4 bits: val & 0x0F
+ idx_low = packed_byte & 15
+ B_shared_fp16[i, 2*j] = LUT[idx_low]
+
+ # Unpack Upper 4 bits: (val >> 4) & 0x0F
+ # Masking is required to handle sign extension
+ idx_high = (packed_byte >> 4) & 15
+ B_shared_fp16[i, 2*j + 1] = LUT[idx_high]
+
+ # 3. Compute (Tensor Core MMA)
+ T.gemm(A_shared, B_shared_fp16, C_local)
+
+ # Store Result
+ T.copy(C_local, C[by * block_M, bx * block_N])
+
+ return main
+
+# =================================================================
+# 2. CPU Logic: True TMAC Simulation (For Verification)
+# =================================================================
+def tmac_true_logic_cpu(A, B_indices, LUT):
+ """
+ Simulates 'True TMAC' math: Output += A * LUT[Index].
+ This verifies the mathematical correctness of the Dequant+GEMM approach.
+ """
+ # 1. Lookup / Shuffle (Vectorized on CPU)
+ # B_indices are 0-15 integers
+ decoded_W = LUT[B_indices.long()]
+
+ # 2. Accumulate
+ # In hardware, this is done without forming the full W matrix
+ return torch.matmul(A, decoded_W)
+
+# =================================================================
+# 3. Benchmark & Verification
+# =================================================================
+def run_benchmark():
+ M, N, K = 2048, 2048, 2048
+ print(f"=== TMAC Benchmark (W4A16): M={M}, N={N}, K={K} ===")
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ if device == "cpu":
+ print("Error: No GPU found.")
+ return
+
+ torch.manual_seed(0)
+
+ # --- Data Preparation ---
+ # 1. Activations
+ a_fp16 = torch.randn(M, K, device=device, dtype=torch.float16)
+
+ # 2. Weights (4-bit Indices)
+ indices = torch.randint(0, 16, (K, N), device=device, dtype=torch.uint8)
+
+ # 3. LUT (Codebook)
+ lut = torch.randn(16, device=device, dtype=torch.float16)
+
+ # 4. Pack Indices for GPU (int8)
+ indices_even = indices[:, 0::2]
+ indices_odd = indices[:, 1::2]
+ b_packed = (indices_even | (indices_odd << 4)).to(torch.int8)
+
+ # --- Compilation ---
+ print("Compiling TileLang Kernel...")
+ func = tmac_gpu_w4a16(M, N, K)
+ # Automatically detects MACA/CUDA target
+ kernel = tilelang.compile(func, out_idx=[3])
+
+ # --- Performance Test ---
+ print("Benchmarking GPU Performance...")
+ # Warmup
+ kernel(a_fp16, b_packed, lut)
+ torch.cuda.synchronize()
+
+ profiler = kernel.get_profiler()
+ latency = profiler.do_bench()
+ tflops = (2 * M * N * K) / (latency * 1e-3) / 1e12
+ print(f"Latency: {latency:.3f} ms")
+ print(f"Throughput: {tflops:.2f} TFLOPS")
+
+ # --- Correctness Verification ---
+ print("\nVerifying against CPU True TMAC Logic...")
+ # Run GPU Kernel
+ c_gpu = kernel(a_fp16, b_packed, lut)
+
+ # Run CPU Simulation
+ c_ref = tmac_true_logic_cpu(a_fp16.float().cpu(), indices.cpu(), lut.float().cpu())
+
+ # Compare
+ diff = (c_gpu.float().cpu() - c_ref).abs().max()
+ print(f"Max Absolute Difference: {diff:.4f}")
+
+ if diff < 1.0: # FP16 accumulation tolerance
+ print("✅ PASS: GPU W4A16 implementation matches True TMAC logic.")
+ else:
+ print("❌ FAIL: Logic mismatch.")
+
+if __name__ == "__main__":
+ run_benchmark()
+```
+
+## 4. Performance Analysis
+
+### Recommended Configuration
+
+- **Tile Size:** block_M=64, block_N=64 provides optimal occupancy for mixed-precision workloads on C500.
+- **Pipeline:** num_stages=2 efficiently hides the latency of the unpacking/lookup operations.
+- **Threads:** 128 threads (4 Warps) aligns with the shared memory swizzling requirements.
+
+### Optimization Insights
+
+1. **Bandwidth Multiplier:**
+ - By loading int8 (packed 4-bit) instead of float16, this operator increases the effective weight loading bandwidth by **4x**. This is the primary speedup factor for memory-bound layers.
+2. **L1 Cache Utilization:**
+ - The LUT tensor is extremely small (16 FP16 values = 32 Bytes). On MACA GPUs, this resides permanently in the **L1 Constant Cache**, making the lookup operation (LUT[idx]) virtually zero-cost compared to global memory access.
+3. **Hardware Synergy:**
+ - While the kernel performs "LUT-based Dequantization" (soft logic), it feeds the result into **Tensor Cores** (hard logic). This hybrid approach combines the flexibility of low-bit quantization with the raw throughput of specialized hardware.
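+
+As a back-of-envelope sanity check on the 4x figure, the weight traffic for the example layer can be computed directly (this ignores activation traffic and cache effects):
+
+```python
+# FP16 baseline vs. packed 4-bit weights for a (K, N) = (2048, 2048) layer
+K, N = 2048, 2048
+fp16_bytes = K * N * 2        # 2 bytes per FP16 weight
+packed_bytes = K * N // 2     # two 4-bit weights per int8 byte
+lut_bytes = 16 * 2            # the 32-byte codebook is effectively free
+ratio = fp16_bytes / (packed_bytes + lut_bytes)
+print(f"Weight-traffic reduction: {ratio:.2f}x")
+```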
\ No newline at end of file