LoPA: a training-free algorithm that breaks the speed limit of dLLMs, scaling diffusion LLM inference to 10.1 TPF and 1000+ TPS!
- 10.1 tokens per forward pass (TPF) on GSM8K.
- 1073.9 tokens/s throughput on multi-device systems.
- SOTA speed without retraining.
LoPA-Dist: Engineered for Scale
The algorithm is only half the battle. We built LoPA-Dist with Branch Parallelism (BP) to handle the load:
- NVIDIA GPUs: Implements a two-phase update protocol (Pre-Write / Commit-Winner) to ensure KV cache consistency.
- Ascend 910C: Utilizes Graph Compilation and Block-wise masking for high-throughput serving.
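The two-phase Pre-Write / Commit-Winner idea can be sketched as follows: each branch speculatively pre-writes its KV entries into a private scratch buffer, and only the winning branch's entries are committed to the shared cache, so losing branches never pollute it. This is a minimal illustrative sketch, not LoPA-Dist's actual API; all class and method names here are assumptions.

```python
import numpy as np

class BranchParallelKVCache:
    """Illustrative two-phase KV update under branch parallelism.

    Phase 1 (pre_write): each branch writes speculative K/V blocks to
    its own scratch region. Phase 2 (commit_winner): only the chosen
    branch's block is copied into the shared cache.
    Names and shapes are hypothetical, for exposition only.
    """

    def __init__(self, num_branches: int, max_len: int, dim: int):
        self.shared_kv = np.zeros((max_len, dim))               # committed cache
        self.scratch = np.zeros((num_branches, max_len, dim))   # per-branch pre-writes
        self.committed_len = 0                                  # length of committed prefix

    def pre_write(self, branch: int, kv_block: np.ndarray) -> None:
        """Phase 1: branch writes its speculative block to scratch only."""
        start = self.committed_len
        self.scratch[branch, start:start + len(kv_block)] = kv_block

    def commit_winner(self, branch: int, block_len: int) -> None:
        """Phase 2: copy the winning branch's block into the shared cache."""
        start = self.committed_len
        self.shared_kv[start:start + block_len] = \
            self.scratch[branch, start:start + block_len]
        self.committed_len += block_len

# Two branches pre-write competing blocks; only branch 1 is committed.
cache = BranchParallelKVCache(num_branches=2, max_len=16, dim=4)
cache.pre_write(0, np.ones((3, 4)))
cache.pre_write(1, 2 * np.ones((3, 4)))
cache.commit_winner(1, 3)
```

Because pre-writes land in disjoint scratch regions, branches can run concurrently without locking the shared cache; consistency is enforced at the single commit point.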