LoPA: a training-free algorithm that breaks the speed limit of dLLMs, scaling diffusion LLM inference to 10.1 TPF and 1000+ TPS!
- 10.1 tokens per forward pass (TPF) on GSM8K.
- 1073.9 tokens/s throughput on multi-device systems.
- SOTA speed without retraining.
LoPA-Dist: Engineered for Scale
The algorithm is only half the battle. We built LoPA-Dist with Branch Parallelism (BP) to handle the load:
- NVIDIA GPUs: Implements a two-phase update protocol (Pre-Write / Commit-Winner) to ensure KV cache consistency.
- Ascend 910C: Utilizes Graph Compilation and Block-wise masking for high-throughput serving.
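The two-phase Pre-Write / Commit-Winner idea can be sketched as follows: each branch speculatively pre-writes its KV entries into a private scratch buffer, and only the winning branch's entries are committed to the shared cache, so losing branches never pollute it. This is a minimal illustrative sketch, not LoPA-Dist's actual API; all class and method names here are assumptions.

```python
import numpy as np

class BranchParallelKVCache:
    """Illustrative two-phase KV update under branch parallelism.

    Phase 1 (pre_write): each branch writes speculative K/V blocks to
    its own scratch region. Phase 2 (commit_winner): only the chosen
    branch's block is copied into the shared cache.
    Names and shapes are hypothetical, for exposition only.
    """

    def __init__(self, num_branches: int, max_len: int, dim: int):
        self.shared_kv = np.zeros((max_len, dim))               # committed cache
        self.scratch = np.zeros((num_branches, max_len, dim))   # per-branch pre-writes
        self.committed_len = 0                                  # length of committed prefix

    def pre_write(self, branch: int, kv_block: np.ndarray) -> None:
        """Phase 1: branch writes its speculative block to scratch only."""
        start = self.committed_len
        self.scratch[branch, start:start + len(kv_block)] = kv_block

    def commit_winner(self, branch: int, block_len: int) -> None:
        """Phase 2: copy the winning branch's block into the shared cache."""
        start = self.committed_len
        self.shared_kv[start:start + block_len] = \
            self.scratch[branch, start:start + block_len]
        self.committed_len += block_len

# Two branches pre-write competing blocks; only branch 1 is committed.
cache = BranchParallelKVCache(num_branches=2, max_len=16, dim=4)
cache.pre_write(0, np.ones((3, 4)))
cache.pre_write(1, 2 * np.ones((3, 4)))
cache.commit_winner(1, 3)
```

Because pre-writes land in disjoint scratch regions, branches can run concurrently without locking the shared cache; consistency is enforced at the single commit point.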