I Beat Nvidia NCCL by 2.4x

1 points | by venkat_2811 2 hours ago

2 comments

  • venkat_2811 2 hours ago

    100% OSS, MIT License. YALI - Yet Another Low-Latency Implementation. Achieves 80-85% of speed-of-light SW efficiency by using ultra-low-latency primitives for the p2p all_reduce_sum collective, a critical operation in multi-GPU LLM training and inference.
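      For readers unfamiliar with the collective: all_reduce_sum means every rank contributes a buffer and, when the operation completes, every rank holds the elementwise sum of all contributions. A minimal CPU-side sketch of that contract (real implementations like NCCL or YALI move the data over NVLink/PCIe; the rank count and buffer sizes here are arbitrary):

      ```c
      #include <stdio.h>

      #define RANKS 4
      #define N 3

      /* Simulated all_reduce_sum: each "rank" holds a buffer; afterwards
         every rank holds the elementwise sum of all ranks' buffers. */
      int main(void) {
          float bufs[RANKS][N];
          for (int r = 0; r < RANKS; r++)
              for (int i = 0; i < N; i++)
                  bufs[r][i] = (float)(r + 1);   /* rank r holds [r+1, ...] */

          float sum[N] = {0};
          for (int r = 0; r < RANKS; r++)        /* reduce */
              for (int i = 0; i < N; i++)
                  sum[i] += bufs[r][i];

          for (int r = 0; r < RANKS; r++)        /* broadcast back to all */
              for (int i = 0; i < N; i++)
                  bufs[r][i] = sum[i];

          printf("%g %g %g\n", bufs[0][0], bufs[2][1], bufs[3][2]);
          return 0;   /* every rank now holds 1+2+3+4 = 10 everywhere */
      }
      ```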

      venkat_2811 2 hours ago

      Wisdom from CPU land translates well to GPUs. Static scheduling, prefetching, 3-stage double-buffering, pre-allocation, and memory ordering in a custom CUDA kernel help it outperform NVIDIA NCCL. An experimental integration in vllm.rs shows ~20% prefill (TTFT) and ~10% decode (TPOT) latency improvements.
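      The double-buffering idea can be sketched in plain C: while the current chunk is being reduced out of one buffer, the next chunk is prefetched into the other, so transfer and compute overlap. This is a hypothetical illustration of the scheduling pattern only (chunk sizes, the fetch helper, and the sequential stages are my assumptions; on a GPU the stages run concurrently across copy engines and warps):

      ```c
      #include <stdio.h>
      #include <string.h>

      #define TOTAL 8
      #define CHUNK 4

      /* Stand-in for a DMA/p2p copy of one chunk from a peer's buffer. */
      static void fetch(const float *src, float *buf, int k) {
          memcpy(buf, src + k * CHUNK, CHUNK * sizeof(float));
      }

      int main(void) {
          float peer[TOTAL], local[TOTAL], out[TOTAL];
          for (int i = 0; i < TOTAL; i++) { peer[i] = 1.0f; local[i] = (float)i; }

          float bufs[2][CHUNK];                 /* the two ping-pong buffers */
          int nchunks = TOTAL / CHUNK;

          fetch(peer, bufs[0], 0);              /* prime the pipeline */
          for (int k = 0; k < nchunks; k++) {
              if (k + 1 < nchunks)              /* prefetch next into idle buffer */
                  fetch(peer, bufs[(k + 1) & 1], k + 1);
              float *cur = bufs[k & 1];
              for (int i = 0; i < CHUNK; i++)   /* reduce current chunk */
                  out[k * CHUNK + i] = local[k * CHUNK + i] + cur[i];
          }

          printf("%g %g\n", out[0], out[TOTAL - 1]);
          return 0;
      }
      ```

      On a GPU the same structure is what lets a statically scheduled kernel hide transfer latency behind the reduction, with memory fences ordering the buffer handoff between stages.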