1 comment

  • ottoselymesi 3 days ago

    OP here. I wrote this to handle ragged/irregular data without padding or sorting. Instead of assigning one thread per stream (divergence hell when stream lengths vary), it uses a holistic grid-stride traversal.

    Benchmarks on GTX 1070 (Pascal): Ragged Reduction: ~2.45x faster than baseline. Nested Analytics: ~1.98x faster (single-pass).

    Header-only C++17. Happy to answer questions.
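
For readers unfamiliar with the term, here is a minimal CPU-side sketch of what a grid-stride traversal over ragged data can look like. It is not the library's actual code: the function name `ragged_sums`, the CSR-style `offsets` layout, and the binary search used to map an element to its segment are all assumptions made for illustration. The outer loop stands in for GPU threads, and each simulated thread walks the flattened array with a stride equal to the total thread count, so the work per thread depends on the total element count rather than on any single stream's length.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library discussed above).
// values:  all ragged segments concatenated into one flat array.
// offsets: CSR-style boundaries; segment s covers [offsets[s], offsets[s+1]).
// total_threads: number of simulated GPU threads (gridDim.x * blockDim.x).
std::vector<long long> ragged_sums(const std::vector<int>& values,
                                   const std::vector<std::size_t>& offsets,
                                   std::size_t total_threads) {
    if (offsets.size() < 2 || total_threads == 0) {
        return {};
    }
    std::vector<long long> sums(offsets.size() - 1, 0);

    // On a GPU the outer loop would not exist: each thread would start at
    // tid = blockIdx.x * blockDim.x + threadIdx.x.
    for (std::size_t tid = 0; tid < total_threads; ++tid) {
        // Grid-stride loop: every thread handles elements tid, tid + T, ...
        for (std::size_t i = tid; i < values.size(); i += total_threads) {
            // Find the segment owning flat index i: the last offset <= i.
            // upper_bound skips past empty segments correctly.
            auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
            std::size_t seg = static_cast<std::size_t>(it - offsets.begin()) - 1;
            // On a GPU this accumulation would need an atomic add or a
            // warp-level reduction; here it is a plain sequential add.
            sums[seg] += values[i];
        }
    }
    return sums;
}
```

Calling `ragged_sums` with nine values and offsets `{0, 3, 3, 4, 9}` yields one sum per segment, including a zero for the empty second segment. A real kernel would likely avoid a per-element binary search and a per-element atomic, so treat this only as an illustration of the traversal pattern.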