1 comment

  • perinban 3 hours ago

    Hey HN! I built MetaXuda after getting tired of "buy Windows for ML" advice when working on Apple Silicon.

    Problem: GPU acceleration in the mainstream ML stack is CUDA-only. XGBoost's GPU backend requires CUDA, and scikit-learn only gets GPU speedups via CUDA-based cuML, so there is no native macOS GPU path. Existing translation layers (ZLUDA) add overhead.

    Solution: Native Rust + Metal runtime from scratch.

    Key features:
    - 1.1 TOPS throughput (95% of M1 theoretical peak)
    - Tokio async scheduler with zero race conditions (see the sketch after this list)
    - Multi-tier memory: GPU → RAM → SSD (handles 100 GB+ workloads)
    - 230+ GPU ops (math, transforms, ML primitives)
    - CUDA-style APIs for easy library integration
    - Bypasses Numba's execution path
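
    On the scheduler: the following is not MetaXuda's actual code (GpuOp and submit_to_metal are invented names), just a minimal sketch of the Tokio message-passing pattern that avoids data races by construction. Every op goes through one mpsc channel into a single task that owns the GPU queue, so no mutable state is shared across threads.

        use tokio::sync::{mpsc, oneshot};

        // An op plus a one-shot channel to hand its result back.
        struct GpuOp {
            kernel: String,
            done: oneshot::Sender<Vec<f32>>,
        }

        // Stand-in for a real Metal command-buffer submission.
        fn submit_to_metal(kernel: &str) -> Vec<f32> {
            println!("dispatching {kernel}");
            vec![0.0; 4]
        }

        #[tokio::main]
        async fn main() {
            let (tx, mut rx) = mpsc::channel::<GpuOp>(64);

            // Single owner of the GPU queue; submissions are serialized
            // by design, so there is nothing to race on.
            tokio::spawn(async move {
                while let Some(op) = rx.recv().await {
                    let result = submit_to_metal(&op.kernel);
                    let _ = op.done.send(result);
                }
            });

            let (done, got) = oneshot::channel();
            tx.send(GpuOp { kernel: "saxpy".into(), done }).await.unwrap();
            println!("result: {:?}", got.await.unwrap());
        }

    The tradeoff is that one channel serializes submissions; fanning out to multiple Metal queues would need one such owner task per queue.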

    Technical approach:
    - No CUDA/ZLUDA code reuse (licensing + performance reasons)
    - PyO3 wrapper for Python (see the sketch after this list)
    - Arrow-based quantization in-kernel
    - 93.37% GPU utilization cap to prevent starving macOS itself
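
    For the Python boundary, here is a minimal PyO3 sketch (pyo3 0.21+ Bound API); metaxuda_sketch and scale are placeholder names, not the real package surface:

        use pyo3::prelude::*;

        // Toy "GPU" op: a real binding would hand the buffer to the
        // Metal runtime instead of computing on the CPU here.
        #[pyfunction]
        fn scale(data: Vec<f32>, factor: f32) -> PyResult<Vec<f32>> {
            Ok(data.into_iter().map(|x| x * factor).collect())
        }

        #[pymodule]
        fn metaxuda_sketch(m: &Bound<'_, PyModule>) -> PyResult<()> {
            m.add_function(wrap_pyfunction!(scale, m)?)?;
            Ok(())
        }

    Built with maturin, Python would then call metaxuda_sketch.scale([1.0, 2.0], 3.0).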

    Known limitations:
    - Metal stream limits are still undocumented by Apple
    - CUDA API coverage is incomplete (in progress)
    - Some calls block deliberately, trading raw speed for stability (see the sketch after this list)
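
    On that last point, a hedged sketch (not MetaXuda's code; wait_until_completed is a stand-in for a real Metal wait) of the standard Tokio way to keep a deliberate blocking wait from stalling the async scheduler:

        use std::time::Duration;

        // Pretend this blocks until a Metal command buffer completes.
        fn wait_until_completed() {
            std::thread::sleep(Duration::from_millis(5));
        }

        #[tokio::main]
        async fn main() {
            // The blocking wait runs on Tokio's dedicated blocking pool,
            // so async worker threads stay responsive while it sleeps.
            tokio::task::spawn_blocking(wait_until_completed)
                .await
                .expect("blocking task panicked");
            println!("command buffer done");
        }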

    pip install metaxuda

    Open to questions on Metal vs CUDA architecture, Rust async patterns, or Apple GPU quirks. Also looking for feedback on scheduler design.

    License inquiries: p.perinban@gmail.com