I built this because Meta’s SAM-Audio (Segment Anything for Audio) is a breakthrough for interactive sound separation, but the original implementation is heavy: it often needs 30GB+ of VRAM because it loads the Vision Encoder and rerankers by default.
The Problem:
Beyond the VRAM barrier, the Windows installation is a "dependency hell" due to mismatched FFmpeg and TorchCodec DLLs.
My Approach (The "Lite Mode"):
· Memory Trimming: I modified the model initialization to strip the Vision Encoder and various rerankers for pure audio tasks. This brings the footprint down to ~6GB VRAM for the Small model (bfloat16).
· Automated Setup: I bundled an install.bat that pins compatible versions of PyTorch and FFmpeg so it works on Windows 11 out of the box.
· Architecture: Built with a Next.js (Tailwind v4) frontend and a FastAPI/Celery backend to provide a modern interface over the CLI.
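To make the memory-trimming idea concrete, here is a minimal sketch of one way to skip unused components by name prefix, plus a rough weights-only size estimate. It is not the project's actual code. The prefixes (`vision_encoder.`, `reranker.`), the helper names, and the parameter counts are all hypothetical placeholders, not SAM-Audio's real module names or sizes. The estimate covers weights only, so real VRAM use will be higher once activations and buffers are included.

```python
# Hypothetical sketch: the prefixes and parameter counts below are
# placeholders, not SAM-Audio's real module names or sizes.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def drop_unused_weights(state_dict, skip_prefixes):
    """Return a copy of state_dict without entries under skip_prefixes."""
    return {
        name: value
        for name, value in state_dict.items()
        if not name.startswith(tuple(skip_prefixes))
    }


def weights_gib(param_counts, dtype="bfloat16"):
    """Estimate weight memory in GiB (weights only, no activations)."""
    total_params = sum(param_counts.values())
    return total_params * BYTES_PER_PARAM[dtype] / 2**30


# Toy checkpoint: parameter name -> number of parameters (made-up values).
checkpoint = {
    "audio_encoder.weight": 600_000_000,
    "separator.weight": 400_000_000,
    "vision_encoder.weight": 1_000_000_000,
    "reranker.weight": 500_000_000,
}

# Keep only what a pure audio task would need in this toy layout.
lite = drop_unused_weights(checkpoint, ["vision_encoder.", "reranker."])
print(sorted(lite))
print(f"full: {weights_gib(checkpoint):.2f} GiB, lite: {weights_gib(lite):.2f} GiB")
```

The same prefix filter can be applied to a real PyTorch state dict before loading it, provided the model is also constructed without the matching submodules.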
Everything is open-source (MIT). I hope this makes professional-grade audio separation accessible to those with consumer-grade hardware like the RTX 3060/4060.
I'm curious to hear from anyone testing this on different GPU architectures!