1 comment

  • 0x0funky 2 hours ago

    I built this because Meta’s SAM-Audio (Segment Anything for Audio) is a breakthrough for interactive sound separation, but the original implementation is heavy: it often needs 30GB+ of VRAM because it loads the Vision Encoders and Rankers by default.

    The Problem: Beyond the VRAM barrier, the Windows installation is a "dependency hell" due to mismatched FFmpeg and TorchCodec DLLs.

    My Approach (The "Lite Mode"):

    · Memory Trimming: I modified the model initialization to strip the Vision Encoder and various rerankers for pure audio tasks. This brings the footprint down to ~6GB VRAM for the Small model (bfloat16).

    · Automated Setup: Bundled an install.bat that pins compatible versions of PyTorch and FFmpeg so it works on Windows 11 out of the box.

    · Architecture: Built with a Next.js (Tailwind v4) frontend and a FastAPI/Celery backend to provide a modern interface over the CLI.
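
    For readers wondering what the memory trimming in the first bullet amounts to, here is a rough sketch of the general idea, not the project's actual code. It filters a checkpoint by key prefix so that components unused in pure-audio tasks are never loaded, then compares weights-only memory at 2 bytes per parameter (bfloat16). The prefixes `vision_encoder.` and `ranker.`, the helper names, and every number in the toy checkpoint are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: skip checkpoint entries for components that pure-audio
# separation does not need. All names and sizes below are invented.

SKIP_PREFIXES = ("vision_encoder.", "ranker.")  # assumed module prefixes


def trim_checkpoint(state, skip_prefixes=SKIP_PREFIXES):
    """Return a copy of `state` without keys that start with a skipped prefix."""
    return {k: v for k, v in state.items() if not k.startswith(skip_prefixes)}


def weight_bytes(param_counts, bytes_per_param=2):
    """Weights-only memory in bytes; bfloat16 uses 2 bytes per parameter."""
    return sum(param_counts.values()) * bytes_per_param


# Toy checkpoint: parameter name -> parameter count (made-up numbers).
toy = {
    "audio_encoder.weight": 1_000,
    "separator.weight": 2_000,
    "vision_encoder.weight": 5_000,
    "ranker.head.weight": 1_500,
}

lite = trim_checkpoint(toy)
print(sorted(lite))                           # keys kept for audio-only use
print(weight_bytes(toy), weight_bytes(lite))  # weights-only bytes, full vs. lite
```

    In a real PyTorch setup, a similar filter could be applied to a state dict before calling `load_state_dict(..., strict=False)`, provided the skipped submodules are also not instantiated. Actual savings depend on the real parameter counts, and activations add to the footprint on top of the weights.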

    Everything is open-source (MIT). I hope this makes professional-grade audio separation accessible to those with consumer-grade hardware like the RTX 3060/4060.

    I'm curious to hear from anyone testing this on different GPU architectures!