Hey HN! I'm a student who built this over the past 5 months.
Why I built this: Every project I worked on hit the same wall: I couldn't use real data because of HIPAA/GDPR, public datasets were too generic, and mocking data by hand was painful. Existing tools like Gretel or Tonic are enterprise-priced and closed-source.
So I built an open-source alternative that does two things:
Schema mode: Define columns and generate up to 1M rows (no training data needed).
ML mode: Upload a CSV to train CTGAN/TVAE/Copula and generate high-fidelity synthetic data.
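To make schema mode concrete, here is a minimal sketch of the idea using only the Python standard library. The schema format, column names, and function names are made up for illustration and are not the project's actual API.

```python
import random

# Hypothetical schema format (an assumption for this sketch):
# column name -> (column type, options for that type).
SCHEMA = {
    "patient_id": ("int", {"min": 1, "max": 999_999}),
    "age": ("int", {"min": 0, "max": 100}),
    "blood_type": ("choice", {"values": ["A", "B", "AB", "O"]}),
    "weight_kg": ("float", {"min": 40.0, "max": 140.0}),
}


def generate_value(kind, opts, rng):
    """Draw one random value for a column of the given type."""
    if kind == "int":
        return rng.randint(opts["min"], opts["max"])
    if kind == "float":
        return round(rng.uniform(opts["min"], opts["max"]), 2)
    if kind == "choice":
        return rng.choice(opts["values"])
    raise ValueError(f"unsupported column type: {kind}")


def generate_rows(schema, n_rows, seed=None):
    """Lazily yield n_rows dicts that follow the schema (no training data)."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield {
            name: generate_value(kind, opts, rng)
            for name, (kind, opts) in schema.items()
        }


rows = list(generate_rows(SCHEMA, 5, seed=42))
for row in rows:
    print(row)
```

Yielding rows from a generator keeps memory flat, which matters once the row count gets into the millions.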
Hardest technical challenge: Getting the differential privacy parameters right. The $\epsilon$ (epsilon) budget trades privacy against utility: too strict (small $\epsilon$) and the data is useless; too loose (large $\epsilon$) and privacy leaks. I ended up exposing it as a configurable slider with sensible defaults and documentation.
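The trade-off is easiest to see on a toy example. The sketch below uses the textbook Laplace mechanism on a single count query, which is an assumption chosen for illustration; the post does not describe the mechanism the project uses when training generative models, and all function names here are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average absolute error of the noisy count over many releases."""
    rng = random.Random(seed)
    true_count = 1000
    errors = [
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


# Smaller epsilon = stronger privacy = noisier answers.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: mean |error| ~ {mean_abs_error(eps):.2f}")
```

The expected absolute error of Laplace noise equals its scale, so the error shrinks roughly in proportion to 1/epsilon. That inverse relationship is what a slider over $\epsilon$ is really controlling.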
Pricing/Openness: 100% MIT licensed (fork it, host it, modify it).
Self-host: docker-compose up and you're running.
No tracking or data collection on self-hosted instances.
Tech stack:
Frontend: Next.js 15, TypeScript, Tailwind
Backend: FastAPI, PostgreSQL, Redis
ML: SDV library (CTGAN, TVAE, GaussianCopula)
Privacy: Differential privacy with $(\epsilon, \delta)$ guarantees.
Auth: Better Auth (self-hosted)
Deployment: Docker Compose
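For anyone unfamiliar with the $(\epsilon, \delta)$ notation in the Privacy line, this is the standard textbook definition, stated here for reference rather than as a description of the project's internals. A randomized mechanism $M$ is $(\epsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set of outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S] + \delta
$$

Smaller $\epsilon$ forces the two output distributions closer together, and $\delta$ is the small probability with which that bound is allowed to fail.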
Try it out: Live playground (no signup): https://www.synthdata.studio/playground
GitHub: https://github.com/Urz1/synthetic-data-studio
I’d love to hear your feedback on the architecture, privacy implementation, or what features would make this useful for your workflow!