YouTube’s auto-generated and auto-translated subtitles are notoriously unreliable. For many videos, captions are missing, delayed, or inconsistent across browsers.
I built a Chrome extension to generate subtitles directly from the audio, with features like:
- Real-time transcription of video audio
- Translation into 100+ languages
- Multiple subtitles at once (useful for language learners)
- Search inside video subtitles (like CTRL+F)
- Drag-and-drop subtitle placement for optimal viewing
- Optional: dictionary lookup, summarization, and Q&A based on video content
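As a rough illustration of the CTRL+F-style search, here is a minimal sketch that assumes the transcript is kept as a list of timestamped cues. The `Cue` shape and the `searchCues` name are hypothetical and chosen for the example; they are not taken from the extension's code.

```typescript
// Hypothetical cue shape (an assumption for this sketch, not the extension's data model).
interface Cue {
  start: number; // seconds from the start of the video
  end: number;   // seconds from the start of the video
  text: string;
}

// Case-insensitive substring search over cues, returning matches in time order.
// An empty or whitespace-only query returns no matches.
function searchCues(cues: Cue[], query: string): Cue[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  return cues
    .filter((cue) => cue.text.toLowerCase().includes(needle)) // filter() copies, so the input is not mutated
    .sort((a, b) => a.start - b.start);
}

const cues: Cue[] = [
  { start: 0, end: 2.5, text: "Welcome to the channel" },
  { start: 2.5, end: 6, text: "Today we look at WebAssembly" },
  { start: 6, end: 9, text: "and why webassembly can be slow" },
];

const hits = searchCues(cues, "WebAssembly");
```

Jumping to a match could then be as simple as setting the video element's `currentTime` to the matching cue's `start`.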
Here are short demo videos showing the extension generating subtitles from YouTube audio: https://drive.google.com/drive/folders/1I_z6HjGCVUwgYs1UXlB7...
I’d love feedback from anyone interested in real-time transcription, accessibility, or YouTube workflow automation. How would you approach a problem like this?
Is this open source? Can you link a repo?
It’s not open source at the moment.
I initially explored a fully client-side approach (including WebAssembly), but it didn’t work well in practice. Real-time audio transcription and multi-language translation are both compute-intensive, and browser-only solutions ran into performance and reliability limits, especially for longer videos and live streams.
Using a dedicated backend allowed more consistent latency and accuracy across browsers, and made features like multiple simultaneous subtitle languages and searchable transcripts feasible.
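To make the hand-off to a backend more concrete, here is a minimal sketch of one common way to prepare audio for server-side transcription: cutting a mono PCM buffer into fixed-length, overlapping windows so each request stays small and words at a boundary appear in two windows. This is an illustration under assumptions, not a description of the extension's actual pipeline; `chunkAudio`, `AudioChunk`, and the 4-second window with 1-second overlap are all hypothetical.

```typescript
// One window of audio plus its position in the original stream.
interface AudioChunk {
  startSec: number;      // where this window begins, in seconds
  samples: Float32Array; // view into the original buffer (no copy)
}

// Split mono PCM samples into windows of `windowSec` seconds that overlap
// by `overlapSec` seconds. The final window may be shorter than the rest.
function chunkAudio(
  samples: Float32Array,
  sampleRate: number,
  windowSec: number,
  overlapSec: number,
): AudioChunk[] {
  const windowLen = Math.round(windowSec * sampleRate);
  const hop = windowLen - Math.round(overlapSec * sampleRate);
  if (windowLen <= 0 || hop <= 0) {
    throw new Error("windowSec must be positive and greater than overlapSec");
  }
  const result: AudioChunk[] = [];
  for (let offset = 0; offset < samples.length; offset += hop) {
    const end = Math.min(offset + windowLen, samples.length);
    result.push({ startSec: offset / sampleRate, samples: samples.subarray(offset, end) });
    if (end === samples.length) break; // the last window reached the end of the buffer
  }
  return result;
}

// Illustrative values only: 10 seconds of silence at 16 kHz, 4 s windows, 1 s overlap.
const sampleRate = 16000;
const tenSeconds = new Float32Array(10 * sampleRate);
const demoChunks = chunkAudio(tenSeconds, sampleRate, 4, 1);
```

Each chunk carries its own `startSec`, so transcripts returned by a server can be mapped back to video time regardless of the order in which responses arrive.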
If you’re curious to see how it behaves in practice, I have a temporary install available while the Chrome Web Store listing is pending: https://content.ray.techspecs.io/public/assets/apps/extensio...
It’s a standard unpacked Chrome extension install (developer mode → load unpacked). Happy to answer any technical questions about the pipeline or trade-offs.