Show HN: LaTeX → structured arXiv data for scientific RAG

2 points | by cjlooi a day ago

1 comment

cjlooi a day ago
PDF-based pipelines are fundamentally lossy and compute-heavy, whether they rely on OCR, GROBID, or LLM-based parsing. They're simply not good enough for accurate scientific agents at scale.
To fix this, I'm launching ScienceStack API: a lossless, node-based API for scientific papers with LaTeX source, starting with arXiv.
It currently covers 150k+ arXiv papers, mainly in CS, Math, and Physics.
Every paper also ships with a WYSIWYG interactive reader at sciencestack.ai/paper/{arxivId}. Example: https://www.sciencestack.ai/paper/2512.24601v1
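For anyone wiring the reader into a tool, here is a minimal Python sketch that builds a reader link from the URL pattern above. The helper name `reader_url` and the ID check are illustrative assumptions, not part of the ScienceStack API. It only accepts new-style arXiv IDs (like the example), and whether un-versioned IDs resolve on the site is not something this sketch verifies.

```python
import re

# Base of the reader URL pattern quoted above: sciencestack.ai/paper/{arxivId}
READER_BASE = "https://www.sciencestack.ai/paper/"

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optional version suffix (v1, v2, ...).
# Old-style IDs such as "hep-th/9901001" are deliberately not handled here.
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def reader_url(arxiv_id: str) -> str:
    """Return the interactive-reader URL for an arXiv ID (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.fullmatch(arxiv_id):
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    return READER_BASE + arxiv_id


if __name__ == "__main__":
    print(reader_url("2512.24601v1"))
```

Validating the ID before building the link keeps malformed input from silently producing a dead URL.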
I’m giving away five 3-month Pro keys to early commenters who are building in this space (scientific tooling, agents, copilots, RAG, etc.). I’d love to hear what you’re working on.