
  • aashirpersonal 2 hours ago

    Hi HN,

    I’ve been building RAG systems for a while, and I noticed that 90% of retrieval failures aren't caused by the LLM; they're caused by the data. I got tired of debugging hallucinations only to find that the retriever had pulled "Page 1 of 5" headers or five duplicate versions of an old policy.

    I couldn't find a simple "pandas-profiling" equivalent for unstructured text, so I built this.

    It runs locally (CLI) and helps you:

    Detect semantic duplicates (using all-MiniLM-L6-v2) to save vector storage costs.

    Flag PII (API keys, emails) before they get indexed.

    Identify "coverage gaps" by comparing user queries against your docs.
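
    For anyone curious what these three checks look like in principle, here is a minimal sketch. It is not the profiler's actual code: the function names, thresholds, and regex patterns are all hypothetical, and `embed` is a toy bag-of-words stand-in so the snippet runs with the standard library only. A real run would swap in sentence embeddings from a model such as all-MiniLM-L6-v2.

```python
import math
import re
from collections import Counter


def embed(text):
    # ASSUMPTION: toy bag-of-words vector, used only so the sketch is runnable.
    # It is lexical, not semantic; replace with real sentence embeddings.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_duplicates(chunks, threshold=0.9):
    # Return index pairs (i, j) whose similarity meets the threshold.
    vecs = [embed(c) for c in chunks]
    return [
        (i, j)
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


# ASSUMPTION: illustrative patterns only; real key formats vary by provider.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
}


def flag_pii(chunk):
    # Return the sorted names of every pattern that matches the chunk.
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(chunk))


def coverage_gaps(queries, chunks, threshold=0.3):
    # A query is a "gap" when no chunk is at least `threshold` similar to it.
    vecs = [embed(c) for c in chunks]
    gaps = []
    for q in queries:
        qv = embed(q)
        best = max((cosine(qv, v) for v in vecs), default=0.0)
        if best < threshold:
            gaps.append(q)
    return gaps


chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Refunds are issued within 30 days of purchase",
    "Contact ops@example.com with key sk-abcdefghijklmnop1234",
]
queries = ["refunds within 30 days", "how do I reset my password"]

duplicates = find_duplicates(chunks)
pii = flag_pii(chunks[2])
gaps = coverage_gaps(queries, chunks)
```

    The all-pairs comparison is quadratic in the number of chunks, which is fine for a sketch but would likely need an approximate nearest-neighbour index on a large corpus. Both thresholds are guesses and would need tuning against real data.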

    It outputs a standalone HTML report you can show to stakeholders.

    Written in Python, open source (MIT). Feedback welcome!

    https://github.com/aashirpersonal/rag-corpus-profiler