4 comments

  • vivzkestrel 3 hours ago

    - As you know, most OCR models are trained on PDFs, receipts, normal text, etc.

    - This, however, doesn't work very well for structured text like code.

    - What are some state-of-the-art, self-hostable OCR models capable of extracting code from images with very high accuracy?

    - I have tried Tesseract so far and it is not very good at this. Even if you are not familiar with any other model, perhaps you can suggest a Tesseract pipeline that I can follow to improve extraction accuracy.

    - Currently, my pipeline looks like this:

    - For every input image, check whether it is light text on a dark background or dark text on a light background.

    - As you know, Tesseract is trained mostly on dark text on a light background, so I invert the dark-background images before processing them with Tesseract.

    - Are there other steps you think I need to include?
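
For reference, the polarity check and inversion described above could be sketched roughly as follows. This is only a minimal illustration, not the exact code in the pipeline: the function names, the median heuristic, and the threshold of 128 are all assumptions, and the input is assumed to be a flat sequence of 8-bit grayscale values.

```python
from statistics import median


def is_light_on_dark(gray, threshold=128):
    """Guess polarity from the median gray level.

    Assumption: background pixels outnumber text pixels, so a low median
    (mostly dark pixels) suggests light text on a dark background.
    """
    return median(gray) < threshold


def invert(gray):
    """Invert 8-bit gray levels (0 = black, 255 = white)."""
    return bytes(255 - p for p in gray)


def normalize_polarity(gray, threshold=128):
    """Return the pixels as dark text on a light background."""
    if is_light_on_dark(gray, threshold):
        return invert(gray)
    return bytes(gray)
```

If Pillow is available, `Image.open(path).convert("L").tobytes()` should give a suitable input, and `Image.frombytes("L", img.size, data)` can rebuild the image before it is handed to Tesseract. The median is just one possible heuristic; screenshots with large mixed-color regions may need something more robust.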

      treetalker an hour ago

      Not sure about running it in AWS, but this works well even on Intel Macs:

      https://github.com/LESIM-Co-Ltd/CoreOCR

      There are other similar wrappers for the macOS Vision framework; just search on GitHub.

        vivzkestrel an hour ago

        - On a quick look, it seems to have a CLI.

        - My primary use case is to invoke it from a server (e.g. Python or Express) to perform batch recognition on images submitted via API endpoints.

        - Any idea how long it takes to OCR a 1280x720 PNG image? Thank you for sharing that, btw.
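
To make the server-side use case concrete, here is a rough sketch of calling an OCR command-line tool from Python and timing each image. The `command` argument is a placeholder: the actual executable name and flags of any particular tool are not known here, and the sketch assumes the tool takes the image path as its last argument and prints the recognized text to stdout.

```python
import subprocess
import time


def run_ocr_cli(command, image_path, timeout=60):
    """Run `command + [image_path]`; return (stdout text, seconds elapsed).

    `command` is a hypothetical list such as ["some-ocr-tool"]; the real
    executable and its flags depend on the tool being wrapped.
    """
    start = time.perf_counter()
    result = subprocess.run(
        [*command, str(image_path)],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout, time.perf_counter() - start


def run_batch(command, image_paths, timeout=60):
    """OCR images one after another; returns {path: (text, seconds)}."""
    return {p: run_ocr_cli(command, p, timeout) for p in image_paths}
```

The elapsed time includes process start-up, so for a 1280x720 PNG it would be an upper bound on the recognition time itself. An API handler could call `run_batch` from a worker thread or job queue so that slow images do not block incoming requests.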

          treetalker 27 minutes ago

          It is definitely run from the CLI.

          I don't know, but it's pretty fast. If I understand correctly, it's using the same functionality as the OCR in Apple Notes. (And for that reason Xcode needs to be installed.)