linux

OCR my PDF

On a regular basis I have to scan some paper documents. For the scanning I still use xsane (X Scanner Access Now Easy) for that. Afterwards I used GIMP (GUN Image Manipulation Program) to do the post processing:

adjust black / white balance to make the background white
de-speckle some areas

This normally reduced the file size by a factor of 10 from 10 MiB per page to roughly 1 MiB per page. While it worked for me it was a lot of work, which took its time.

During the last Mini DebConf Regensburg 2021 I learned about OpenPaper.work. I use it to archive my PDFs, which allows me to tag and find them. Under the hood it uses Tesseract OCR to convert the pixel-PDFs into searchable PDFs. But it does not handle the shrinking part.

So I continued my journey and found OCRmyPDF. Under the hood it uses unpaper to do some post-processing. Afterwards tesseract is used to do the OCR.

By default it preserved the original PDF, but you can also use it to create a minimized PDF like this:

ocrmypdf -l deu --deskew --clean --clean-final -O 2 inpud.pdf output.pdf # --remove-background

Written on July 7, 2022