OCR my PDF
On a regular basis I have to scan paper documents. For the scanning I still use xsane (X Scanner Access Now Easy). Afterwards I used GIMP (GNU Image Manipulation Program) to do the post-processing:
- adjust black / white balance to make the background white
- de-speckle some areas
This normally reduced the file size by a factor of 10, from 10 MiB per page to roughly 1 MiB per page. While it worked for me, it was a lot of manual work and took its time.
During the last Mini DebConf Regensburg 2021 I learned about OpenPaper.work. I use it to archive my PDFs, which allows me to tag and find them. Under the hood it uses Tesseract OCR to convert the pixel-PDFs into searchable PDFs. But it does not handle the shrinking part.
So I continued my journey and found OCRmyPDF.
Under the hood it uses unpaper to do some post-processing.
Afterwards tesseract is used to do the OCR.
By default it preserves the original PDF, but you can also use it to create a minimized PDF like this:
ocrmypdf -l deu --remove-background --deskew --clean --clean-final -O 2 input.pdf output.pdf
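Since I scan documents regularly, the same call can be wrapped in a small shell function that processes a whole directory. This is only a minimal sketch: the function name ocr_dir and the directory names in the usage comment are made up for illustration, and the flags are copied unchanged from the command above, so whether all of them are accepted may depend on the installed OCRmyPDF version.

```shell
#!/bin/sh
# Sketch: run the ocrmypdf call from above on every PDF in a directory.
# ocr_dir is a hypothetical helper name, not part of OCRmyPDF.
# Usage: ocr_dir scans archive   (both directory names are examples)

ocr_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        # If the directory holds no PDFs the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        # Same options as the single-file command; the output keeps the file name.
        ocrmypdf -l deu --remove-background --deskew --clean --clean-final -O 2 \
            "$pdf" "$out_dir/$(basename "$pdf")"
    done
}
```

The originals in the input directory are left untouched, so the result can be checked before the larger scans are deleted.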