Perform OCR on a Scanned PDF in Python Using borb

The Portable Document Format (PDF) is not a WYSIWYG (What You See Is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engine. To achieve this, PDF was designed to be interacted with through something closer to a programming language: it relies on a series of instructions and operations to produce a result. In fact, PDF is based on a scripting language, PostScript, which was the first device-independent Page Description Language.

In this guide, we'll be using borb, a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (giving you access to the exact coordinates and layout, if you choose to use them) and a high-level model (where you can delegate the precise calculations of margins, positions, etc. to a layout manager).
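To make the idea of a PDF as "a series of instructions" more concrete, here is a minimal sketch that assembles the kind of text-showing sequence found in a page's content stream. The operators (BT, Tf, Td, Tj, ET) are standard PDF operators, but the helper name text_content_stream and the font resource name F1 are made up for illustration. This is not how borb generates content internally.

```python
def text_content_stream(text: str, x: int, y: int, font: str = "F1", size: int = 24) -> str:
    """Build a minimal PDF text-showing sequence (illustrative only)."""
    # Escape the characters that are special inside a PDF literal string
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "BT",                  # begin a text object
        f"/{font} {size} Tf",  # select a font resource and size
        f"{x} {y} Td",         # move to the text position
        f"({escaped}) Tj",     # show the string
        "ET",                  # end the text object
    ])


print(text_content_stream("Hello World!", 10, 10))
```

A scanned page contains none of these text operators, only an instruction that paints an image. That is why text extraction comes back empty for such pages.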

Specifically, we'll take a look at how to apply Optical Character Recognition (OCR) to a scanned PDF document.

Installing borb

borb can be downloaded from source on GitHub, or installed via pip:

$ pip install borb
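Keep in mind that pip only takes care of the Python side. pytesseract is a wrapper around the tesseract program, which has to be installed separately through your operating system. Whether installing borb also pulls in pytesseract and Pillow may depend on the version, so the sketch below simply checks what is available. The function name check_ocr_prerequisites is our own and not part of any library.

```python
import importlib.util
import shutil


def check_ocr_prerequisites() -> dict:
    """Report which of the pieces used in this guide can be found."""
    modules = ("borb", "pytesseract", "PIL")
    # find_spec returns None when a top-level module is not installed
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract shells out to the tesseract executable, so it must be on the PATH
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


for name, found in check_ocr_prerequisites().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING should be installed before running the OCR step later in this guide.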

“My PDF Document Has No Text!”

This is one of the classic questions on any programming forum or helpdesk:

"My document does not seem to have text in it. Help?"

Or:

"Your text-extraction code sample does not work for my document. How come?"

The answer is often as straightforward as "your scanner hates you".

Most of the documents for which this doesn't work are PDF documents that are essentially glorified images. They contain all the metadata needed to constitute a PDF, but their pages are just large (often low-quality) images, created by scanning physical papers. As a consequence, there are no text-rendering instructions in these documents, and most PDF libraries will not be able to extract text from them.

borb, however, loves to help and can be applied in these cases, with built-in support for OCR. In this section we'll be using a special EventListener implementation called OCRAsOptionalContentGroup. This class uses tesseract (or rather pytesseract) to perform OCR (optical character recognition) on the Document.
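Before reaching for OCR, it can help to confirm that a page really has no extractable text. The following is a rough sketch of such a check. The helper names looks_image_only and page_needs_ocr are hypothetical, and the borb calls mirror the ones used later in this guide (PDF.loads with a SimpleTextExtraction listener), so they assume the same borb version.

```python
def looks_image_only(extracted_text: str, min_chars: int = 1) -> bool:
    """Heuristic: a page that yields (almost) no characters probably needs OCR."""
    # Ignore whitespace so that stray newlines do not count as content
    return len("".join(extracted_text.split())) < min_chars


def page_needs_ocr(pdf_path: str, page_nr: int = 0) -> bool:
    # Local imports, so looks_image_only stays usable without borb installed
    from borb.pdf.pdf import PDF
    from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

    listener = SimpleTextExtraction()
    with open(pdf_path, "rb") as pdf_file_handle:
        PDF.loads(pdf_file_handle, [listener])
    # Treat a missing result as empty text (an assumption about the return value)
    return looks_image_only(listener.get_text_for_page(page_nr) or "")


print(looks_image_only(""))             # a page with no text at all
print(looks_image_only("Lorem Ipsum"))  # a page with real text
```

Note that this only catches pages with no text at all. A page that mixes real text with a scanned image, like the one built below, would not be flagged.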

If you'd like to read more about OCR in Python, read our Guide to Simple Optical Character Recognition with PyTesseract!

Once finished, the recognized text is re-inserted in each Page as a special "layer" (in PDF this is called an "optional content group"). With the content now restored, the usual tricks (SimpleTextExtraction) yield the expected results.

You'll start by creating a function that builds a PIL Image with some text in it. This Image will then be inserted in a PDF.

Creating an Image

import typing
from pathlib import Path

from PIL import Image as PILImage  # type: ignore[import]
from PIL import ImageDraw, ImageFont

def create_image() -> PILImage.Image:
    # Create new Image
    img = PILImage.new("RGB", (256, 256), color=(255, 255, 255))

    # Create ImageFont
    # CAUTION: you may need to adjust the path to your particular font directory
    font = ImageFont.truetype("/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf", 24)

    # Draw text
    draw = ImageDraw.Draw(img)
    draw.text((10, 10),
              "Hello World!",
              fill=(0, 0, 0),
              font=font)

    # Return
    return img
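The hard-coded font path above only exists on some Linux systems. One possible way around this, sketched here under assumptions, is to try a list of candidate paths and fall back to Pillow's built-in default font. The candidate list and the helper names first_existing_font and load_font are purely illustrative; the paths are guesses and may not exist on your machine.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical candidates; replace them with fonts present on your system
FONT_CANDIDATES = [
    "/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf",
]


def first_existing_font(candidates: Iterable[str] = FONT_CANDIDATES) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


def load_font(size: int = 24):
    # Local import, so the lookup above can be used without Pillow installed
    from PIL import ImageFont

    path = first_existing_font()
    if path is None:
        # Built-in bitmap font; it is small, which may reduce OCR accuracy
        return ImageFont.load_default()
    return ImageFont.truetype(str(path), size)


print(first_existing_font())
```

In create_image you could then write font = load_font(24) instead of calling ImageFont.truetype with a fixed path.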

Now let's build a PDF with this image to represent our scanned document. The text inside the image isn't parsable, because it exists only as pixels rather than as text-rendering instructions:

import typing
# New imports
from borb.pdf.canvas.layout.image.image import Image
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

# Main method to create the document
def create_document():

    # Create Document
    d: Document = Document()

    # Create/add Page
    p: Page = Page()
    d.append_page(p)

    # Set PageLayout
    l: PageLayout = SingleColumnLayout(p)

    # Add Paragraph
    l.add(Paragraph("Lorem Ipsum"))

    # Add Image
    l.add(Image(create_image()))

    # Write
    with open("output_001.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)

The resulting document should look like this:

[Figure: PDF document with image]

When you select the text in this document, you'll see immediately that only the top line is actually text. The rest is an Image with text (the Image you created):

[Figure: the image is not selectable in the PDF]

Now, let's apply OCR to this document, and overlay actual text so that it becomes parsable:

# New imports
from pathlib import Path
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def apply_ocr_to_document():

    # Set up everything for OCR
    # CAUTION: adjust this path to the directory holding your tesseract language data (tessdata)
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)

    # Read Document
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    assert doc is not None

    # Store Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)
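The tessdata path in the function above is specific to one machine. As a sketch of a more portable setup, the helper below takes an explicit path if one is given and otherwise falls back to the TESSDATA_PREFIX environment variable, which tesseract itself uses to locate its language data. The function name resolve_tessdata_dir is hypothetical, and whether the directory that variable points to is exactly what OCRAsOptionalContentGroup expects is an assumption you should verify for your setup.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_tessdata_dir(explicit: Optional[str] = None) -> Path:
    """Pick the first usable tessdata directory: explicit argument, then environment."""
    candidates = [explicit, os.environ.get("TESSDATA_PREFIX")]
    for candidate in candidates:
        # Skip unset values and paths that are not existing directories
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    raise FileNotFoundError(
        "No tessdata directory found; pass a path or set TESSDATA_PREFIX"
    )


try:
    print(resolve_tessdata_dir())
except FileNotFoundError as error:
    print(error)
```

The returned Path could then be passed to OCRAsOptionalContentGroup in place of the hard-coded directory, and the explicit error replaces the bare assert.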

You can see this created an extra layer in the PDF. This layer is named "OCR by borb", and contains the rendering instructions borb re-inserted in the Document. You can toggle the visibility of this layer (this can be handy when debugging):

[Figure: hidden layer turned on]

[Figure: hidden layer turned off]

You can see that borb re-inserted the PostScript-style rendering commands to ensure "Hello World!" is in the Document. Let's hide this layer again.

Keep in mind OCR is a heuristic. The location and matched text may not always be 100% correct. That's just the way it goes. Typically, you'll keep the layer hidden (but selectable) so the original image is in place, and you can select/copy an approximation of it.
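If you know what a page is supposed to say (as we do for the test image in this guide), you can get a rough sense of how close the OCR result is. The sketch below uses difflib from the standard library. The helper names are our own, the sample "recognized" string is invented to mimic a typical OCR slip, and the ratio is only a coarse similarity measure, not a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case, so layout differences do not count."""
    return " ".join(text.split()).lower()


def ocr_similarity(expected: str, recognized: str) -> float:
    """Return a similarity ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, normalize(expected), normalize(recognized)).ratio()


# Invented example: the letter "o" misread as the digit "0"
score = ocr_similarity("Hello World!", "Hello  W0rld!")
print(f"similarity: {score:.2f}")
```

A score well below 1.0 on a clean, high-contrast image may be a sign that the language data or the image resolution needs attention.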

Now (even with the layer hidden), you can select the text:

[Figure: OCR text is selectable, even when invisible]

And if you apply SimpleTextExtraction now, you should be able to retrieve all the text in the Document.

# New imports
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def read_modified_document():

    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    print(l.get_text_for_page(0))


def main():
    create_document()
    apply_ocr_to_document()
    read_modified_document()


if __name__ == "__main__":
    main()

This prints:

Lorem Ipsum
Hello World!

Awesome!

Conclusion

In this guide you've learned how to apply OCR to PDF documents, ensuring your scanned documents are searchable and ready for future processing.

Reference: stackabuse.com
