Python ext:pdf – PDF extensions in Python

October 8, 2023
By Schenia T

The PDF extension libraries for Python (ext:pdf) let you work with PDF files. They allow you not only to read and write PDF files, but also to manipulate their contents, such as adding, removing and changing pages, form fields and metadata. Some of them can also convert PDF files into other formats, such as images and text.

PDF is one of the most popular file formats for documents, in both personal and professional use, and developers often need to handle these files in their applications. Fortunately, Python has several powerful and easy-to-use libraries for dealing with PDF files.

In this article, we will explore how to use these libraries to work with PDF files in Python. We will see how to install and import them, how to create and manipulate PDF files, and how to use some of their more advanced features.

Popular PDF extension libraries in Python

There are several popular PDF libraries for Python, each with its own functionality and use cases. Here are some of the best known; other libraries appear later in the article:

PyPDF2 : A lightweight and easy-to-use library for manipulating PDFs in Python. It provides functions for reading and writing PDFs, as well as adding, removing and manipulating pages. Its development has since continued under the name pypdf.
pdfminer : A PDF processing library for extracting information from PDFs, such as text and layout details. The maintained version is published as pdfminer.six.
pdfquery : A thin wrapper around pdfminer that lets you locate content in a PDF with jQuery-style or XPath selectors (it does not run SQL queries).
pdfkit : A wrapper around the wkhtmltopdf command-line tool that renders HTML (from a URL, a file or a string) to PDF. It does not read or edit existing PDFs.
reportlab : A report generation library that allows you to create complex PDFs from dynamic data. It includes functions for creating tables, graphs, images and text, and supports layout and style customization.
pstoedit : A command-line tool rather than a Python library. It converts PostScript and PDF graphics into other vector formats and can be called from Python as an external program.
pdf-reactor : PDFreactor is a commercial converter that renders HTML and CSS to PDF and offers a Python client.
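
Before choosing among these libraries, it can help to check which of them are already installed. The sketch below uses only the standard library. The import names in the tuple are assumptions about how each package is imported (the pip package name and the import name can differ), and the helper name installed_pdf_libraries is made up for this illustration.

```python
from importlib.util import find_spec

# Import names are assumptions: the pip package name and the import name can differ
CANDIDATES = ("PyPDF2", "pdfminer", "pdfquery", "pdfkit", "reportlab")


def installed_pdf_libraries(names=CANDIDATES):
    """Map each top-level import name to True if it can be found in this environment."""
    return {name: find_spec(name) is not None for name in names}


for name, available in installed_pdf_libraries().items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Anything reported as missing can usually be installed with pip.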

Features and Benefits of PDF Extensions in Python

PDF handling is a common task in Python, a high-level, interpreted programming language. These PDF libraries allow developers to create, crop and edit PDF files, as well as convert PDF files to other file formats.

Here are some of the main features and benefits of PDF extensions in Python:

1. Create PDF files

PDF libraries in Python allow developers to create PDF files from scratch. For this we use reportlab, one of the main PDF generation libraries in Python. With it, developers can create pages, add text and images, and define layouts and styles, among other features.

Example of how to create a PDF file using the reportlab library in Python:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

# Create a Canvas object with a letter-sized page
# (letter is a (width, height) tuple measured in points)
c = canvas.Canvas('example.pdf', pagesize=letter)

# add text to page (coordinates start at the bottom-left corner)
text = 'Hello, world!'
c.drawString(100, 750, text)

# add an image to the page (the file example.jpg must exist)
image = 'example.jpg'
c.drawImage(image, 100, 500)

# add a line to the page (x1, y1, x2, y2)
c.line(100, 250, 300, 250)

# add a rectangle to the page (x, y, width, height)
c.rect(100, 150, 300, 50)

# finish the page and save the PDF file
c.showPage()
c.save()

This example creates a PDF file called “example.pdf” with a letter-sized page (21.59 cm x 27.94 cm), with text, an image, a line and a rectangle drawn on the page.

The letter constant sets the page size, the Canvas object creates the page and draws elements on it, and the save method writes the PDF file to disk.
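
reportlab measures positions and sizes in points, where one point is 1/72 of an inch, so the letter page of 21.59 cm x 27.94 cm corresponds to 612 x 792 points. The small sketch below shows the conversion using only the standard library. The function name cm_to_points is hypothetical; reportlab also provides ready-made unit constants in reportlab.lib.units.

```python
POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def cm_to_points(cm):
    """Convert centimetres to PDF points (1 point = 1/72 inch)."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


# Letter paper is 21.59 cm x 27.94 cm, i.e. 8.5 x 11 inches
width, height = cm_to_points(21.59), cm_to_points(27.94)
print(round(width), round(height))  # 612 792
```

This is why a y coordinate of 750 places text near the top of a letter page: the origin is the bottom-left corner and the page is 792 points tall.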

2. Reading PDF files

PDF libraries also allow developers to read and analyze existing PDF files. Here we can use PyPDF2, one of the main PDF reading libraries in Python. With it, developers can access the content of a PDF file, such as text, images and metadata.

Now let’s see an example of how to read a PDF file using PyPDF2 in Python. It opens a PDF file called “example.pdf”, reads the number of pages it has, then reads the contents of the first page and prints it to standard output. The example uses the PdfReader class found in recent PyPDF2 releases; older releases called it PdfFileReader. See below:

import PyPDF2

# Open a PDF file
with open('example.pdf', 'rb') as f:
    # Create a PdfReader object
    reader = PyPDF2.PdfReader(f)

    # Read the number of pages in the PDF file
    page_count = len(reader.pages)
    print(f'Number of pages: {page_count}')

    # Read the content of the first page (pages are counted from 0)
    page_content = reader.pages[0].extract_text()
    print(page_content)

The PdfReader object is used to read the PDF file and access its pages and content. Its pages attribute behaves like a list, so len() gives the number of pages and index 0 gives the first page. The method extract_text is used to extract text from the page. All of this happens inside the with block, because the reader needs the file to stay open.
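
Before handing a file to a PDF library, a quick sanity check can save a confusing parser error. The sketch below tests whether a file starts with the PDF signature %PDF-. It is only a rough check and not full validation, and the helper name looks_like_pdf is invented for this example.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demo with a throwaway file so the snippet runs without a real PDF
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as f:
        f.write(b"%PDF-1.4\n")
    print(looks_like_pdf(sample))
```

A file that fails this check is almost certainly not a PDF; a file that passes can still be damaged further in.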

3. Editing PDF files

In addition to creating and reading PDF files, Python PDF libraries also allow developers to edit existing PDF files. A related library, pdftotext, extracts the text of PDF files as plain text.

In the example below, we will see how to edit a PDF file using PyPDF2 in Python. We open a PDF file called “example.pdf”, add a new page to the end of the file, and put text and an image on the new page. PyPDF2 has no drawing functions, so the new page is drawn with reportlab and then appended. Finally, the edited PDF file is saved as “edited_example.pdf”.

import io

from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

# Open a PDF file and copy its pages into a writer
reader = PdfReader('example.pdf')
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

# Draw the new page in memory with reportlab
buffer = io.BytesIO()
c = canvas.Canvas(buffer)

# Add text to new page
text = 'This is a new page!'
c.drawString(50, 50, text)

# Add an image to the new page (the file example.jpg must exist)
image = 'example.jpg'
c.drawImage(image, 100, 100)
c.save()
buffer.seek(0)

# Add the new page to the end of the PDF file
writer.add_page(PdfReader(buffer).pages[0])

# Save the edited PDF file
with open('edited_example.pdf', 'wb') as f:
    writer.write(f)

The PdfReader object is used to read the PDF file, and the PdfWriter object collects the pages of the edited document. The method add_page is used to add each page, including the new one. The reportlab methods drawString and drawImage are used to add text and an image to the new page. Finally, the method write is used to save the edited PDF file.

4. PDF file conversion

PDF libraries also allow developers to convert PDF files to other file formats. For example, the pdf2image library converts the pages of a PDF file into raster images, such as JPEG or PNG.

In this example of how to convert a PDF file to a text file using PyPDF2 in Python, we open a PDF file called “example.pdf” and extract the text from all of its pages. The text is then saved to a text file called “example.txt”. Look:

import PyPDF2

# Open a PDF file
with open('example.pdf', 'rb') as f:
    # Create a PdfReader object
    reader = PyPDF2.PdfReader(f)

    # Extract text from every page of the PDF file
    text = ''
    for page in reader.pages:
        text += page.extract_text()

# Save the text to a text file
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write(text)

The PdfReader object is used to read the PDF file and access its pages. The method extract_text is used to extract the text from each page of the PDF file. The text is collected in a variable and then saved to a text file using the write method of the file object returned by open.

5. Integration with other technologies

PDF libraries can also be combined with other technologies, such as Django, a web development framework for Python, or Selenium, a browser automation library often used for testing. This integration allows developers to create customized solutions for their specific needs.

Suppose we want to create a document management system that allows users to upload PDF files, extract information from them, and store it in a database. To do this, we can use PyPDF2 to handle the PDF files and a database, such as MySQL or MongoDB, to store the extracted information.

Here is an example:

import PyPDF2
import mysql.connector

# Create a connection to the database
cnx = mysql.connector.connect(
    user='user',
    password='password',
    host='localhost',
    database='data_bank'
)

# Create a table in the database to store the extracted information
cursor = cnx.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS pdf_info (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255), data TEXT)')

# Open a PDF file and extract its title and the text of the first page
with open('example.pdf', 'rb') as f:
    reader = PyPDF2.PdfReader(f)
    # The title is None when the file has no metadata
    name = reader.metadata.title if reader.metadata else None
    data = reader.pages[0].extract_text()

# Save the extracted information in the database
cursor.execute('INSERT INTO pdf_info (name, data) VALUES (%s, %s)', (name, data))
cnx.commit()

# Close the cursor and the connection to the database
cursor.close()
cnx.close()

This example uses the mysql.connector library to connect to a MySQL database and create a table that stores information extracted from PDF files: the document title read from the metadata and the text of the first page. The id column is generated by the database, so the INSERT statement only supplies the name and the data.
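
To try the storage step without a MySQL server, the same idea can be sketched with sqlite3 from the standard library. This is a swapped-in database, not the MySQL setup described above. The helper name store_pdf_info is hypothetical, and the title and text are placeholder strings standing in for values extracted from a PDF.

```python
import sqlite3


def store_pdf_info(conn, name, data):
    """Insert one extracted record and return its generated id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_info ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, data TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO pdf_info (name, data) VALUES (?, ?)", (name, data)
    )
    conn.commit()
    return cur.lastrowid


# In-memory database; the strings stand in for a title and page text read from a PDF
conn = sqlite3.connect(":memory:")
row_id = store_pdf_info(conn, "Example title", "Text of the first page")
print(conn.execute("SELECT name, data FROM pdf_info WHERE id = ?", (row_id,)).fetchone())
conn.close()
```

Passing the values as query parameters, rather than formatting them into the SQL string, keeps text extracted from untrusted PDFs from being interpreted as SQL.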

How to create a PDF file from a data model using PDF extension libraries in Python

To create a PDF file from a data model in Python, we use the pdfkit library.

pdfkit is a Python wrapper around the wkhtmltopdf command-line tool. It creates PDFs from HTML, whether the HTML comes from a URL, a file or a string. The content and appearance of the PDF, including tables, images, links and forms, are therefore defined with ordinary HTML and CSS.

Here is an example of how to create a PDF file from a data model in Python using pdfkit:

First, we install the pdfkit library with the pip command. The wkhtmltopdf program must also be installed separately, because pdfkit calls it to do the rendering:

pip install pdfkit

Next, we import the pdfkit library into the Python code:

import pdfkit

Now we turn the data we want to include into HTML and pass it to pdfkit. For example, when we have a dictionary with information about a set of products, we can build an HTML table from it and render that table to a PDF with pdfkit.from_string():

import html

import pdfkit

# Create a dictionary with information about products
products = {
    "Product 1": {
        "Name": "Product 1",
        "Price": 19.99,
        "Description": "This is product 1"
    },
    "Product 2": {
        "Name": "Product 2",
        "Price": 29.99,
        "Description": "This is product 2"
    },
    "Product 3": {
        "Name": "Product 3",
        "Price": 39.99,
        "Description": "This is product 3"
    }
}

# Build one table row per product, escaping the text values
rows = ''
for product, information in products.items():
    rows += '<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>'.format(
        html.escape(product),
        html.escape(information["Name"]),
        information["Price"],
        html.escape(information["Description"]),
    )

# Wrap the rows in a complete HTML document
document = '<html><head><meta charset="utf-8"></head><body><table border="1">' + rows + '</table></body></html>'

# Render the HTML and save the PDF
pdfkit.from_string(document, "products.pdf")

Now, we can add more information to the PDF, such as images, links and forms. Because pdfkit renders HTML, these are added as HTML elements before the document is converted.

To add images, we use the img element. Depending on the wkhtmltopdf version, loading local files may require the enable-local-file-access option. For example:

<img src="path/to/image.jpg">

For links, we use the a element. For example:

<a href="http://www.example.com">Link to website</a>

For forms, we use the usual HTML form fields; wkhtmltopdf can turn them into PDF form fields when the enable-forms option is set. For example:

<form>
    <input type="text" name="Name">
    <input type="email" name="Email">
    <input type="number" name="Telão">
</form>

pdfkit has no method for adding PDF annotations. Notes can be included as regular HTML text, or added afterwards with a PDF manipulation library. For example:

<p>This is an annotation example</p>

In the end, we generate the PDF with pdfkit.from_string(), where document is a string holding the complete HTML. For example:

pdfkit.from_string(document, "file_name.pdf")

This is the basic way to create a PDF using pdfkit in Python. We recommend consulting the official pdfkit and wkhtmltopdf documentation to learn more about the available options.
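
When the data model grows, a small typed class can make the HTML generation easier to reuse. The sketch below uses a dataclass and only the standard library. The names Product and render_products_html are assumptions made for this illustration, not part of pdfkit.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Product:
    name: str
    price: float
    description: str


def render_products_html(products):
    """Return a complete HTML document with one table row per product."""
    rows = "".join(
        f"<tr><td>{escape(p.name)}</td><td>{p.price:.2f}</td>"
        f"<td>{escape(p.description)}</td></tr>"
        for p in products
    )
    return (
        '<html><head><meta charset="utf-8"></head><body><table>'
        "<tr><th>Name</th><th>Price</th><th>Description</th></tr>"
        f"{rows}</table></body></html>"
    )


page = render_products_html([Product("Product 1", 19.99, "This is product 1")])
print(page)
# The resulting string could then be passed to pdfkit.from_string(page, "products.pdf")
```

Keeping the rendering in a plain function means the HTML can be inspected in a browser before any PDF is produced.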

Working with PDF metadata in Python

Now let’s learn how to add, read and validate metadata in PDF files in Python with different libraries.

Adding metadata to a PDF file in Python

To add metadata to a PDF file in Python, we use PyPDF2. This library allows you to read and write PDF files and also allows you to add, change and remove metadata.

Here is an example of how to add metadata to a PDF file using PyPDF2:

from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file and copy its pages into a writer,
# because a reader object cannot modify or save a file
reader = PdfReader('arquivo.pdf')
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

# Add metadata to PDF file (keys start with a slash and a capital letter)
writer.add_metadata({
    '/Title': 'My PDF file',
    '/Author': 'João da Silva',
    '/Creator': 'Python and PyPDF2',
    '/Producer': 'My PDF Creator'
})

# Save the PDF file with the added metadata
with open('arquivo-metadados.pdf', 'wb') as f:
    writer.write(f)

In this example, we are using the add_metadata method of the PdfWriter object to add four metadata entries to the PDF file: title, author, creator, and producer. We can add more entries as needed.
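
Entries in the PDF document information dictionary use names such as /Title and /Author, with a leading slash and a capital letter. If the metadata is kept elsewhere under plain lowercase names, a tiny helper can convert the keys. The function name to_info_dict is hypothetical, and the sketch uses only the standard library.

```python
def to_info_dict(fields):
    """Turn {'title': 'X'} into {'/Title': 'X'}; keys that already start with '/' are kept."""
    converted = {}
    for key, value in fields.items():
        if not key.startswith("/"):
            key = "/" + key[:1].upper() + key[1:]
        converted[key] = value
    return converted


print(to_info_dict({"title": "My PDF file", "author": "João da Silva"}))
```

The resulting dictionary has the key format that PDF metadata entries use.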

Reading information about PDF file in python

Now let’s use the info attribute of the pdfminer PDFDocument object to read information about the PDF file, such as the title, author, creator and producer. The example assumes the pdfminer.six package:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

# Open the PDF file
with open('arquivo.pdf', 'rb') as f:
    # Create a PDFDocument object to read the PDF file
    parser = PDFParser(f)
    doc = PDFDocument(parser)

    # Read information from PDF file (a list of metadata dictionaries)
    info = doc.info

    # Print the information
    print(info)
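
Metadata values read at this low level often arrive as raw bytes rather than text. The sketch below decodes them under a simplifying assumption: values that start with a UTF-16 byte-order mark are decoded as UTF-16, and everything else is treated as Latin-1, which only approximates the PDF default text encoding. The function name decode_pdf_string and the sample dictionary are made up for this example.

```python
def decode_pdf_string(value):
    """Decode a metadata value: UTF-16 if it starts with a byte-order mark, else Latin-1."""
    if not isinstance(value, bytes):
        return str(value)
    if value.startswith(b"\xfe\xff"):
        return value[2:].decode("utf-16-be")
    return value.decode("latin-1")


# Sample values shaped like entries of a PDF information dictionary
sample = {"Title": b"\xfe\xff\x00H\x00i", "Author": b"Jo\xe3o"}
print({key: decode_pdf_string(value) for key, value in sample.items()})
```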

Validating PDF data in Python

Now we have another example, using the pydantic library. We create a data model and let pydantic validate the metadata of a PDF file. pydantic does not read PDF files itself, so the metadata is first read with PyPDF2:

from pydantic import BaseModel, ValidationError
from PyPDF2 import PdfReader

# Create a data model for the PDF file
class PdfFile(BaseModel):
    title: str
    author: str
    creator: str
    producer: str

# Read the metadata of the PDF file (None if the file has no metadata)
meta = PdfReader('arquivo.pdf').metadata

# Validate PDF file data: creating the model checks every field
try:
    pdf_file = PdfFile(
        title=getattr(meta, 'title', None),
        author=getattr(meta, 'author', None),
        creator=getattr(meta, 'creator', None),
        producer=getattr(meta, 'producer', None),
    )
    print("The data in the PDF file is valid.")
except ValidationError:
    print("The data in the PDF file is not valid.")

In this example, we are creating a data model PdfFile with four fields: title, author, creator and producer. Next, we build a PdfFile object from the metadata of the PDF file; pydantic validates the data when the object is created.

If every field is present and is a string, the object is created and the message “The data in the PDF file is valid.” is printed. Otherwise, pydantic raises a ValidationError and the message “The data in the PDF file is not valid.” is printed.

This way, we can adapt this example for our own purposes by creating a custom data model for the PDF file and letting pydantic validate the data in it.

Examples of how to convert PDF files in Python

To convert PDF files to other formats using Python, we can use several libraries. They allow you to read PDF files and convert them to other formats, such as images, text and HTML, among others.

1. Converting PDF to image

Here is an example of how to convert a PDF file to a PNG image file. PyPDF2 cannot render pages as images, so this example uses pdf2image, which relies on the Poppler utilities being installed. We open a PDF file called arquivo.pdf and select the first page (page_number = 1) to be converted to a PNG image. Next, we use the convert_from_path() function to create the image and save it to disk with the name image.png. Look:

from pdf2image import convert_from_path

# Enter the number of the page we want to convert (pages are counted from 1)
page_number = 1

# Create an image from the selected page of the PDF file
images = convert_from_path('arquivo.pdf', first_page=page_number, last_page=page_number)

# Save the image to disk as PNG
images[0].save('image.png', 'PNG')

2. Converting PDF to HTML

In addition, we can convert PDF files into other formats, such as text and HTML. With PyPDF2, the extract_text() method of a page returns its text, but PyPDF2 has no HTML export.

To convert to HTML, we need a library that supports it. reportlab only generates PDFs, so this example uses pdfminer.six, whose extract_text_to_fp() function can write HTML. Look:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

# Enter the number of the page we want to convert (pages are counted from 0)
page_number = 0

# Open the PDF file for reading and the HTML file for writing
with open('arquivo.pdf', 'rb') as pdf_in, open('arquivo.html', 'wb') as html_out:
    # Write the page's HTML code to the HTML file
    extract_text_to_fp(
        pdf_in,
        html_out,
        output_type='html',
        laparams=LAParams(),
        page_numbers=[page_number],
    )

In this example, we are converting the first page of the PDF file to an HTML file. The page_numbers argument selects the page, output_type='html' selects the output format, and the result is saved to disk with the name arquivo.html.

3. Converting PDF to Excel

To convert a PDF file to an Excel file using Python, we use the pandas and openpyxl libraries. pandas cannot read PDF files on its own, so the tables are extracted with tabula-py, which requires Java.

Here is an example of how to convert a PDF file to an Excel file using these libraries:

import pandas as pd
import tabula  # installed as tabula-py

# Read the tables of the PDF file; the result is a list of DataFrames
tables = tabula.read_pdf('arquivo.pdf', pages='all')

# Combine the tables into a single DataFrame
# (this fails if no table was found in the PDF)
df = pd.concat(tables, ignore_index=True)

# Save the DataFrame to an Excel file; pandas uses openpyxl to write .xlsx
df.to_excel('arquivo.xlsx', index=False)

In this example, we are reading a PDF file named arquivo.pdf and using the read_pdf() function of tabula-py to create pandas.DataFrame objects from the tables in the file. Next, we are writing the combined DataFrame to an Excel file using to_excel(), which relies on openpyxl, and saving it to disk with the name arquivo.xlsx.

The function read_pdf() accepts several options, such as pages, area, lattice and stream, which can be used to customize the reading of the PDF file. Likewise, the method to_excel() accepts several options, such as sheet_name and index, which are used to customize the writing of the Excel file.

Keep in mind that the quality of the conversion may vary depending on the content of the PDF file and the configuration of the reading and writing options.

Working with attachments in a PDF file in Python

To add an attachment to a PDF file using PyPDF2, let’s follow these steps:

Install the PyPDF2 library using the command pip install PyPDF2.
Import the PyPDF2 library into the Python code.
Open the PDF file using the PdfReader class and copy its pages into a PdfWriter object.
Add the attachment using the add_attachment method of the PdfWriter object.
Save the updated PDF file using the write method of the PdfWriter object.

Here is a code example that adds an attachment to a PDF file using PyPDF2:

from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file and copy its pages into a writer
reader = PdfReader('document.pdf')
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

# Add the attachment (its name inside the PDF and its content as bytes)
with open('path/to/attachment.txt', 'rb') as f:
    writer.add_attachment('attachment.txt', f.read())

# Save the updated PDF file
with open('document_with_attachment.pdf', 'wb') as f:
    writer.write(f)

This code opens the PDF file document.pdf, adds an attachment named attachment.txt, and saves the updated PDF file as document_with_attachment.pdf.

Because the attachment is passed as raw bytes, we can use add_attachment to add attachments in other formats too, such as images, audio and video.
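
When attachments come in different formats, it can be useful to record their MIME type next to the file name, for example in a log or a database. A minimal sketch with the standard library mimetypes module follows; the helper name describe_attachment is hypothetical, and the type is guessed from the file extension only.

```python
import mimetypes
import os


def describe_attachment(path):
    """Return (file name, guessed MIME type) for a file that will be embedded."""
    mime_type, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown
    return os.path.basename(path), mime_type or "application/octet-stream"


print(describe_attachment("path/to/attachment.txt"))
print(describe_attachment("path/to/photo.jpg"))
```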

To read an attachment from a PDF file using PyPDF2, we walk the list of embedded files stored in the document, because PyPDF2 has no getAttachment method. The list alternates between attachment names and file specifications, and each file specification holds the attachment content.

Here is an example of code that reads an attachment from a PDF file using PyPDF2:

from PyPDF2 import PdfReader

# Open the PDF file
reader = PdfReader('document_with_attachment.pdf')

# Read the attachments (assumes a flat list of embedded files, as written by add_attachment)
names = reader.trailer['/Root']['/Names']['/EmbeddedFiles']['/Names']
attachments = {}
for i in range(0, len(names), 2):
    file_spec = names[i + 1].get_object()
    attachments[names[i]] = file_spec['/EF']['/F'].get_data()

# Print attachment content
print(attachments['attachment.txt'])

So, this code opens the PDF file document_with_attachment.pdf, searches for the attachment called attachment.txt and prints the contents of the attachment as bytes.

Schenia T

Data scientist, passionate about technology tools and games. Undergraduate student in Statistics at UFPB. Her hobby is binge-watching series, enjoying good music working or cooking, going to the movies and learning new things!