Python's PDF libraries let you work with PDF files. You can read and write PDFs, manipulate their contents (adding, removing and changing pages, form fields and metadata), and convert them into other formats such as images and plain text.
PDF is one of the most popular file formats for documents, in both personal and professional use, so developers often need to handle these files in their applications. Fortunately, Python has several powerful and easy-to-use libraries for the job.
In this article, we explore how to work with PDF files in Python: how to install and import the libraries, how to create and manipulate PDF files, and how to use some of their more advanced features.
Popular PDF extension libraries in Python
There are several popular PDF libraries for Python, each suited to different tasks. Here are some of the most common ones; other libraries appear later in the article:
- PyPDF2 : A lightweight, pure-Python library for manipulating PDFs. It reads and writes PDFs and can add, remove, merge and rotate pages. PyPDF2 is no longer developed under that name; the project continues as pypdf, which offers the same PdfReader and PdfWriter classes.
- pdfminer : A PDF parsing library (maintained today as pdfminer.six) that extracts text together with its layout information, such as positions and fonts, and can also extract images. It is often used to analyze the structure of a page, for example to separate text boxes, lines and figures.
- pdfquery : A PDF query library built on pdfminer. It loads a PDF into an XML-like element tree so that content can be located with jQuery-style or XPath selectors instead of manual parsing.
- pdfkit : A Python wrapper around the wkhtmltopdf command-line tool. It generates PDFs from HTML given as a URL, a file or a string, and requires wkhtmltopdf to be installed separately.
- reportlab : A PDF generation library for building complex documents from dynamic data. It includes functions for creating tables, charts, images and text, and supports layout and style customization.
- pstoedit : A command-line tool, rather than a Python library, that converts PostScript and PDF files into editable vector formats. It can be called from Python through the subprocess module.
- pdf-reactor : PDFreactor is a commercial HTML-to-PDF converter that can be driven from Python through its client API. It is aimed at producing PDFs from HTML and CSS rather than at editing existing files.
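Before trying the examples, it can help to check which of these libraries are already available in the current environment. The following is a minimal sketch that uses only the standard library; the list of import names is an assumption based on the packages named above (pdfminer is the import name provided by pdfminer.six), and the helper name installed_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names of the libraries discussed above (assumed, adjust as needed)
PDF_MODULES = ['PyPDF2', 'pdfminer', 'pdfquery', 'pdfkit', 'reportlab']


def installed_modules(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


for name, available in installed_modules(PDF_MODULES).items():
    print(f'{name}: {"installed" if available else "missing"}')
```

Anything reported as missing can be installed with pip before running the corresponding example.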
Features and Benefits of PDF Extensions in Python
Python is a high-level, interpreted programming language with a rich ecosystem of PDF libraries. These libraries let developers create, split and edit PDF files, and convert them to other file formats.
Here are some of their main features and benefits:
1. Create PDF files
Python's PDF libraries allow developers to create PDF files from scratch. For this we use reportlab, one of the main PDF generation libraries in Python. With it, developers can create pages, add text and images, and define layouts and styles, among other features.
Example of how to create a PDF file using reportlab in Python:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

# Create a Canvas object with a letter-sized page
c = canvas.Canvas('example.pdf', pagesize=letter)
# Add text to the page
text = 'Hello, world!'
c.drawString(100, 750, text)
# Add an image to the page (example.jpg must exist)
image = 'example.jpg'
c.drawImage(image, 100, 500)
# Add a line to the page (x1, y1, x2, y2)
c.line(100, 250, 300, 250)
# Add a rectangle to the page (x, y, width, height)
c.rect(100, 150, 300, 50)
# Finish the page and write the PDF file
c.showPage()
c.save()
This example creates a PDF file called “example.pdf” with a letter-sized page (21.59 cm x 27.94 cm) containing text, an image, a line and a rectangle. Coordinates are measured in points from the bottom-left corner of the page.
We pass the letter constant as pagesize to set the page size, use the Canvas object to create the page and add elements to it, and call the save method to write the PDF file.
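PDF coordinates are expressed in points, where one point is 1/72 of an inch, so the page size quoted in centimetres has to be converted before it can be compared with drawing coordinates. The sketch below shows the arithmetic with the standard library only; the helper name cm_to_points is hypothetical.

```python
POINTS_PER_INCH = 72
CM_PER_INCH = 2.54


def cm_to_points(cm: float) -> float:
    """Convert a length in centimetres to PDF points."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


# Letter paper: 21.59 cm x 27.94 cm
letter_width = cm_to_points(21.59)
letter_height = cm_to_points(27.94)
print(round(letter_width), round(letter_height))  # 612 792
```

In real code, reportlab already ships a cm constant in reportlab.lib.units, so a position can be written as 2 * cm instead of a hand-computed number.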
2. Reading PDF files
PDF libraries also allow developers to read and analyze existing PDF files. Here we use PyPDF2, one of the main PDF reading libraries in Python. With it, developers can access the content of a PDF file, such as text, images and metadata.
The following example opens a PDF file called “example.pdf”, reads the number of pages it has, then extracts the text of the first page and prints it to standard output:
from PyPDF2 import PdfReader

# Open a PDF file
with open('example.pdf', 'rb') as f:
    # Create a PdfReader object
    reader = PdfReader(f)
    # Read the number of pages in the PDF file
    page_count = len(reader.pages)
    print(f'Number of pages: {page_count}')
    # Read the content of the first page
    page_content = reader.pages[0].extract_text()
    print(page_content)
The PdfReader object opens the PDF file and gives access to its pages and content. The length of its pages list gives the number of pages, and indexing the list (reader.pages[0]) returns the first page. The extract_text method extracts the text from the page. These names apply to recent PyPDF2 releases; older releases used PdfFileReader, getNumPages, getPage and extractText, which PyPDF2 3.0 removed.
3. Editing PDF files
In addition to creating and reading PDF files, Python's PDF libraries also allow developers to edit existing PDF files. A related tool is the pdftotext library, which extracts the text of a PDF file so that it can be processed as plain text.
In the example below, we edit a PDF file using PyPDF2. We open a PDF file called “example.pdf”, add a new page with text and an image to the end of the file, and save the edited PDF file as “edited_example.pdf”. PyPDF2 cannot draw on a page itself, so the new page is generated with reportlab.
import io

from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

# Open a PDF file and copy its pages into a PdfWriter object
reader = PdfReader('example.pdf')
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Build the new page in memory with reportlab
buffer = io.BytesIO()
c = canvas.Canvas(buffer)
# Add text to the new page
text = 'This is a new page!'
c.drawString(50, 50, text)
# Add an image to the new page (example.jpg must exist)
image = 'example.jpg'
c.drawImage(image, 100, 100)
c.save()
buffer.seek(0)
# Append the new page to the end of the document
writer.add_page(PdfReader(buffer).pages[0])
# Save the edited PDF file
with open('edited_example.pdf', 'wb') as f:
    writer.write(f)
The PdfReader object opens the PDF file, and the PdfWriter object collects the pages of the new document. The reportlab methods drawString and drawImage add the text and the image to the new page, and the add_page method appends each page to the writer. Finally, the write method saves the edited PDF file.
4. PDF file conversion
PDF libraries also allow developers to convert PDF files to other file formats. For example, the pdf2image library converts the pages of a PDF file into raster images, such as JPEG or PNG.
The following example converts a PDF file to a text file using PyPDF2. We open a PDF file called “example.pdf”, extract the text from all of its pages, and save the text to a file called “example.txt”:
from PyPDF2 import PdfReader

# Open a PDF file
with open('example.pdf', 'rb') as f:
    # Create a PdfReader object
    reader = PdfReader(f)
    # Extract text from every page of the PDF file
    text = ''
    for page in reader.pages:
        text += page.extract_text()
# Save the text to a text file
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write(text)
The PdfReader object opens the PDF file and gives access to its pages. The extract_text method extracts the text from each page of the PDF file. The text is accumulated in a variable and then written to a text file using the write method of the file object returned by open.
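Text extracted from a PDF often carries layout residue, such as words hyphenated across line breaks and runs of extra spaces. The following is a minimal clean-up sketch that could be applied to the text before it is written to the file. The function name clean_extracted_text and the exact rules are assumptions for illustration, not part of any PDF library, and rejoining hyphens can occasionally merge a word that was meant to keep its hyphen.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Tidy common layout residue in text extracted from a PDF."""
    # Rejoin words split by a hyphen at a line break: "docu-\nment" -> "document"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r'[ \t]+', ' ', text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


sample = 'PDF docu-\nment   text\n\n\n\nNext paragraph '
print(clean_extracted_text(sample))
```

Rules like these are best tuned against a few real pages, since the residue depends on how the PDF was produced.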
5. Integration with other technologies
PDF libraries can also be combined with other technologies, such as Django, a web development framework for Python, or Selenium, a browser automation library commonly used for testing. This integration allows developers to build solutions tailored to their specific needs.
Suppose we want to create a document management system that allows users to upload PDF files, extracts information from them, and stores it in a database. To do this, we can use PyPDF2 to handle the PDF files and a database, such as MySQL or MongoDB, to store the extracted information.
Here is an example:
import mysql.connector
from PyPDF2 import PdfReader

# Create a connection to the database
cnx = mysql.connector.connect(
    user='user',
    password='password',
    host='localhost',
    database='data_bank'
)
# Create a table in the database to store the extracted information
cursor = cnx.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS pdf_info (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255), data TEXT)')
# Open a PDF file and extract its title and the text of the first page
with open('example.pdf', 'rb') as f:
    reader = PdfReader(f)
    # metadata is None when the file has no document information
    name = reader.metadata.title if reader.metadata else None
    data = reader.pages[0].extract_text()
# Save the extracted information in the database
cursor.execute('INSERT INTO pdf_info (name, data) VALUES (%s, %s)', (name, data))
cnx.commit()
# Close the cursor and the connection to the database
cursor.close()
cnx.close()
This example uses mysql.connector to connect to a MySQL database, create a table, and store the title and first-page text extracted from the PDF file with PyPDF2.
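To try the storage step without a running database server, the same idea can be sketched with SQLite from the standard library. This is a swapped-in database, not the MySQL setup described above. The function name store_pdf_info is hypothetical, and the name and data values stand in for the title and text that would come from the PDF file.

```python
import sqlite3


def store_pdf_info(conn, name, data):
    """Insert one extracted record and return the id of the new row."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pdf_info '
        '(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, data TEXT)'
    )
    # Placeholders keep the extracted text from being interpreted as SQL
    cursor = conn.execute(
        'INSERT INTO pdf_info (name, data) VALUES (?, ?)', (name, data)
    )
    conn.commit()
    return cursor.lastrowid


# An in-memory database, discarded when the connection is closed
conn = sqlite3.connect(':memory:')
row_id = store_pdf_info(conn, 'Example title', 'Text of the first page')
print(conn.execute('SELECT id, name, data FROM pdf_info').fetchall())
```

Passing the values as parameters matters in both versions, because text extracted from an uploaded PDF is untrusted input.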
How to create a PDF file from a data model using PDF extension libraries in Python
To create a PDF file from a data model in Python, we can use the pdfkit library.
pdfkit is a Python wrapper around the wkhtmltopdf tool, which renders HTML to PDF. Because the input is HTML, the content and appearance of the PDF can be customized with ordinary HTML and CSS, including tables, images, links and forms.
Here is an example of how to create a PDF file from a data model in Python using pdfkit:
- First, we install the pdfkit library with the pip command. The wkhtmltopdf program must also be installed on the system:
pip install pdfkit
- Next, we import the pdfkit library into the Python code:
import pdfkit
- Now, we build an HTML document from the data we want to include and pass it to pdfkit. For example, given a dictionary with information about a set of products, we can render it as an HTML table and convert that table to a PDF:
import html

import pdfkit

# Create a dictionary with information about products
products = {
    "Product 1": {
        "Name": "Product 1",
        "Price": 19.99,
        "Description": "This is product 1"
    },
    "Product 2": {
        "Name": "Product 2",
        "Price": 29.99,
        "Description": "This is product 2"
    },
    "Product 3": {
        "Name": "Product 3",
        "Price": 39.99,
        "Description": "This is product 3"
    }
}
# pdfkit renders HTML, so build one table row per product
rows = ''
for product, information in products.items():
    rows += (
        '<tr>'
        f'<td>{html.escape(product)}</td>'
        f'<td>{html.escape(information["Name"])}</td>'
        f'<td>{information["Price"]:.2f}</td>'
        f'<td>{html.escape(information["Description"])}</td>'
        '</tr>'
    )
# Wrap the rows in a complete HTML document
document = f'<html><body><table border="1">{rows}</table></body></html>'
# Render the HTML and save the PDF (requires wkhtmltopdf)
pdfkit.from_string(document, "products.pdf")
Because pdfkit renders HTML, further elements such as images, links and forms are added as HTML markup before the conversion. The snippets below assume that body is a string holding the HTML body being built.
- To add an image, we include an img tag. Recent wkhtmltopdf versions only load local files when the enable-local-file-access option is set. For example:
body += '<img src="path/to/image.jpg">'
- To add a link, we include an a tag. For example:
body += '<a href="http://www.example.com">Link to website</a>'
- To add a form, we include form fields and set the wkhtmltopdf option enable-forms, which turns them into PDF form fields. For example:
body += (
    '<form>'
    '<input type="text" name="Name">'
    '<input type="email" name="Email">'
    '<input type="number" name="Phone">'
    '</form>'
)
- pdfkit has no dedicated call for PDF annotations; a note can be rendered as regular HTML instead. For example:
body += '<p>This is an annotation example</p>'
- In the end, we render the HTML and save the PDF using the from_string function. For example:
pdfkit.from_string(f'<html><body>{body}</body></html>', "file_name.pdf", options={"enable-forms": None, "enable-local-file-access": None})
This is the basic way to create a PDF using pdfkit in Python. We recommend consulting the official pdfkit and wkhtmltopdf documentation to learn more about the available options.
Adding metadata to a PDF file in Python
Now let’s learn how to add metadata to PDF files in Python with different libraries.
To add metadata to a PDF file in Python, we use PyPDF2. This library reads and writes PDF files and can also add, change and remove metadata.
Here is an example of how to add metadata to a PDF file using PyPDF2:
from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file
with open('arquivo.pdf', 'rb') as f:
    # Create a PdfReader object to read the PDF file
    pdf_reader = PdfReader(f)
    # Copy the pages into a PdfWriter, which can hold new metadata
    pdf_writer = PdfWriter()
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)
    # Add metadata to the PDF file
    pdf_writer.add_metadata({
        '/Title': 'My PDF file',
        '/Author': 'João da Silva',
        '/Creator': 'Python and PyPDF2',
        '/Producer': 'My PDF Creator'
    })
    # Save the PDF file with the added metadata
    with open('arquivo-metadados.pdf', 'wb') as out:
        pdf_writer.write(out)
In this example, we use the add_metadata method of the PdfWriter object to add four metadata entries to the PDF file: title, author, creator and producer. A PdfReader is read-only, so the pages are first copied into a writer. The keys are PDF names and start with a slash; more entries, such as /Subject or /Keywords, can be added as needed.
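In the PDF document information dictionary, the standard keys are written as names with a leading slash, such as /Title and /Author. When the metadata comes from ordinary Python data with plain keys, a small conversion step helps avoid typos. The sketch below uses only the standard library; the helper name to_info_dict is hypothetical.

```python
# Text entries defined for the PDF document information dictionary
STANDARD_INFO_KEYS = {'title', 'author', 'subject', 'keywords', 'creator', 'producer'}


def to_info_dict(fields):
    """Convert plain keys such as 'title' into PDF names such as '/Title'."""
    info = {}
    for key, value in fields.items():
        if key.lower() not in STANDARD_INFO_KEYS:
            raise KeyError(f'Not a standard document information key: {key}')
        info['/' + key.capitalize()] = str(value)
    return info


print(to_info_dict({'title': 'My PDF file', 'author': 'João da Silva'}))
# {'/Title': 'My PDF file', '/Author': 'João da Silva'}
```

The resulting dictionary has the shape that a PDF writer expects when it adds metadata.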
Reading information about a PDF file in Python
Now let’s use the info attribute of the PDFDocument object from pdfminer to read information about the PDF file, such as the title, author, creator and producer:
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

# Open the PDF file
with open('arquivo.pdf', 'rb') as f:
    # Create a PDFDocument object to read the PDF file
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    # Read information from the PDF file (a list of dictionaries)
    info = doc.info
    # Print the information
    print(info)
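The values in the document information are typically returned as raw bytes rather than as Python strings. PDF text strings are encoded either as UTF-16BE with a leading byte order mark or in PDFDocEncoding. The sketch below decodes both cases with the standard library; the helper name decode_pdf_text is hypothetical, and Latin-1 is used as an approximation of PDFDocEncoding, which holds for common Western characters but not for every code point.

```python
UTF16_BE_BOM = b'\xfe\xff'


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string given as raw bytes."""
    if raw.startswith(UTF16_BE_BOM):
        # Unicode text string: drop the byte order mark, then decode
        return raw[len(UTF16_BE_BOM):].decode('utf-16-be')
    # Otherwise treat the bytes as PDFDocEncoding (approximated by Latin-1)
    return raw.decode('latin-1')


print(decode_pdf_text(b'\xfe\xff\x00J\x00o\x00\xe3\x00o'))  # João
print(decode_pdf_text(b'My PDF file'))  # My PDF file
```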
Validating PDF data in Python
Now we have another example that uses the pydantic library: we create a data model and validate the metadata of a PDF file against it. pydantic validates data when a model instance is created and raises ValidationError if a field is missing or has the wrong type:
from pydantic import BaseModel, ValidationError
from PyPDF2 import PdfReader

# Create a data model for the PDF file
class PdfFile(BaseModel):
    title: str
    author: str
    creator: str
    producer: str

# Read the metadata of the PDF file
with open('arquivo.pdf', 'rb') as f:
    metadata = PdfReader(f).metadata
# Missing entries (or missing metadata) become None
fields = {name: getattr(metadata, name, None) for name in ('title', 'author', 'creator', 'producer')}
# Validate PDF file data
try:
    pdf_file = PdfFile(**fields)
    print("The data in the PDF file is valid.")
except ValidationError:
    print("The data in the PDF file is not valid.")
In this example, we create a data model PdfFile with four fields: title, author, creator and producer. Next, we read the metadata of the PDF file with PyPDF2 and create a PdfFile object from it.
If the metadata is valid, the object is created and the message “The data in the PDF file is valid.” is printed. Otherwise, pydantic raises ValidationError and the message “The data in the PDF file is not valid.” is printed.
This way, we can adapt this example for our own purposes by creating a custom data model for the PDF file and letting pydantic validate the data it contains.
Examples of how to convert PDF files in Python
To convert PDF files to other formats using Python, we can use several libraries. They allow us to read PDF files and convert them to other formats, such as images, text and HTML, among others.
1. Converting PDF to image
Here is an example of how to convert a PDF file to a PNG image file. PyPDF2 cannot render pages, so we use pdf2image, which relies on the Poppler utilities being installed. In this example, we open a PDF file called arquivo.pdf and select the first page (page_number = 1) to be converted to a PNG image. Next, we use the convert_from_path() function to create the image and save it to disk with the name image.png:
from pdf2image import convert_from_path

# Enter the number of the page we want to convert (pages are counted from 1)
page_number = 1
# Render only the selected page; the result is a list of PIL images
images = convert_from_path('arquivo.pdf', first_page=page_number, last_page=page_number)
# Save the image to disk
images[0].save('image.png', 'PNG')
2. Converting PDF to HTML
In addition, we can convert PDF files into other formats, such as text and HTML. PyPDF2 provides extract_text() to obtain the text of a page, but it cannot produce HTML. For HTML output we need another library, for example pdfminer.six, whose extract_text_to_fp() function can write HTML:
from io import StringIO

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

# Index of the page we want to convert (pages are counted from 0 here)
page_index = 0
# Collect the generated HTML in memory
html = StringIO()
# Open the PDF file and convert the selected page to HTML
with open('arquivo.pdf', 'rb') as f:
    extract_text_to_fp(f, html, output_type='html', codec=None,
                       laparams=LAParams(), page_numbers=[page_index])
# Save the HTML file to disk
with open('arquivo.html', 'w', encoding='utf-8') as f:
    f.write(html.getvalue())
In this example, we convert the first page of the PDF file to an HTML file. The extract_text_to_fp() function analyzes the page layout and writes the corresponding HTML code into the in-memory buffer. Finally, we save the HTML to disk with the name arquivo.html.
3. Converting PDF to Excel
To convert a PDF file to an Excel file using Python, we use the pandas and openpyxl libraries. pandas cannot read PDF files itself, so the tables are extracted with tabula-py, which requires Java to be installed.
Here is an example of how to convert a PDF file to an Excel file using these libraries:
import pandas as pd
import tabula

# Extract every table in the PDF file as a list of pandas.DataFrame objects
tables = tabula.read_pdf('arquivo.pdf', pages='all')
# Write each table to its own sheet of an Excel file using openpyxl
with pd.ExcelWriter('arquivo.xlsx', engine='openpyxl') as writer:
    for number, df in enumerate(tables, start=1):
        df.to_excel(writer, sheet_name=f'Table {number}', index=False)
In this example, we open a PDF file named arquivo.pdf and use the read_pdf() function of tabula-py to create one pandas.DataFrame per table found in the file. Next, we write each DataFrame to an Excel sheet using to_excel() with the openpyxl engine, which saves the result to disk with the name arquivo.xlsx. The example assumes the PDF contains at least one table, since an Excel file needs at least one sheet.
The read_pdf() function accepts several options, such as pages and area, which can be used to customize the reading of the PDF file. Likewise, the to_excel() method accepts several options, such as sheet_name and index, which are used to customize the writing of the Excel file.
Keep in mind that the quality of the conversion may vary depending on the content of the PDF file and on the reading and writing options chosen.
Working with attachments in a PDF file in Python
To add an attachment to a PDF file using PyPDF2, we follow these steps:
- Install the PyPDF2 library using the command pip install PyPDF2.
- Import the PyPDF2 library into the Python code.
- Open the PDF file using the PdfReader class of PyPDF2 and copy its pages into a PdfWriter object.
- Add the attachment using the add_attachment method of the PdfWriter object.
- Save the updated PDF file using the write method of the PdfWriter object.
Here is a code example that adds an attachment to a PDF file using PyPDF2:
from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file and copy its pages into a writer
pdf = PdfReader('document.pdf')
writer = PdfWriter()
for page in pdf.pages:
    writer.add_page(page)
# Add the attachment: a display name plus the file content as bytes
with open('path/to/attachment.txt', 'rb') as f:
    writer.add_attachment('attachment.txt', f.read())
# Save the updated PDF file
with open('document_with_attachment.pdf', 'wb') as f:
    writer.write(f)
This code opens the PDF file document.pdf, adds an attachment named attachment.txt with the content read from path/to/attachment.txt, and saves the updated PDF file as document_with_attachment.pdf.
Since add_attachment receives the attachment content as bytes, it can also embed files in other formats, such as images, audio and video.
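To read an attachment from a PDF file using PyPDF2, we look up the embedded files in the document catalog. They are stored as a flat list that alternates between attachment names and file specifications, and the attachment content is the stream stored in each file specification.
Here is an example of code that reads an attachment from a PDF file using PyPDF2. It assumes the names are stored directly in the /Names array, as PyPDF2 writes them; files produced by other tools may split the list across /Kids entries:
from PyPDF2 import PdfReader

# Open the PDF file
with open('document_with_attachment.pdf', 'rb') as f:
    pdf = PdfReader(f)
    # Embedded files are listed as [name, file specification, name, ...]
    names = pdf.trailer['/Root']['/Names']['/EmbeddedFiles']['/Names']
    # Read the attachment
    for i in range(0, len(names), 2):
        if names[i] == 'attachment.txt':
            file_spec = names[i + 1].get_object()
            # Print attachment content
            print(file_spec['/EF']['/F'].get_data())
So, this code opens the PDF file document_with_attachment.pdf, searches for the attachment called attachment.txt and prints its contents as bytes.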
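In practice an attachment is usually saved to disk rather than printed. Because the attachment name comes from the PDF file itself, it should not be trusted as a path. The following is a minimal sketch using only the standard library; the helper name save_attachment and the fallback name attachment.bin are assumptions for illustration, and the bytes passed in stand in for the content read from the PDF.

```python
import os
import tempfile


def save_attachment(name: str, content: bytes, directory: str) -> str:
    """Write attachment bytes into directory and return the file path."""
    # Keep only the last path component so "../x" cannot leave the directory
    safe_name = os.path.basename(name.replace('\\', '/'))
    if safe_name in ('', '.', '..'):
        safe_name = 'attachment.bin'
    target = os.path.join(directory, safe_name)
    with open(target, 'wb') as out:
        out.write(content)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = save_attachment('../attachment.txt', b'hello', tmp)
    print(os.path.basename(path))  # attachment.txt
```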