How to Convert PDF Files to Word Using Python

Converting a PDF file to a Word document using Python is a common task that can save you hours of manual work. Recently, I embarked on a project to build a practical Python script for this purpose, and I’m excited to share the process with you. Let’s dive into how I solved this, added enhancements, and built a robust PDF-to-Word converter.

What Makes This Code Special

This project started with a simple goal: convert PDF files to Word documents efficiently. However, I wanted more than just a basic converter. Here’s how I enhanced the functionality:

  1. Dynamic Input and Output: You can enter custom file names for both the PDF file and the resulting Word document.
  2. Error Handling: The script checks if the PDF file exists and gracefully handles missing or inaccessible files.
  3. Valid .docx Output: The output is written with the python-docx library, so it opens as a regular Word document. Only the text is carried over; images, tables, and page layout from the PDF are not.
  4. Readable Output: Each page from the PDF is added as a separate paragraph, improving the readability of the Word file.
  5. Progress Indicator: The script includes a progress bar to keep you informed about the conversion process.

The Full Code

Here’s the Python code for the PDF-to-Word conversion project:

import os
from PyPDF2 import PdfReader
from docx import Document
from tqdm import tqdm

# Function to convert PDF to Word
def pdf_to_word(pdf_path, word_path):
    # Check if the PDF file exists
    if not os.path.exists(pdf_path):
        print(f"Error: The file '{pdf_path}' does not exist.")
        return

    try:
        # Open the PDF file
        pdf_reader = PdfReader(pdf_path)
        num_pages = len(pdf_reader.pages)

        # Create a Word document
        doc = Document()

        print(f"Converting '{pdf_path}' to '{word_path}'...")

        # Iterate through each page and extract text
        for page in tqdm(pdf_reader.pages, desc="Processing pages", unit="page"):
            text = page.extract_text() or ""  # Treat a missing result as empty text
            if text.strip():  # Check if the page contains text
                doc.add_paragraph(text)
            else:
                doc.add_paragraph("[This page is blank or contains non-extractable content]")

        # Save the Word document
        doc.save(word_path)
        print(f"Conversion complete! Saved as '{word_path}'.")

    except Exception as e:
        print(f"An error occurred: {e}")

# Main script
if __name__ == "__main__":
    # Get input and output file paths from the user
    pdf_file = input("Enter the path to the PDF file (e.g., 'clcoding.pdf'): ").strip()
    word_file = input("Enter the path to save the Word file (e.g., 'clcodingdocx.docx'): ").strip()

    # Perform the conversion
    pdf_to_word(pdf_file, word_file)
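
One thing the script leaves to the generic error handler is text that Word cannot store. Extracted PDF text sometimes contains control characters such as form feeds, and python-docx is likely to reject those with a ValueError because a .docx file is XML underneath. The sketch below is one way to clean each page's text before calling doc.add_paragraph(). The helper name clean_page_text is my own and is not part of the script above; treat it as an optional add-on rather than the only way to do this.

```python
import re

# Control characters that XML (and therefore .docx) cannot store.
# Tab (\x09), newline (\x0a) and carriage return (\x0d) are allowed, so they are kept.
_XML_UNSAFE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

PLACEHOLDER = "[This page is blank or contains non-extractable content]"


def clean_page_text(text):
    """Return text that should be safe to pass to doc.add_paragraph().

    Hypothetical helper for illustration, not part of the original script.
    """
    if not text:  # None or an empty string
        return PLACEHOLDER
    cleaned = _XML_UNSAFE.sub("", text).strip()
    # A page that held only whitespace or control characters counts as blank
    return cleaned if cleaned else PLACEHOLDER
```

Inside the loop, you could then write doc.add_paragraph(clean_page_text(page.extract_text())) and drop the if/else.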

How to Run the Code

Here’s a step-by-step guide to run this script:

Install Dependencies

To make this script work, you need the following Python libraries:

pip install PyPDF2 python-docx tqdm
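
If you want to confirm the install worked before running the converter, a small check like the one below can help. It is only a sketch using the standard library; the function name missing_modules is made up for this example. Note that the import names differ from the pip names: python-docx is imported as docx.

```python
import importlib.util

# Import names used by the script (python-docx is imported as "docx")
REQUIRED_MODULES = ["PyPDF2", "docx", "tqdm"]


def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```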

Save the Script

Save the code above as pdf_to_word.py in your preferred directory.

Run the Script

Open your terminal or command prompt, navigate to the directory containing pdf_to_word.py, and run:

python pdf_to_word.py

Provide Input/Output Paths

The script will prompt you to enter:

  • The path to the PDF file (e.g., clcoding.pdf).
  • The desired path for the Word file (e.g., clcodingdocx.docx).
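
The script uses whatever you type for the output path as-is. If you would like it to be more forgiving, one option is to fall back to the PDF's own name when the second prompt is left empty, and to add the .docx extension when it is missing. The helper below is a sketch of that idea; default_word_path is a hypothetical name, not something the script above defines.

```python
from pathlib import Path


def default_word_path(pdf_path, word_path=""):
    """Pick an output path for the Word file (hypothetical helper)."""
    if not word_path.strip():
        # No output given: reuse the PDF's folder and name with a .docx suffix
        return Path(pdf_path).with_suffix(".docx")
    out = Path(word_path.strip())
    # Append .docx if the user left the extension off
    if out.suffix.lower() != ".docx":
        out = out.with_name(out.name + ".docx")
    return out
```

For example, default_word_path("clcoding.pdf") gives clcoding.docx, and default_word_path("clcoding.pdf", "output") gives output.docx.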

Check the Output

After the script runs, you’ll find a Word document at the location you specified, with the text of each PDF page in its own paragraph.

Features and Benefits

  • Handles Missing Files: If the PDF file doesn’t exist, the script alerts you and exits gracefully.
  • Customizable Output: Allows you to name and locate the Word file as you prefer.
  • Readable Formatting: Ensures each page’s content is distinct, with blank or problematic pages marked clearly.
  • Progress Bar: Keeps you informed about the script’s progress, especially useful for large PDFs.
  • Error Reporting: Catches errors that occur during the conversion and prints them instead of crashing.

Real-World Applications

This script can be used in various scenarios:

  1. Document Management: Convert text-based contracts, manuals, or reports from PDF to editable Word format. Scanned PDFs are images and need OCR first; this script marks such pages as non-extractable.
  2. Education: Extract text from academic papers or e-books for editing or note-taking.
  3. Content Editing: Prepare content for blogs or articles by extracting text from PDFs.
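
How well the script works in these scenarios depends on whether the PDF actually contains text. A PDF made from scanned images usually yields no text at all. The sketch below shows one rough way to check this from the per-page results; the function names and the 10% threshold are assumptions of mine, chosen for illustration only.

```python
def extractable_ratio(page_texts):
    """Fraction of pages that yielded any text (0.0 for an empty list)."""
    if not page_texts:
        return 0.0
    with_text = sum(1 for t in page_texts if t and t.strip())
    return with_text / len(page_texts)


def probably_scanned(page_texts, threshold=0.1):
    """Guess that a PDF is scanned if almost no page has text.

    The threshold is an arbitrary assumption; tune it for your files.
    """
    return extractable_ratio(page_texts) < threshold


# With the script's reader, the input could be built like this:
#   page_texts = [page.extract_text() for page in PdfReader(pdf_path).pages]
```

If probably_scanned returns True, the Word file will mostly contain placeholder lines, and an OCR step is likely needed before conversion.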

Final Thoughts

This PDF-to-Word conversion script is an excellent example of how Python can automate repetitive tasks and save time. By adding features like error handling, proper formatting, and progress tracking, I ensured the script is practical and user-friendly. Whether you’re a beginner or an experienced programmer, this project demonstrates the power of Python for solving everyday problems.
