PDF Processing System

A streamlined skill for common PDF processing tasks using Python. Covers text extraction, page manipulation, basic creation, and format conversion with minimal setup and clear examples.

When to Use This Skill

Choose this skill when:

Quickly extracting text from PDF files for processing
Performing basic PDF operations (merge, split, rotate)
Converting PDFs to text or images
Filling simple PDF forms
Building lightweight PDF processing scripts

Consider alternatives when:

Needing advanced AI-assisted analysis → use a PDF Anthropic skill
Creating complex PDF reports → use a reportlab/PDF creation skill
Processing batches of documents at scale → use a PDF pipeline skill
Working with scanned documents → use an OCR-focused skill

Quick Start


import pdfplumber

# Extract text from a PDF
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            print(text)


# Merge multiple PDFs
from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
for pdf_file in ['part1.pdf', 'part2.pdf', 'part3.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open('combined.pdf', 'wb') as output:
    writer.write(output)

Core Concepts

Common Operations

Task	Code
Extract text	`pdfplumber.open(f).pages[0].extract_text()`
Page count	`len(PdfReader(f).pages)`
Merge PDFs	Loop: `writer.add_page(page)`
Split PDF	Each page → separate `PdfWriter`
Rotate page	`page.rotate(90)`
Get metadata	`PdfReader(f).metadata`
Encrypt	`writer.encrypt('password')`
List embedded images	`page.images` (pdfplumber; returns image metadata such as position and size)
PDF to images	`pdf2image.convert_from_path(f)`
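
The "Split PDF" row above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `split_pdf` and `page_output_path` and the `_pNN` file-naming scheme are assumptions made for this example. PyPDF2 is imported inside the function so the path helper can be used without the library installed.

```python
from pathlib import Path


def page_output_path(out_dir: Path, stem: str, page_number: int, total: int) -> Path:
    """Build a zero-padded output path such as out/report_p03.pdf (hypothetical naming)."""
    width = len(str(total))
    return out_dir / f"{stem}_p{page_number:0{width}d}.pdf"


def split_pdf(pdf_path: str, out_dir: str) -> list[Path]:
    """Write each page of pdf_path to its own single-page PDF."""
    # Imported here so the path helper above stays usable without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    source = Path(pdf_path)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(str(source))
    total = len(reader.pages)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()  # one writer per page, as in the table above
        writer.add_page(page)
        out_path = page_output_path(target, source.stem, number, total)
        with open(out_path, "wb") as handle:
            writer.write(handle)
        written.append(out_path)
    return written
```

Calling `split_pdf("document.pdf", "pages")` would write one file per page into `pages/` and return the list of paths.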

Text Processing Pipeline


import pdfplumber


def process_pdf_text(pdf_path: str) -> dict:
    """Extract and structure text content from a PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        pages = []
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ''
            tables = page.extract_tables() or []
            pages.append({
                'page_number': i + 1,
                'text': text,
                'word_count': len(text.split()),
                'has_tables': len(tables) > 0,
                'table_count': len(tables),
            })

        return {
            'total_pages': len(pages),
            'total_words': sum(p['word_count'] for p in pages),
            'pages': pages,
        }

Format Conversion


# PDF to text
pdftotext document.pdf output.txt

# PDF to images (one per page)
pdftoppm document.pdf output -png -r 300

# PDF to HTML
pdftohtml document.pdf output.html

# Images to PDF
img2pdf *.png -o output.pdf

# HTML to PDF
wkhtmltopdf page.html output.pdf
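
For batch conversion, the `pdftotext` command above can be driven from Python. The sketch below assumes `pdftotext` (from poppler-utils) is on the PATH; the function names `pdftotext_command` and `convert_directory` are hypothetical and exist only for this example.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the pdftotext call that writes <out_dir>/<name>.txt."""
    return ["pdftotext", str(pdf), str(out_dir / f"{pdf.stem}.txt")]


def convert_directory(src_dir: str, out_dir: str) -> list[Path]:
    """Convert every PDF in src_dir to text; return the PDFs that failed."""
    source, target = Path(src_dir), Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(source.glob("*.pdf")):
        result = subprocess.run(
            pdftotext_command(pdf, target), capture_output=True, text=True
        )
        if result.returncode != 0:
            failed.append(pdf)  # keep going; report failures at the end
    return failed
```

Collecting failures instead of raising on the first one keeps a single unreadable file from stopping the whole batch.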

Configuration

Parameter	Type	Default	Description
`extractionLib`	string	`'pdfplumber'`	Library: pdfplumber or PyPDF2
`dpi`	number	`300`	DPI for PDF-to-image conversion
`encoding`	string	`'utf-8'`	Output text encoding
`pageRange`	string	`'all'`	Pages to process: all, 1-5, or specific numbers
`preserveLayout`	boolean	`false`	Attempt to preserve spatial text layout
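
The `pageRange` parameter accepts `all`, a range such as `1-5`, or specific numbers. One way to turn such a string into page numbers is sketched below. The comma-separated syntax for specific numbers and the function name `parse_page_range` are assumptions, since the table does not define the exact format.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Turn 'all', '1-5', or '1,3,7' into sorted 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))

    selected = set()
    for part in spec.split(","):  # assumed separator for specific numbers
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"page range {part!r} is outside 1-{total_pages}")
        selected.update(range(start, end + 1))
    return sorted(selected)
```

Because pdfplumber and PyPDF2 index pages from zero, subtract 1 from each returned number before indexing into `pdf.pages` or `reader.pages`.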

Best Practices

Use pdfplumber for reading, PyPDF2 for writing: pdfplumber is the stronger choice for text and table extraction, while PyPDF2 is better suited to structural operations (merge, split, encrypt, rotate).
Check for empty text before processing: some PDF pages are images with no extractable text. Verify that page.extract_text() returns non-empty content before downstream processing.
Handle exceptions per page rather than per document: one corrupted page shouldn't prevent processing the other 99. Wrap individual page processing in try/except and log errors per page.
Use CLI tools for batch format conversion: Python libraries work for individual files, but pdftotext, pdftoppm, and wkhtmltopdf are typically faster and more reliable for batch operations.
Close PDF files explicitly or use context managers: an open PDF object holds the underlying file handle. Use with statements or explicitly close readers to avoid file-locking issues.
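
The empty-text check and per-page exception handling described above might be combined as in the sketch below. The function name `extract_pages_safely` is hypothetical. It accepts any sequence of objects with an `extract_text()` method, such as `pdf.pages` from pdfplumber.

```python
import logging

logger = logging.getLogger(__name__)


def extract_pages_safely(pages):
    """Extract text page by page, recording failures instead of aborting."""
    results, failures = [], []
    for number, page in enumerate(pages, start=1):
        try:
            text = page.extract_text() or ""
        except Exception as exc:  # one bad page should not stop the rest
            logger.warning("page %d failed: %s", number, exc)
            failures.append((number, str(exc)))
            continue
        if not text.strip():
            # Probably an image-only page; it may need OCR instead.
            logger.info("page %d has no extractable text", number)
        results.append({"page_number": number, "text": text})
    return results, failures


# Typical use with pdfplumber (the context manager closes the file handle):
#
#     with pdfplumber.open("document.pdf") as pdf:
#         results, failures = extract_pages_safely(pdf.pages)
```

Returning the failures alongside the results lets the caller decide whether a partially extracted document is acceptable.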

Common Issues

pdfplumber crashes on specific PDFs — Some PDFs have malformed internal structures. Wrap extraction in try/except and fall back to PyPDF2 or CLI tools (pdftotext) for problematic files.
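
A fallback chain along these lines is sketched below. The function names are hypothetical, and the order (pdfplumber, then PyPDF2, then `pdftotext`) simply follows the sentence above. The libraries are imported inside each extractor, so a missing library causes a fall-through to the next option instead of an import error at load time. The `pdftotext` step assumes poppler-utils is installed.

```python
import subprocess


def extract_with_pdfplumber(path):
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf2(path):
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdftotext(path):
    # "-" as the output name sends the text to stdout instead of a file.
    done = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return done.stdout


DEFAULT_EXTRACTORS = (
    extract_with_pdfplumber,
    extract_with_pypdf2,
    extract_with_pdftotext,
)


def extract_text_with_fallback(path, extractors=DEFAULT_EXTRACTORS):
    """Try each extractor in order; return the first result that succeeds."""
    errors = []
    for extractor in extractors:
        try:
            return extractor(path)
        except Exception as exc:  # malformed PDF, missing library, missing CLI
            errors.append(f"{extractor.__name__}: {exc}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

The collected error messages make it possible to see which extractor rejected the file and why.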

Text extraction misses content in text boxes or annotations — pdfplumber extracts body text but may miss text in annotations, comments, or form fields. Check page.annots for annotation content.

Large PDFs consume excessive memory — Process pages one at a time rather than loading all pages simultaneously. Use generators for lazy page processing.
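
Lazy page processing with a generator could look like the sketch below. The names `iter_page_text` and `count_words` are hypothetical. The call to `flush_cache()` reflects pdfplumber's per-page object cache as I understand it; it is guarded with `getattr` so the generator also works with page objects that lack the method.

```python
def iter_page_text(pages):
    """Yield (page_number, text) one page at a time."""
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        # Release cached layout objects for this page, if the page supports it.
        flush = getattr(page, "flush_cache", None)
        if callable(flush):
            flush()
        yield number, text


def count_words(pages):
    """Consume the generator without keeping every page's text in memory."""
    return sum(len(text.split()) for _, text in iter_page_text(pages))


# Typical use with pdfplumber:
#
#     with pdfplumber.open("large.pdf") as pdf:
#         total = count_words(pdf.pages)
```

Only one page's text is held at a time, so memory use stays roughly constant as the page count grows.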
