PDF System

A practical skill for PDF processing: creation, text extraction, merging, splitting, form filling, and conversion. Covers both Python and command-line approaches for common PDF operations.

When to Use This Skill

Choose this skill when:

Extracting text and metadata from PDF documents
Creating PDF reports with tables, charts, and formatting
Merging, splitting, or rotating PDF files
Filling PDF forms programmatically
Converting between PDF and other formats (images, HTML, DOCX)

Consider alternatives when:

Working with DOCX files → use a DOCX skill
Creating presentations → use a PPTX skill
Building a web-based PDF viewer → use a PDF.js skill
Running OCR on scanned documents → use an OCR skill

Quick Start


# Install PDF tools
# (PyPDF2 3.x is the final PyPDF2 release; its maintained successor is pypdf)
pip install PyPDF2 reportlab pdfplumber
# CLI tools
brew install poppler  # pdftotext, pdfimages
brew install qpdf     # merge, split, encrypt


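# Extract text from PDF
import pdfplumber

def extract_text(pdf_path: str) -> str:
    """Return the text of every page, with pages separated by a blank line."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None for pages with no text layer
        pages = [page.extract_text() or '' for page in pdf.pages]
    return '\n\n'.join(pages).strip()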

# Extract tables from PDF
def extract_tables(pdf_path: str) -> list:
    with pdfplumber.open(pdf_path) as pdf:
        all_tables = []
        for page in pdf.pages:
            tables = page.extract_tables()
            all_tables.extend(tables)
        return all_tables

Core Concepts

PDF Operations Matrix

Operation	Python Library	CLI Tool
Read text	pdfplumber, PyPDF2	pdftotext (poppler)
Read tables	pdfplumber, camelot	tabula-java
Create PDF	reportlab, fpdf2	wkhtmltopdf
Merge PDFs	PyPDF2	qpdf, pdfunite
Split PDF	PyPDF2	qpdf
Form filling	PyPDF2, pdfrw	pdftk
Encrypt	PyPDF2	qpdf
Images to PDF	Pillow + reportlab	img2pdf
PDF to images	pdf2image	pdftoppm
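
The CLI column can also be driven from Python through `subprocess`. The following is a minimal sketch, assuming `qpdf` and the poppler utilities are installed and on `PATH`; the helper names (`qpdf_merge_cmd`, `pdftotext_cmd`, `pdftoppm_cmd`, `run`) are hypothetical and not part of any library.

```python
import subprocess


def qpdf_merge_cmd(inputs: list[str], output: str) -> list[str]:
    """Build a qpdf command that concatenates all pages of `inputs`."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    # `--empty` starts from a blank document; `--pages ... --` lists the sources.
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def pdftotext_cmd(pdf_path: str, keep_layout: bool = True) -> list[str]:
    """Build a pdftotext command that writes the text to stdout ('-')."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf_path, "-"]


def pdftoppm_cmd(pdf_path: str, prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm command that renders each page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def run(cmd: list[str]) -> str:
    """Run a command, raise on a non-zero exit status, return stdout as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Building the argument list separately from running it keeps the commands easy to log and test. Note that `qpdf` exits with status 3 when it succeeds with warnings, so `check=True` may be stricter than wanted for that tool.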

PDF Report Generation


from reportlab.lib import colors
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer

def create_report(output_path: str, title: str, data: list[list]):
    doc = SimpleDocTemplate(output_path, pagesize=A4)
    styles = getSampleStyleSheet()
    elements = []

    # Title
    elements.append(Paragraph(title, styles['Title']))
    elements.append(Spacer(1, 20))

    # Table
    table = Table(data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTSIZE', (0, 0), (-1, 0), 12),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
    ]))
    elements.append(table)

    doc.build(elements)

# Usage
create_report('report.pdf', 'Sales Report Q1', [
    ['Product', 'Units', 'Revenue'],
    ['Widget A', '1,200', '$36,000'],
    ['Widget B', '800', '$24,000'],
])

PDF Manipulation


import os

from PyPDF2 import PdfReader, PdfWriter

def merge_pdfs(paths: list[str], output: str):
    # Append every page of each input, in the order given
    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    with open(output, 'wb') as f:
        writer.write(f)

def split_pdf(path: str, output_dir: str):
    # Write each page to its own file: page_1.pdf, page_2.pdf, ...
    os.makedirs(output_dir, exist_ok=True)  # create the target directory if missing
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(os.path.join(output_dir, f'page_{i+1}.pdf'), 'wb') as f:
            writer.write(f)

def encrypt_pdf(path: str, password: str, output: str):
    reader = PdfReader(path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(password)
    with open(output, 'wb') as f:
        writer.write(f)
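
Splitting often means pulling out a page range rather than writing every page to its own file. The sketch below shows one way to do that, assuming the same PyPDF2 reader/writer API used above; `parse_page_ranges` and `extract_pages` are hypothetical helper names introduced only for illustration.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as '1-3,5' into sorted 0-based page indices."""
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_s, sep, end_s = part.partition("-")
        start = int(start_s)
        # A bare number selects one page; 'a-b' selects an inclusive range.
        end = int(end_s) if sep else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"range {part!r} is outside 1..{page_count}")
        indices.update(range(start - 1, end))
    return sorted(indices)


def extract_pages(path: str, spec: str, output: str) -> None:
    """Copy the pages selected by `spec` from `path` into `output`."""
    # Imported lazily so the range parser works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output, "wb") as f:
        writer.write(f)
```

For example, `extract_pages('report.pdf', '1-3,5', 'excerpt.pdf')` would copy pages 1, 2, 3, and 5 into a new file.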

Configuration

Parameter	Type	Default	Description
`extractionEngine`	string	`'pdfplumber'`	Text extraction: pdfplumber, PyPDF2, or poppler
`creationEngine`	string	`'reportlab'`	PDF creation: reportlab or fpdf2
`pageSize`	string	`'A4'`	Default page size: A4, Letter, or Legal
`ocrFallback`	boolean	`false`	Use OCR when text extraction fails
`imageFormat`	string	`'png'`	Format for PDF-to-image conversion
`imageDPI`	number	`300`	DPI for image conversion

Best Practices

Use pdfplumber for text extraction, not PyPDF2 — pdfplumber handles complex layouts, multi-column text, and tables significantly better than PyPDF2's basic text extraction. PyPDF2 is better for structural manipulation (merge, split, encrypt).
Handle scanned PDFs with OCR fallback — Many PDFs are scanned images with no extractable text. Check if extracted text is empty and fall back to Tesseract OCR: pytesseract.image_to_string(page_image).
Use reportlab for complex PDF creation, fpdf2 for simple ones — reportlab offers full layout control with tables, charts, and precise positioning. fpdf2 is simpler for text-heavy documents. Choose based on complexity requirements.
Process large PDFs page by page, not all at once — Loading a 1000-page PDF entirely into memory can exhaust RAM. Process pages iteratively using generators and write output incrementally.
Validate PDF output by opening in multiple viewers — PDFs that render correctly in one viewer may break in another. Test generated PDFs in Adobe Reader, Chrome's built-in viewer, and Preview (macOS) to ensure compatibility.
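
The OCR-fallback and page-by-page practices above can be combined in one generator. This is a minimal sketch, assuming `pdf2image` (which needs poppler) and `pytesseract` (which needs a Tesseract install) are available; the function names and the `min_chars` threshold of 10 are illustrative assumptions, not part of any library.

```python
from typing import Callable, Iterator, Optional


def needs_ocr(text: Optional[str], min_chars: int = 10) -> bool:
    """Treat a page as scanned when it yields (almost) no extractable text."""
    return len((text or "").strip()) < min_chars


def ocr_page(pdf_path: str, page_number: int, dpi: int = 300) -> str:
    """OCR a single 1-based page by rendering it to an image first."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0]) if images else ""


def iter_page_text(
    pdf_path: str,
    ocr: Optional[Callable[[str, int], str]] = ocr_page,
) -> Iterator[tuple[int, str]]:
    """Yield (page_number, text) one page at a time; pass ocr=None to disable OCR."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if ocr is not None and needs_ocr(text):
                # Only pages without a usable text layer pay the OCR cost.
                text = ocr(pdf_path, number)
            yield number, text
```

A caller can write each page's text to disk as it arrives (`for number, text in iter_page_text('big.pdf'): ...`) instead of accumulating the whole document in one string.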

Common Issues

Text extraction returns garbled characters — The PDF may use custom font encoding or non-standard character mappings. Try different extraction libraries (pdfplumber vs poppler's pdftotext). Some PDFs require OCR even though they appear to contain text.
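
One way to decide when to switch extractors is to measure how much of the output looks like failed glyph mapping. The sketch below is a heuristic under stated assumptions: it counts `(cid:N)` placeholders (which pdfminer-based extractors emit for glyphs without a Unicode mapping) and U+FFFD replacement characters, and the 5% threshold is an arbitrary starting point. The function names are hypothetical.

```python
import re
import subprocess

_CID = re.compile(r"\(cid:\d+\)")


def garbled_ratio(text: str) -> float:
    """Fraction of characters that look like unmapped glyphs."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in _CID.findall(text))
    bad = cid_chars + text.count("\ufffd")
    return bad / len(text)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """True when more than `threshold` of the text is suspect."""
    return garbled_ratio(text) > threshold


def extract_with_pdftotext(pdf_path: str) -> str:
    """Second opinion from poppler's pdftotext ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

If both extractors produce garbled output, the font mapping itself is likely unusable and OCR on the rendered page is the remaining option.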

Merged PDF has incorrect page orientation — Source PDFs may have different page sizes or orientations. Check and normalize page dimensions before merging, or rotate pages to match a target orientation.
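
Orientation can be normalized during the merge itself. The following sketch assumes the PyPDF2 page API (`mediabox`, `rotate`) and introduces two hypothetical helpers, `rotation_to_match` and `merge_with_orientation`. It only makes every page portrait or landscape; it does not rescale pages of different sizes, and it reads `/Rotate` from the page itself without resolving a value inherited from a parent node.

```python
def rotation_to_match(width: float, height: float, rotate: int,
                      target: str = "portrait") -> int:
    """Return the extra clockwise rotation (0 or 90) needed to display as `target`."""
    if target not in ("portrait", "landscape"):
        raise ValueError("target must be 'portrait' or 'landscape'")
    if width == height:
        return 0  # square pages have no orientation to fix
    landscape = width > height
    if rotate % 180 == 90:
        # An existing 90/270 degree rotation swaps the displayed width and height.
        landscape = not landscape
    return 0 if landscape == (target == "landscape") else 90


def merge_with_orientation(paths: list[str], output: str,
                           target: str = "portrait") -> None:
    """Merge PDFs, rotating pages whose displayed orientation differs from `target`."""
    from PyPDF2 import PdfReader, PdfWriter  # lazy import keeps the helper testable

    writer = PdfWriter()
    for path in paths:
        for page in PdfReader(path).pages:
            box = page.mediabox
            current = int(page["/Rotate"]) if "/Rotate" in page else 0
            extra = rotation_to_match(float(box.width), float(box.height),
                                      current, target)
            if extra:
                page.rotate(extra)
            writer.add_page(page)
    with open(output, "wb") as f:
        writer.write(f)
```

A 90 degree turn fixes the orientation class but cannot know which way is "up", so rotated pages are worth a visual check.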

Table extraction misaligns columns — PDF tables don't have true table structure — they're just positioned text. pdfplumber's extract_tables() uses heuristics that can fail on complex layouts. Specify explicit table boundaries with table_settings for better accuracy.
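
As an illustration of explicit boundaries, the sketch below crops the page to the table's bounding box and passes known column x-positions through `table_settings`. The helper names are hypothetical, and the coordinates in any real call would have to be measured from the actual PDF (for example by inspecting `page.words` or a debug image).

```python
def explicit_column_settings(column_edges: list[float],
                             snap_tolerance: float = 3) -> dict:
    """Build pdfplumber table_settings with fixed vertical column boundaries."""
    if len(column_edges) < 2:
        raise ValueError("need at least two x-positions to bound one column")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": sorted(column_edges),
        # Rows are still inferred from how the text lines up.
        "horizontal_strategy": "text",
        "snap_tolerance": snap_tolerance,
    }


def extract_table_region(pdf_path: str, page_number: int,
                         bbox: tuple[float, float, float, float],
                         column_edges: list[float]) -> list:
    """Extract tables from one region of a 1-based page.

    `bbox` is (x0, top, x1, bottom) in PDF points, as pdfplumber expects.
    """
    import pdfplumber  # lazy import keeps the settings helper usable on its own

    with pdfplumber.open(pdf_path) as pdf:
        region = pdf.pages[page_number - 1].crop(bbox)
        return region.extract_tables(
            table_settings=explicit_column_settings(column_edges)
        )
```

Cropping first keeps headers, footers, and neighbouring text out of the heuristics, which is often enough to stop columns from drifting.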
