PDF Official Toolkit

A standardized skill for PDF processing following best practices. Covers the complete PDF lifecycle from reading and extraction through creation, manipulation, and format conversion with production-ready patterns.

When to Use This Skill

Choose this skill when:

Building production PDF processing pipelines
Extracting content with proper error handling and validation
Creating professional PDF reports and documents
Implementing PDF manipulation workflows (merge, split, watermark)
Converting between PDF and other document formats

Consider alternatives when:

Need AI-assisted PDF analysis → use a PDF Anthropic skill
Working with DOCX files → use a DOCX skill
Need web-based PDF rendering → use PDF.js
Creating spreadsheets → use an XLSX skill

Quick Start


# PDF extraction that flags image-only pages for OCR
import pdfplumber
from pathlib import Path

def robust_extract(pdf_path: str) -> dict:
    """Extract text, tables, and metadata; flag pages that yield neither text nor tables."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    result = {'pages': [], 'metadata': {}, 'tables': []}

    with pdfplumber.open(pdf_path) as pdf:
        result['metadata'] = pdf.metadata or {}

        for i, page in enumerate(pdf.pages):
            page_data = {
                'number': i + 1,
                'text': page.extract_text() or '',
                'tables': page.extract_tables() or [],
                'dimensions': {
                    'width': float(page.width),
                    'height': float(page.height),
                },
            }

            if not page_data['text'] and not page_data['tables']:
                page_data['needs_ocr'] = True

            result['pages'].append(page_data)
            result['tables'].extend(page_data['tables'])

    return result
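
The `needs_ocr` flag set above is only a marker: nothing in `robust_extract` runs OCR itself. As a minimal sketch of how a caller might use the returned dictionary, the helper below counts pages, characters, and tables and collects the page numbers to route to an OCR step. The name `summarize_extraction` is hypothetical; it relies only on the dictionary shape shown above.

```python
def summarize_extraction(result: dict) -> dict:
    """Summarize a robust_extract() result (helper name is illustrative)."""
    pages = result.get('pages', [])
    # Pages that robust_extract flagged as having neither text nor tables.
    ocr_pages = [p['number'] for p in pages if p.get('needs_ocr')]
    return {
        'page_count': len(pages),
        'char_count': sum(len(p.get('text', '')) for p in pages),
        'table_count': len(result.get('tables', [])),
        'ocr_pages': ocr_pages,
    }
```

A non-empty `ocr_pages` list is the signal to hand those pages to whatever OCR tool the pipeline uses.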

Core Concepts

PDF Processing Capabilities

Capability	Library	Method
Text extraction	pdfplumber	`page.extract_text()`
Table extraction	pdfplumber	`page.extract_tables()`
Image extraction	pdfplumber/pymupdf	`page.images` (image metadata) / `page.get_pixmap()` (page rendering)
Metadata reading	PyPDF2	`reader.metadata`
Page merging	PyPDF2	`writer.add_page()`
Page splitting	PyPDF2	Iterate `reader.pages`
Watermarking	reportlab + PyPDF2	Overlay watermark page
Form filling	PyPDF2/pdfrw	Update form field values
Encryption	PyPDF2	`writer.encrypt()`
PDF creation	reportlab	`SimpleDocTemplate`
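
The merge and split rows of the table can be sketched as follows. This is one possible arrangement rather than a prescribed one: the function names, the `chunk_size` parameter, and the `_partN.pdf` naming are illustrative assumptions. Only `PdfReader`, `PdfWriter`, `reader.pages`, `writer.add_page()`, and `writer.write()` come from PyPDF2.

```python
from pathlib import Path


def page_ranges(n_pages: int, chunk_size: int) -> list[range]:
    """Split page indices 0..n_pages-1 into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, n_pages))
            for start in range(0, n_pages, chunk_size)]


def split_pdf(input_path: str, output_dir: str, chunk_size: int = 10) -> list[Path]:
    """Write each chunk of pages to its own PDF and return the output paths."""
    # Imported here so page_ranges() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    chunks = page_ranges(len(reader.pages), chunk_size)
    for part, indices in enumerate(chunks, start=1):
        writer = PdfWriter()
        for index in indices:
            writer.add_page(reader.pages[index])
        out_path = out_dir / f"{Path(input_path).stem}_part{part}.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        outputs.append(out_path)
    return outputs


def merge_pdfs(input_paths: list[str], output_path: str) -> None:
    """Concatenate the pages of several PDFs in the given order."""
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        for page in PdfReader(path).pages:
            writer.add_page(page)
    with open(output_path, "wb") as f:
        writer.write(f)
```

Keeping the range arithmetic in its own function makes the chunk boundaries easy to check without touching any PDF files.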

Watermarking Pattern


from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from io import BytesIO

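def add_watermark(input_path: str, output_path: str, text: str = 'CONFIDENTIAL'):
    """Stamp diagonal text on every page (the overlay is laid out for A4 pages)."""
    # Draw the watermark once on an in-memory, single-page PDF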
    buffer = BytesIO()
    c = canvas.Canvas(buffer, pagesize=A4)
    c.setFont('Helvetica', 50)
    c.setFillAlpha(0.15)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.translate(A4[0]/2, A4[1]/2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.save()
    buffer.seek(0)

    watermark = PdfReader(buffer)
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)

    with open(output_path, 'wb') as f:
        writer.write(f)

Batch Processing Pipeline


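import json
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path

def process_one(pdf_path: Path, output_dir: str, operations: list[str]) -> dict:
    """Extract one PDF and write the requested outputs; errors are returned, not raised."""
    # Defined at module level because ProcessPoolExecutor pickles the callable
    # it sends to worker processes, and nested functions cannot be pickled.
    try:
        result = robust_extract(str(pdf_path))  # defined in Quick Start
        out_dir = Path(output_dir)

        if 'extract_text' in operations:
            text = '\n'.join(p['text'] for p in result['pages'])
            (out_dir / f'{pdf_path.stem}.txt').write_text(text, encoding='utf-8')

        if 'extract_tables' in operations:
            tables_path = out_dir / f'{pdf_path.stem}_tables.json'
            tables_path.write_text(json.dumps(result['tables'], indent=2), encoding='utf-8')

        return {'file': pdf_path.name, 'status': 'success', 'pages': len(result['pages'])}
    except Exception as e:
        return {'file': pdf_path.name, 'status': 'error', 'error': str(e)}

def batch_process_pdfs(input_dir: str, output_dir: str, operations: list[str]) -> list[dict]:
    """Process multiple PDFs with specified operations."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    pdf_files = sorted(Path(input_dir).glob('*.pdf'))

    # Bind the shared arguments so executor.map only has to supply the path.
    worker = partial(process_one, output_dir=output_dir, operations=operations)

    # On platforms that spawn workers (Windows, macOS), call this function
    # from under an `if __name__ == '__main__':` guard.
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(worker, pdf_files))

    return results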
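
Because each worker returns a status dictionary instead of raising, the caller still has to decide what to do with failures. The helper below is a small sketch of that step. The name `summarize_batch` is hypothetical, and it assumes only the `file`, `status`, `pages`, and `error` keys produced by the pipeline above.

```python
from collections import Counter


def summarize_batch(results: list[dict]) -> dict:
    """Aggregate per-file status dicts into one report (illustrative helper)."""
    counts = Counter(r['status'] for r in results)
    # Map each failed file to its error message for logging or manual review.
    failures = {r['file']: r['error'] for r in results if r['status'] == 'error'}
    return {
        'total': len(results),
        'succeeded': counts.get('success', 0),
        'failed': counts.get('error', 0),
        'total_pages': sum(r.get('pages', 0) for r in results),
        'failures': failures,
    }
```

A pipeline could, for example, exit with a non-zero status whenever `failed` is greater than zero.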

Configuration

Parameter	Type	Default	Description
`extractionLib`	string	`'pdfplumber'`	Text extraction: pdfplumber or pymupdf
`manipulationLib`	string	`'PyPDF2'`	PDF manipulation: PyPDF2 or pypdf
`creationLib`	string	`'reportlab'`	PDF creation: reportlab or fpdf2
`ocrFallback`	boolean	`true`	Enable OCR for image-only pages
`parallelWorkers`	number	`4`	Concurrent workers for batch processing
`maxFileSize`	number	`100`	Max PDF size in MB
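
The document does not say how these parameters are loaded, so the following is only a sketch of one way to carry them in Python with basic validation. The class name `ToolkitConfig`, the snake_case field names, and the reading of a megabyte as 1024 × 1024 bytes are assumptions; the defaults and allowed values mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolkitConfig:
    """Python-side mirror of the configuration table (names are illustrative)."""
    extraction_lib: str = 'pdfplumber'
    manipulation_lib: str = 'PyPDF2'
    creation_lib: str = 'reportlab'
    ocr_fallback: bool = True
    parallel_workers: int = 4
    max_file_size_mb: int = 100

    def __post_init__(self):
        # Allowed values are taken from the Description column of the table.
        allowed = {
            'extraction_lib': {'pdfplumber', 'pymupdf'},
            'manipulation_lib': {'PyPDF2', 'pypdf'},
            'creation_lib': {'reportlab', 'fpdf2'},
        }
        for field_name, choices in allowed.items():
            value = getattr(self, field_name)
            if value not in choices:
                raise ValueError(
                    f"{field_name} must be one of {sorted(choices)}, got {value!r}")
        if self.parallel_workers < 1:
            raise ValueError("parallel_workers must be at least 1")
        if self.max_file_size_mb <= 0:
            raise ValueError("max_file_size_mb must be positive")

    def accepts(self, size_bytes: int) -> bool:
        """Return True if a file of this size is within the configured limit."""
        return size_bytes <= self.max_file_size_mb * 1024 * 1024
```

Validating at construction time means a typo in a library name fails immediately rather than midway through a batch.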

Best Practices

Use pdfplumber for extraction, PyPDF2 for manipulation, reportlab for creation: Each library is built around one job. pdfplumber handles layout-aware text and table extraction, PyPDF2 handles page-level operations such as merging, splitting, and encryption, and reportlab generates new documents. PyPDF2 is no longer actively developed; pypdf is its maintained successor and is worth considering for new projects.
Implement a processing pipeline with error recovery: Production PDF processing encounters corrupt files, password-protected PDFs, and encoding issues. Wrap each step in try/except, log failures, and continue processing the remaining files.
Validate extracted data before downstream use: PDF text extraction is lossy. Verify that extracted amounts, dates, and identifiers match expected patterns. Flag anomalies for manual review rather than silently processing incorrect data.
Process pages as streams for large PDFs: Holding every page of a 500-page PDF in memory at once can exhaust RAM. Process pages one at a time and write output incrementally.
Test with diverse PDF sources: PDFs created by different applications (Word, LaTeX, web browsers, scanners) have different internal structures. Test your extraction code with PDFs from every source you expect to handle.
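
The validation practice above can be sketched with a few pattern checks. Everything specific here is an assumption chosen for illustration: the record keys, the ISO date format, and especially the `INV-` identifier format, which is hypothetical. A real pipeline would substitute the patterns its own documents use.

```python
import re
from datetime import datetime

# Plain or comma-grouped digits with an optional two-digit decimal part.
AMOUNT_RE = re.compile(r'-?(\d{1,3}(,\d{3})+|\d+)(\.\d{2})?')
# Hypothetical identifier format, used only for this example.
INVOICE_ID_RE = re.compile(r'INV-\d{6}')


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []

    amount = str(record.get('amount') or '')
    if not AMOUNT_RE.fullmatch(amount):
        problems.append(f"amount looks malformed: {amount!r}")

    date_text = str(record.get('date') or '')
    try:
        # strptime also rejects impossible dates such as 2024-02-30.
        datetime.strptime(date_text, '%Y-%m-%d')
    except ValueError:
        problems.append(f"date is not a valid YYYY-MM-DD value: {date_text!r}")

    invoice_id = str(record.get('invoice_id') or '')
    if not INVOICE_ID_RE.fullmatch(invoice_id):
        problems.append(f"invoice_id does not match the expected pattern: {invoice_id!r}")

    return problems
```

Records with a non-empty problem list can be queued for manual review while clean records continue downstream.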

Common Issues

Password-protected PDFs raise errors: PyPDF2 raises PdfReadError (or a subclass of it) when pages of an encrypted PDF are accessed before decryption. Check reader.is_encrypted first and call reader.decrypt() with the provided password before processing.
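
A minimal sketch of that check follows. The function name `ensure_decrypted` and the choice of `PermissionError` are assumptions; the function only expects an object that behaves like PyPDF2's `PdfReader`, with an `is_encrypted` attribute and a `decrypt(password)` method that returns a falsy value when the password is rejected.

```python
def ensure_decrypted(reader, password=None):
    """Return `reader` ready for page access, decrypting it first if needed.

    `reader` is expected to behave like PyPDF2.PdfReader.
    """
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise PermissionError("PDF is encrypted and no password was provided")
    # decrypt() returns a falsy value (0) when the password does not match.
    if not reader.decrypt(password):
        raise PermissionError("PDF is encrypted and the password was rejected")
    return reader
```

Typical use would be `reader = ensure_decrypted(PdfReader(path), password)` before iterating over `reader.pages`, so that a missing or wrong password surfaces as one clear error at the start.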

Text extraction returns text in the wrong order: PDFs do not store text in reading order. pdfplumber reconstructs order from character positions, which can still interleave columns in complex multi-column layouts. Sort extracted text blocks by position (top-to-bottom, left-to-right) for a more natural ordering.
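
The position-based sort can be sketched on the word dictionaries that pdfplumber's `page.extract_words()` returns, each of which carries `text`, `x0`, and `top` keys. The helper name `words_to_lines` and the default `y_tolerance` of 3 points are assumptions made for illustration.

```python
def words_to_lines(words: list[dict], y_tolerance: float = 3.0) -> list[str]:
    """Group word boxes into lines (top-to-bottom), each ordered left-to-right."""
    lines: list[list[dict]] = []
    for word in sorted(words, key=lambda w: (w['top'], w['x0'])):
        # A word joins the current line if its top edge is within the
        # tolerance of that line's first word; otherwise it starts a new line.
        if lines and abs(word['top'] - lines[-1][0]['top']) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Small vertical jitter can put words out of horizontal order, so each
    # line is re-sorted by its left edge before joining.
    return [' '.join(w['text'] for w in sorted(line, key=lambda w: w['x0']))
            for line in lines]
```

This does not separate columns by itself. For multi-column pages, one option is to crop each column with `page.crop()` and run the helper on each crop in turn.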

Generated PDFs are very large: Embedded images without compression inflate file size. Compress images before embedding, using JPEG for photos and PNG for diagrams. Set image resolution to 150 DPI for screen viewing and 300 DPI for print.
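
The DPI guidance translates into a pixel budget: PDF layout uses 72 points per inch, so an image placed at a given size in points needs `points / 72 * dpi` pixels along each side. The sketch below computes that budget and downsamples a photo with Pillow before embedding. Pillow is an assumption here (the document does not name an imaging library), as are the function names and the JPEG quality of 80.

```python
def target_pixels(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel size needed to show an image at width_pt x height_pt points."""
    # PDF user space uses 72 points per inch.
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def compress_photo(src: str, dst: str, width_pt: float, height_pt: float,
                   dpi: int = 150, quality: int = 80) -> None:
    """Downsample a photo to the target DPI and save it as a JPEG."""
    # Imported here so target_pixels() stays usable without Pillow installed.
    from PIL import Image

    max_size = target_pixels(width_pt, height_pt, dpi)
    with Image.open(src) as img:
        rgb = img.convert('RGB')   # JPEG has no alpha channel
        rgb.thumbnail(max_size)    # shrinks in place, keeps the aspect ratio
        rgb.save(dst, format='JPEG', quality=quality, optimize=True)
```

For diagrams and line art, saving as PNG instead of JPEG avoids compression artifacts around sharp edges.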
