# PDF Official Toolkit
A standardized skill for PDF processing following best practices. Covers the complete PDF lifecycle from reading and extraction through creation, manipulation, and format conversion with production-ready patterns.
## When to Use This Skill
Choose this skill when:
- Building production PDF processing pipelines
- Extracting content with proper error handling and validation
- Creating professional PDF reports and documents
- Implementing PDF manipulation workflows (merge, split, watermark)
- Converting between PDF and other document formats
Consider alternatives when:
- Need AI-assisted PDF analysis → use a PDF Anthropic skill
- Working with DOCX files → use a DOCX skill
- Need web-based PDF rendering → use PDF.js
- Creating spreadsheets → use an XLSX skill
## Quick Start
```python
# Production PDF extraction with fallback chain
import pdfplumber
from pathlib import Path

def robust_extract(pdf_path: str) -> dict:
    """Extract PDF content with multiple strategy fallbacks."""
    path = Path(pdf_path)
    if not path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    result = {'pages': [], 'metadata': {}, 'tables': []}
    with pdfplumber.open(pdf_path) as pdf:
        result['metadata'] = pdf.metadata or {}
        for i, page in enumerate(pdf.pages):
            page_data = {
                'number': i + 1,
                'text': page.extract_text() or '',
                'tables': page.extract_tables() or [],
                'dimensions': {
                    'width': float(page.width),
                    'height': float(page.height),
                },
            }
            # Pages with no extractable text or tables are likely scanned images
            if not page_data['text'] and not page_data['tables']:
                page_data['needs_ocr'] = True
            result['pages'].append(page_data)
            result['tables'].extend(page_data['tables'])
    return result
```
## Core Concepts

### PDF Processing Capabilities
| Capability | Library | Method |
|---|---|---|
| Text extraction | pdfplumber | page.extract_text() |
| Table extraction | pdfplumber | page.extract_tables() |
| Image extraction | pdfplumber/pymupdf | page.images / get_pixmap() |
| Metadata reading | PyPDF2 | reader.metadata |
| Page merging | PyPDF2 | writer.add_page() |
| Page splitting | PyPDF2 | Iterate reader.pages |
| Watermarking | reportlab + PyPDF2 | Overlay watermark page |
| Form filling | PyPDF2/pdfrw | Update form field values |
| Encryption | PyPDF2 | writer.encrypt() |
| PDF creation | reportlab | SimpleDocTemplate |
### Watermarking Pattern
```python
from io import BytesIO

from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4

def add_watermark(input_path: str, output_path: str, text: str = 'CONFIDENTIAL'):
    # Create the watermark page in memory
    buffer = BytesIO()
    c = canvas.Canvas(buffer, pagesize=A4)
    c.setFont('Helvetica', 50)
    c.setFillAlpha(0.15)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.translate(A4[0] / 2, A4[1] / 2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.save()
    buffer.seek(0)

    # Stamp every page with the watermark
    watermark = PdfReader(buffer)
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)
    with open(output_path, 'wb') as f:
        writer.write(f)
```
### Batch Processing Pipeline
```python
import json
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path

def process_one(pdf_path: Path, output_dir: str, operations: list[str]) -> dict:
    """Worker function. Defined at module level so ProcessPoolExecutor can
    pickle it (a nested function fails on spawn-based platforms)."""
    try:
        result = robust_extract(str(pdf_path))
        if 'extract_text' in operations:
            text = '\n'.join(p['text'] for p in result['pages'])
            (Path(output_dir) / f'{pdf_path.stem}.txt').write_text(text)
        if 'extract_tables' in operations:
            tables_path = Path(output_dir) / f'{pdf_path.stem}_tables.json'
            tables_path.write_text(json.dumps(result['tables'], indent=2))
        return {'file': pdf_path.name, 'status': 'success',
                'pages': len(result['pages'])}
    except Exception as e:
        return {'file': pdf_path.name, 'status': 'error', 'error': str(e)}

def batch_process_pdfs(input_dir: str, output_dir: str,
                       operations: list[str]) -> list[dict]:
    """Process multiple PDFs with the specified operations."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    pdf_files = list(Path(input_dir).glob('*.pdf'))
    worker = partial(process_one, output_dir=output_dir, operations=operations)
    with ProcessPoolExecutor(max_workers=4) as executor:
        return list(executor.map(worker, pdf_files))
```
## Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| extractionLib | string | 'pdfplumber' | Text extraction: pdfplumber or pymupdf |
| manipulationLib | string | 'PyPDF2' | PDF manipulation: PyPDF2 or pypdf |
| creationLib | string | 'reportlab' | PDF creation: reportlab or fpdf2 |
| ocrFallback | boolean | true | Enable OCR for image-only pages |
| parallelWorkers | number | 4 | Concurrent workers for batch processing |
| maxFileSize | number | 100 | Maximum PDF size in MB |
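As a sketch, the parameters above could be collected into a validated configuration object. The class name `PdfToolkitConfig` and the validation rules are hypothetical, chosen to mirror the table:

```python
from dataclasses import dataclass

# Allowed library choices, taken from the Description column of the table
ALLOWED_LIBS = {
    'extractionLib': {'pdfplumber', 'pymupdf'},
    'manipulationLib': {'PyPDF2', 'pypdf'},
    'creationLib': {'reportlab', 'fpdf2'},
}

@dataclass
class PdfToolkitConfig:
    extractionLib: str = 'pdfplumber'
    manipulationLib: str = 'PyPDF2'
    creationLib: str = 'reportlab'
    ocrFallback: bool = True
    parallelWorkers: int = 4
    maxFileSize: int = 100  # MB

    def __post_init__(self):
        # Reject unknown library names and nonsensical numeric values early
        for name, allowed in ALLOWED_LIBS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(
                    f'{name} must be one of {sorted(allowed)}, got {value!r}')
        if self.parallelWorkers < 1:
            raise ValueError('parallelWorkers must be >= 1')
        if self.maxFileSize <= 0:
            raise ValueError('maxFileSize must be positive')
```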
## Best Practices
- **Use pdfplumber for extraction, PyPDF2 for manipulation, reportlab for creation** — Each library excels at its specialty. Using the right tool for each operation produces the best results and simplest code.
- **Implement a processing pipeline with error recovery** — Production PDF processing encounters corrupt files, password-protected PDFs, and encoding issues. Wrap each step in try/except, log failures, and continue processing the remaining files.
- **Validate extracted data before downstream use** — PDF text extraction is lossy. Verify that extracted amounts, dates, and identifiers match expected patterns. Flag anomalies for manual review rather than silently processing incorrect data.
- **Process pages as streams for large PDFs** — Loading all pages of a 500-page PDF into memory simultaneously exhausts RAM. Process pages one at a time and write output incrementally.
- **Test with diverse PDF sources** — PDFs created by different applications (Word, LaTeX, web browsers, scanners) have different internal structures. Test your extraction code with PDFs from all expected sources.
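The validation practice above can be sketched as pattern checks on extracted fields. This is a minimal, illustrative example; the amount regex, the accepted date formats, and the `validate_invoice_row` helper are assumptions to adapt to your own documents:

```python
import re
from datetime import datetime

# Matches optionally-dollar-signed amounts like 1234, $1,234 or $1,234.56
AMOUNT_RE = re.compile(r'^\$?\d{1,3}(,\d{3})*(\.\d{2})?$')

def validate_amount(text: str) -> bool:
    return bool(AMOUNT_RE.match(text.strip()))

def validate_date(text: str, formats=('%Y-%m-%d', '%m/%d/%Y')) -> bool:
    """Accept the date if it parses under any of the expected formats."""
    for fmt in formats:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue
    return False

def validate_invoice_row(row: dict) -> list[str]:
    """Return the names of fields that failed validation (empty list = clean),
    so anomalous rows can be routed to manual review instead of silently used."""
    problems = []
    if not validate_amount(row.get('amount', '')):
        problems.append('amount')
    if not validate_date(row.get('date', '')):
        problems.append('date')
    return problems
```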
## Common Issues
- **Password-protected PDFs fail silently** — PyPDF2 raises `PdfReadError` for encrypted PDFs. Check for encryption first with `reader.is_encrypted` and attempt decryption with a provided password before processing.
- **Text extraction returns text in the wrong order** — PDFs don't store text in reading order. pdfplumber uses spatial analysis but can fail with complex multi-column layouts. Sort extracted text blocks by position (top-to-bottom, left-to-right) for more natural ordering.
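The positional sort can be sketched on the word boxes returned by pdfplumber's `page.extract_words()`, each a dict with `text`, `x0` (left edge), and `top` (distance from page top) keys. The helper name and the line-grouping tolerance are assumptions:

```python
def sort_reading_order(words: list[dict], line_tolerance: float = 3) -> str:
    """Group word boxes into lines by vertical position, then sort each line
    left-to-right, approximating top-to-bottom, left-to-right reading order."""
    lines: list[tuple[float, list[dict]]] = []
    for word in sorted(words, key=lambda w: w['top']):
        # Words within line_tolerance points vertically belong to the same line
        if lines and abs(word['top'] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word['top'], [word]))
    return '\n'.join(
        ' '.join(w['text'] for w in sorted(line, key=lambda w: w['x0']))
        for _, line in lines
    )
```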
- **Generated PDFs are very large** — Embedded images without compression inflate file size. Compress images before embedding: use JPEG for photos and PNG for diagrams, and set image DPI to 150 for screen viewing or 300 for print.
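One way to sketch that compression step with Pillow before handing the image to reportlab. The helper name, the 6.5-inch usable page width, and the JPEG quality default are assumptions:

```python
from io import BytesIO

from PIL import Image

def compress_for_pdf(image_source, max_dpi: int = 150,
                     page_width_in: float = 6.5, quality: int = 75) -> BytesIO:
    """Downscale an image so it renders at no more than max_dpi across
    page_width_in inches, then re-encode it as JPEG. Returns a BytesIO
    that can be passed to reportlab's drawImage via ImageReader."""
    img = Image.open(image_source).convert('RGB')
    max_px = int(max_dpi * page_width_in)
    if img.width > max_px:
        ratio = max_px / img.width
        img = img.resize((max_px, int(img.height * ratio)))
    out = BytesIO()
    img.save(out, format='JPEG', quality=quality)
    out.seek(0)
    return out
```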