PDF System
Battle-tested skill for comprehensive PDF manipulation and extraction. Includes structured workflows, validation checks, and reusable patterns for document processing.
A practical skill for PDF processing including creation, text extraction, merging, splitting, form filling, and conversion. Covers both Python and command-line approaches for common PDF operations.
When to Use This Skill
Choose this skill when:
- Extracting text and metadata from PDF documents
- Creating PDF reports with tables, charts, and formatting
- Merging, splitting, or rotating PDF files
- Filling PDF forms programmatically
- Converting between PDF and other formats (images, HTML, DOCX)
Consider alternatives when:
- Working with DOCX files → use a DOCX skill
- Creating presentations → use a PPTX skill
- Building a web-based PDF viewer → use a PDF.js skill
- Need OCR for scanned documents → use an OCR skill
Quick Start
    # Install PDF tools
    pip install PyPDF2 reportlab pdfplumber

    # CLI tools
    brew install poppler  # pdftotext, pdfimages
    brew install qpdf     # merge, split, encrypt
    import pdfplumber

    # Extract text from PDF
    def extract_text(pdf_path: str) -> str:
        with pdfplumber.open(pdf_path) as pdf:
            text = ''
            for page in pdf.pages:
                # extract_text() can return None for pages without a text layer
                text += page.extract_text() or ''
                text += '\n\n'
            return text.strip()

    # Extract tables from PDF
    def extract_tables(pdf_path: str) -> list:
        with pdfplumber.open(pdf_path) as pdf:
            all_tables = []
            for page in pdf.pages:
                tables = page.extract_tables()
                all_tables.extend(tables)
            return all_tables
Core Concepts
PDF Operations Matrix
| Operation | Python Library | CLI Tool |
|---|---|---|
| Read text | pdfplumber, PyPDF2 | pdftotext (poppler) |
| Read tables | pdfplumber, camelot | tabula-java |
| Create PDF | reportlab, fpdf2 | wkhtmltopdf |
| Merge PDFs | PyPDF2 | qpdf, pdfunite |
| Split PDF | PyPDF2 | qpdf |
| Form filling | PyPDF2, pdfrw | pdftk |
| Encrypt | PyPDF2 | qpdf |
| Images to PDF | Pillow + reportlab | img2pdf |
| PDF to images | pdf2image | pdftoppm |
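The CLI column can also be driven from Python. The following is a minimal sketch, assuming poppler's pdftotext is installed and on PATH; the helper names build_pdftotext_cmd and pdftotext_output are illustrative and not part of any library.

```python
import subprocess
from typing import Optional


def build_pdftotext_cmd(pdf_path: str, first: Optional[int] = None,
                        last: Optional[int] = None,
                        layout: bool = False) -> list[str]:
    """Build the argument list for poppler's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")      # keep the physical layout of the page
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert (1-based)
    if last is not None:
        cmd += ["-l", str(last)]   # last page to convert
    cmd += [pdf_path, "-"]         # "-" sends the extracted text to stdout
    return cmd


def pdftotext_output(pdf_path: str, **options) -> str:
    """Run pdftotext and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes the arguments easy to inspect or log before anything is executed.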
PDF Report Generation
    from reportlab.lib import colors
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer

    def create_report(output_path: str, title: str, data: list[list]):
        doc = SimpleDocTemplate(output_path, pagesize=A4)
        styles = getSampleStyleSheet()
        elements = []

        # Title
        elements.append(Paragraph(title, styles['Title']))
        elements.append(Spacer(1, 20))

        # Table: the first row of data is styled as the header
        table = Table(data)
        table.setStyle(TableStyle([
            ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
            ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
            ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
            ('FONTSIZE', (0, 0), (-1, 0), 12),
            ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
            ('GRID', (0, 0), (-1, -1), 1, colors.black),
            ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
        ]))
        elements.append(table)

        doc.build(elements)

    # Usage
    create_report('report.pdf', 'Sales Report Q1', [
        ['Product', 'Units', 'Revenue'],
        ['Widget A', '1,200', '$36,000'],
        ['Widget B', '800', '$24,000'],
    ])
PDF Manipulation
    import os
    from PyPDF2 import PdfReader, PdfWriter

    def merge_pdfs(paths: list[str], output: str):
        # Append every page of every input, in the order given
        writer = PdfWriter()
        for path in paths:
            reader = PdfReader(path)
            for page in reader.pages:
                writer.add_page(page)
        with open(output, 'wb') as f:
            writer.write(f)

    def split_pdf(path: str, output_dir: str):
        # Write each page to its own file: page_1.pdf, page_2.pdf, ...
        os.makedirs(output_dir, exist_ok=True)  # create the target directory if missing
        reader = PdfReader(path)
        for i, page in enumerate(reader.pages):
            writer = PdfWriter()
            writer.add_page(page)
            with open(os.path.join(output_dir, f'page_{i+1}.pdf'), 'wb') as f:
                writer.write(f)

    def encrypt_pdf(path: str, password: str, output: str):
        # Copy all pages, then protect the copy with a user password
        reader = PdfReader(path)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        writer.encrypt(password)
        with open(output, 'wb') as f:
            writer.write(f)
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| extractionEngine | string | 'pdfplumber' | Text extraction: pdfplumber, PyPDF2, or poppler |
| creationEngine | string | 'reportlab' | PDF creation: reportlab or fpdf2 |
| pageSize | string | 'A4' | Default page size: A4, Letter, or Legal |
| ocrFallback | boolean | false | Use OCR when text extraction fails |
| imageFormat | string | 'png' | Format for PDF-to-image conversion |
| imageDPI | number | 300 | DPI for image conversion |
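One way to carry these parameters through Python code is a small validated container. This is a sketch only: the PdfConfig class and its snake_case field names are assumptions made for illustration, mapped one-to-one onto the parameters in the table above.

```python
from dataclasses import dataclass

# Allowed values, taken from the Description column of the table
EXTRACTION_ENGINES = {"pdfplumber", "PyPDF2", "poppler"}
CREATION_ENGINES = {"reportlab", "fpdf2"}
PAGE_SIZES = {"A4", "Letter", "Legal"}


@dataclass(frozen=True)
class PdfConfig:
    """Hypothetical settings object mirroring the configuration table."""
    extraction_engine: str = "pdfplumber"  # extractionEngine
    creation_engine: str = "reportlab"     # creationEngine
    page_size: str = "A4"                  # pageSize
    ocr_fallback: bool = False             # ocrFallback
    image_format: str = "png"              # imageFormat
    image_dpi: int = 300                   # imageDPI

    def __post_init__(self):
        # Reject values the table does not list
        if self.extraction_engine not in EXTRACTION_ENGINES:
            raise ValueError(f"unknown extraction engine: {self.extraction_engine}")
        if self.creation_engine not in CREATION_ENGINES:
            raise ValueError(f"unknown creation engine: {self.creation_engine}")
        if self.page_size not in PAGE_SIZES:
            raise ValueError(f"unknown page size: {self.page_size}")
        if self.image_dpi <= 0:
            raise ValueError("image_dpi must be positive")
```

Validating at construction time means a typo such as page_size="A3" fails immediately instead of surfacing later during PDF creation.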
Best Practices
- Use pdfplumber for text extraction, not PyPDF2: pdfplumber handles complex layouts, multi-column text, and tables significantly better than PyPDF2's basic text extraction. PyPDF2 is better suited to structural manipulation (merge, split, encrypt).
- Handle scanned PDFs with an OCR fallback: many PDFs are scanned images with no extractable text. Check whether the extracted text is empty and, if so, fall back to Tesseract OCR, for example pytesseract.image_to_string(page_image).
- Use reportlab for complex PDF creation, fpdf2 for simple ones: reportlab offers full layout control with tables, charts, and precise positioning. fpdf2 is simpler for text-heavy documents. Choose based on complexity requirements.
- Process large PDFs page by page, not all at once: loading a 1000-page PDF entirely into memory can exhaust RAM. Process pages iteratively using generators and write output incrementally.
- Validate PDF output by opening it in multiple viewers: PDFs that render correctly in one viewer may break in another. Test generated PDFs in Adobe Reader, Chrome's built-in viewer, and Preview (macOS) to ensure compatibility.
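The OCR fallback and the page-by-page advice can be combined in one generator. The sketch below assumes pdfplumber and pytesseract are installed along with a Tesseract binary; the function names and the 20-character threshold are illustrative choices, not recommendations from either library.

```python
from typing import Iterator


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when almost no text was extracted.

    min_chars is an assumed threshold; tune it for your documents.
    """
    return len(text.strip()) < min_chars


def iter_page_text(pdf_path: str, dpi: int = 300) -> Iterator[str]:
    """Yield text one page at a time, using OCR for pages without a text layer."""
    # Imported lazily so needs_ocr() stays usable without the optional dependencies
    import pdfplumber
    import pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if needs_ocr(text):
                # Render the page to a PIL image and run Tesseract on it
                image = page.to_image(resolution=dpi).original
                text = pytesseract.image_to_string(image)
            yield text
            # Drop the page's cached layout objects before moving on
            page.flush_cache()
```

Because the function is a generator, a caller can write each page to disk as it arrives (for example inside a `with open(...)` block) rather than holding the whole document's text in memory.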
Common Issues
Text extraction returns garbled characters — The PDF may use custom font encoding or non-standard character mappings. Try different extraction libraries (pdfplumber vs poppler's pdftotext). Some PDFs require OCR even though they appear to contain text.
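A rough heuristic can flag output that is probably garbled before it reaches downstream code. This sketch assumes that unmapped glyphs show up as "(cid:N)" placeholders, which is how pdfminer.six (the library pdfplumber builds on) typically reports them, or as U+FFFD replacement characters; the function names and the 10% threshold are illustrative.

```python
import re

# Placeholder emitted for glyphs that have no Unicode mapping
CID_MARKER = re.compile(r"\(cid:\d+\)")


def garbled_ratio(text: str) -> float:
    """Fraction of characters that belong to unmapped-glyph markers."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in CID_MARKER.findall(text))
    replacement_chars = text.count("\ufffd")
    return (cid_chars + replacement_chars) / len(text)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """True when the share of marker characters exceeds the (assumed) threshold."""
    return garbled_ratio(text) > threshold
```

When looks_garbled() returns True, trying a second extraction engine or OCR on that page is a reasonable next step.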
Merged PDF has incorrect page orientation — Source PDFs may have different page sizes or orientations. Check and normalize page dimensions before merging, or rotate pages to match a target orientation.
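One possible normalization is to rotate every landscape page to portrait while merging. The sketch below uses PyPDF2 and reads the page's /Rotate entry directly; the helper names and the "always portrait" policy are assumptions for illustration, and pages with unusual boxes may need extra handling.

```python
def is_landscape(width: float, height: float, rotation: int = 0) -> bool:
    """True when the page is wider than tall as displayed, after /Rotate."""
    if rotation % 180 == 90:
        # A 90 or 270 degree rotation swaps the displayed width and height
        width, height = height, width
    return width > height


def merge_as_portrait(paths: list[str], output: str) -> None:
    """Merge PDFs, turning landscape pages by 90 degrees (illustrative policy)."""
    from PyPDF2 import PdfReader, PdfWriter  # lazy import: optional dependency

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        for page in reader.pages:
            rotation = int(page["/Rotate"]) if "/Rotate" in page else 0
            width = float(page.mediabox.width)
            height = float(page.mediabox.height)
            if is_landscape(width, height, rotation):
                page.rotate(90)  # clockwise, in multiples of 90 degrees
            writer.add_page(page)
    with open(output, "wb") as f:
        writer.write(f)
```

Rotation changes how a page is displayed, not its dimensions, so pages of different sizes (A4 next to Letter, say) still differ after this step.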
Table extraction misaligns columns — PDF tables don't have true table structure — they're just positioned text. pdfplumber's extract_tables() uses heuristics that can fail on complex layouts. Specify explicit table boundaries with table_settings for better accuracy.
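Explicit boundaries can be passed through table_settings. A minimal sketch, assuming the column x-coordinates are already known (for example from inspecting the page visually); the helper names are hypothetical, while the setting keys are pdfplumber's own.

```python
def explicit_column_settings(column_edges: list[float]) -> dict:
    """Build pdfplumber table_settings with fixed column boundaries.

    column_edges are x-coordinates in PDF points (illustrative input).
    """
    if len(column_edges) < 2:
        raise ValueError("need at least two edges to define one column")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": sorted(column_edges),
        "horizontal_strategy": "text",  # infer row boundaries from text alignment
        "snap_tolerance": 3,
    }


def extract_table_in_region(pdf_path: str, page_index: int,
                            bbox: tuple, column_edges: list[float]):
    """Crop to bbox (x0, top, x1, bottom) and extract one table from that region."""
    import pdfplumber  # lazy import: optional dependency

    with pdfplumber.open(pdf_path) as pdf:
        region = pdf.pages[page_index].crop(bbox)
        return region.extract_table(
            table_settings=explicit_column_settings(column_edges)
        )
```

Cropping first keeps headers, footers, and neighboring text out of the heuristics, which often matters as much as the explicit column lines.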