PDF Processing System
A streamlined skill for common PDF processing tasks using Python. Covers text extraction, page manipulation, basic creation, and format conversion with minimal setup and clear examples.
When to Use This Skill
Choose this skill when:
- Quickly extracting text from PDF files for processing
- Performing basic PDF operations (merge, split, rotate)
- Converting PDFs to text or images
- Filling simple PDF forms
- Building lightweight PDF processing scripts
Consider alternatives when:
- Needing advanced AI-assisted analysis → use a PDF Anthropic skill
- Creating complex PDF reports → use a reportlab/PDF creation skill
- Processing batches of documents at scale → use a PDF pipeline skill
- Working with scanned documents → use an OCR-focused skill
Quick Start
```python
import pdfplumber

# Extract text from a PDF
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            print(text)
```
```python
# Merge multiple PDFs
from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
for pdf_file in ['part1.pdf', 'part2.pdf', 'part3.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open('combined.pdf', 'wb') as output:
    writer.write(output)
```
Core Concepts
Common Operations
| Task | Code |
|---|---|
| Extract text | pdfplumber.open(f).pages[0].extract_text() |
| Page count | len(PdfReader(f).pages) |
| Merge PDFs | Loop: writer.add_page(page) |
| Split PDF | Each page → separate PdfWriter |
| Rotate page | page.rotate(90) |
| Get metadata | PdfReader(f).metadata |
| Encrypt | writer.encrypt('password') |
| List embedded images | page.images (pdfplumber; returns image metadata per page) |
| PDF to images | pdf2image.convert_from_path(f) |
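The "Split PDF" row can be sketched roughly as follows, assuming PyPDF2 is installed. The helper names `page_filename` and `split_pdf` and the output naming scheme are illustrative assumptions, not part of the skill.

```python
from pathlib import Path


def page_filename(stem: str, index: int, total: int) -> str:
    """Build a zero-padded output name (hypothetical naming scheme)."""
    width = len(str(total))
    return f"{stem}_page_{index + 1:0{width}d}.pdf"


def split_pdf(src: str, out_dir: str) -> list:
    """Write each page of src to its own file and return the output paths."""
    # Imported lazily so the naming helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    total = len(reader.pages)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for index, page in enumerate(reader.pages):
        writer = PdfWriter()  # one writer per page, as in the table row
        writer.add_page(page)
        target = out / page_filename(Path(src).stem, index, total)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling `split_pdf("report.pdf", "pages")` would then write one file per page into a `pages` directory; both names are placeholders.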
Text Processing Pipeline
```python
import pdfplumber


def process_pdf_text(pdf_path: str) -> dict:
    """Extract and structure text content from a PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        pages = []
        for i, page in enumerate(pdf.pages):
            # extract_text() can return None for image-only pages
            text = page.extract_text() or ''
            tables = page.extract_tables() or []
            pages.append({
                'page_number': i + 1,
                'text': text,
                'word_count': len(text.split()),
                'has_tables': len(tables) > 0,
                'table_count': len(tables),
            })

    return {
        'total_pages': len(pages),
        'total_words': sum(p['word_count'] for p in pages),
        'pages': pages,
    }
```
Format Conversion
```bash
# PDF to text
pdftotext document.pdf output.txt

# PDF to images (one per page)
pdftoppm document.pdf output -png -r 300

# PDF to HTML
pdftohtml document.pdf output.html

# Images to PDF
img2pdf *.png -o output.pdf

# HTML to PDF
wkhtmltopdf page.html output.pdf
```
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| extractionLib | string | 'pdfplumber' | Library: pdfplumber or PyPDF2 |
| dpi | number | 300 | DPI for PDF-to-image conversion |
| encoding | string | 'utf-8' | Output text encoding |
| pageRange | string | 'all' | Pages to process: all, 1-5, or specific numbers |
| preserveLayout | boolean | false | Attempt to preserve spatial text layout |
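The table does not say how a `pageRange` value is parsed. One possible interpretation is sketched below. It assumes that "specific numbers" means a comma-separated list, that pages are 1-based in the setting, and that pages beyond the end of the document are silently dropped. The function name `parse_page_range` is hypothetical.

```python
def parse_page_range(spec: str, total_pages: int) -> list:
    """Turn a pageRange value into sorted zero-based page indices."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(total_pages))

    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length; the spec is 1-based, indices are 0-based.
        for number in range(start, min(end, total_pages) + 1):
            indices.add(number - 1)
    return sorted(indices)
```

The returned indices can be used directly against a page list, for example `[pdf.pages[i] for i in parse_page_range("1-5", len(pdf.pages))]`.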
Best Practices
- Use pdfplumber for reading and PyPDF2 for writing. pdfplumber is stronger at text and table extraction; PyPDF2 is better suited to structural operations (merge, split, encrypt, rotate).
- Check for empty text before processing. Some PDF pages are images with no extractable text. Always verify that `page.extract_text()` returns non-empty content before downstream processing.
- Handle exceptions per page rather than per document. One corrupted page shouldn't prevent processing the other 99. Wrap individual page processing in try/except and log errors per page.
- Use CLI tools for batch format conversion. Python libraries work for individual files, but `pdftotext`, `pdftoppm`, and `wkhtmltopdf` are faster and more reliable for batch operations.
- Close PDF files explicitly or use context managers. An open PDF holds its underlying file handle until it is closed. Always use `with` statements or explicitly close readers to prevent file locking issues.
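The empty-text check and the per-page error handling from the list above can be combined in one loop. The following is a minimal sketch, not a definitive implementation: `extract_pages_safely` is a hypothetical helper, and it accepts any iterable of objects with an `extract_text()` method, such as pdfplumber's `pdf.pages`.

```python
import logging

logger = logging.getLogger(__name__)


def extract_pages_safely(pages):
    """Extract text page by page, skipping empty pages and isolating failures."""
    texts = {}    # page number -> extracted text
    skipped = []  # pages with no extractable text (often scanned images)
    failed = {}   # page number -> error message

    for number, page in enumerate(pages, start=1):
        try:
            text = page.extract_text()
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            logger.warning("page %d failed: %s", number, exc)
            failed[number] = str(exc)
            continue

        if text and text.strip():
            texts[number] = text
        else:
            skipped.append(number)

    return texts, skipped, failed
```

Inside a `with pdfplumber.open(path) as pdf:` block, `extract_pages_safely(pdf.pages)` returns the usable text alongside the page numbers that were empty or failed, so they can be sent to an OCR step or reported.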
Common Issues
pdfplumber crashes on specific PDFs — Some PDFs have malformed internal structures. Wrap extraction in try/except and fall back to PyPDF2 or CLI tools (pdftotext) for problematic files.
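A fallback chain along these lines might look like the sketch below. It assumes pdfplumber, PyPDF2, and the `pdftotext` CLI are available; any one that is missing simply causes that step to fail over to the next. All function names are illustrative.

```python
import subprocess


def with_pdfplumber(path: str) -> str:
    import pdfplumber  # imported lazily so a missing library only disables this step

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def with_pypdf2(path: str) -> str:
    from PyPDF2 import PdfReader

    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def with_pdftotext(path: str) -> str:
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def extract_with_fallback(path, extractors=(with_pdfplumber, with_pypdf2, with_pdftotext)):
    """Return (extractor_name, text) from the first extractor that yields text."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # ImportError, parse errors, missing CLI, ...
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text.strip():
            return extractor.__name__, text
    raise RuntimeError("all extractors failed or returned no text: " + "; ".join(errors))
```

Returning the extractor name alongside the text makes it easy to log which files needed a fallback.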
Text extraction misses content in text boxes or annotations — pdfplumber extracts body text but may miss text in annotations, comments, or form fields. Check page.annots for annotation content.
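As a rough illustration of reading `page.annots`, the sketch below collects annotation text per page. It assumes each annotation is a dict exposing `contents` and `uri` keys, which should be checked against the installed pdfplumber version; values may also need decoding for some files. The helper name `collect_annotation_text` is hypothetical.

```python
def collect_annotation_text(pages):
    """Gather non-empty annotation values, keyed by 1-based page number."""
    found = {}
    for number, page in enumerate(pages, start=1):
        notes = []
        for annot in getattr(page, "annots", None) or []:
            # Assumed keys: "contents" for the note body, "uri" for link targets.
            for key in ("contents", "uri"):
                value = annot.get(key)
                if value:
                    notes.append(str(value))
        if notes:
            found[number] = notes
    return found
```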
Large PDFs consume excessive memory — Process pages one at a time rather than loading all pages simultaneously. Use generators for lazy page processing.
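Lazy page processing with a generator could be sketched as follows, assuming pdfplumber is installed. `iter_page_text` and `count_words` are hypothetical names; the sketch only controls what the calling code keeps in memory, not any caching the library does internally.

```python
def iter_page_text(pdf_path: str):
    """Yield (page_number, text) lazily, one page at a time."""
    import pdfplumber  # assumed installed; imported here to keep the sketch importable

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            yield number, page.extract_text() or ""


def count_words(page_texts) -> int:
    """Consume any (page_number, text) iterable without storing the pages."""
    return sum(len(text.split()) for _, text in page_texts)
```

`count_words(iter_page_text("large.pdf"))` then walks the document once, holding only the current page's text at any moment; the file name is a placeholder.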