PDF Processing Pro Toolkit
A production-ready PDF processing skill with pre-built scripts, comprehensive error handling, and support for complex document workflows including OCR, template matching, document classification, and automated data extraction.
When to Use This Skill
Choose this skill when:
- Building production document processing systems
- Classifying and routing different document types automatically
- Extracting data from invoices, receipts, and forms at scale
- Implementing OCR with post-processing and validation
- Creating document processing pipelines with monitoring
Consider alternatives when:
- Simple one-off text extraction → use a basic PDF skill
- Creating PDFs → use a PDF creation skill
- Working with Word documents → use a DOCX skill
- Need browser-based PDF viewing → use PDF.js
Quick Start
```python
# Production PDF processing pipeline
from dataclasses import dataclass
from enum import Enum
import logging

import pdfplumber

logger = logging.getLogger(__name__)


class DocumentType(Enum):
    INVOICE = 'invoice'
    RECEIPT = 'receipt'
    CONTRACT = 'contract'
    REPORT = 'report'
    UNKNOWN = 'unknown'


@dataclass
class ProcessingResult:
    file_path: str
    doc_type: DocumentType
    pages: int
    extracted_data: dict
    confidence: float
    errors: list[str]


def classify_document(text: str) -> tuple[DocumentType, float]:
    """Classify document type from extracted text."""
    keywords = {
        DocumentType.INVOICE: ['invoice', 'bill to', 'due date', 'total amount'],
        DocumentType.RECEIPT: ['receipt', 'payment received', 'transaction'],
        DocumentType.CONTRACT: ['agreement', 'parties', 'terms and conditions'],
        DocumentType.REPORT: ['report', 'executive summary', 'findings'],
    }
    scores = {}
    text_lower = text.lower()
    for doc_type, words in keywords.items():
        score = sum(1 for w in words if w in text_lower) / len(words)
        scores[doc_type] = score
    best_type = max(scores, key=scores.get)
    confidence = scores[best_type]
    if confidence < 0.3:
        return DocumentType.UNKNOWN, confidence
    return best_type, confidence
```
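To see the keyword-scoring idea in isolation, here is a condensed, self-contained sketch of the invoice branch; the sample text and keyword list are illustrative, not part of the toolkit:

```python
# Condensed version of the keyword scoring used by classify_document (illustrative)
sample = "INVOICE #2024-117\nBill To: Acme Corp\nDue Date: 2024-07-01\nTotal Amount: $1,250.00"
invoice_keywords = ['invoice', 'bill to', 'due date', 'total amount']

# Score is the fraction of keywords present in the lowercased text
score = sum(1 for w in invoice_keywords if w in sample.lower()) / len(invoice_keywords)
print(score)  # 1.0
```

Because all four keywords appear, the score is 1.0; a document matching fewer than 30% of any type's keywords would fall through to UNKNOWN.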
Core Concepts
Document Processing Pipeline
| Stage | Component | Output |
|---|---|---|
| Ingestion | File reader + validation | Validated PDF file |
| Classification | Keyword + layout analysis | Document type + confidence |
| Extraction | Type-specific extractor | Structured data dictionary |
| Validation | Schema + business rules | Validated data + error list |
| Output | Formatter + writer | JSON, CSV, or database record |
| Monitoring | Logger + metrics | Processing stats + alerts |
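The stages above can be wired together in a single driver function. The sketch below glues classification, extraction, and validation with stand-in implementations; the function and field names are hypothetical, not part of the toolkit's API:

```python
# Minimal pipeline glue: classification -> extraction -> validation (hypothetical names)
def run_pipeline(file_path: str, text: str, extractors: dict) -> dict:
    # Classification stage: keyword scoring, as in the Quick Start
    keywords = {
        'invoice': ['invoice', 'bill to', 'total amount'],
        'receipt': ['receipt', 'transaction'],
    }
    scores = {t: sum(w in text.lower() for w in ws) / len(ws) for t, ws in keywords.items()}
    doc_type = max(scores, key=scores.get)

    # Extraction stage: dispatch to a type-specific extractor
    extract = extractors.get(doc_type, lambda _: {})
    data = extract(text)

    # Validation stage: here, just flag fields the extractor could not fill
    errors = [field for field, value in data.items() if value is None]
    return {'file': file_path, 'type': doc_type, 'data': data, 'errors': errors}

result = run_pipeline(
    'inv.pdf',
    'Invoice\nBill To: Acme\nTotal Amount: $10',
    {'invoice': lambda t: {'total': '$10' if 'Total' in t else None}},
)
print(result['type'], result['errors'])  # invoice []
```

A production version would add the ingestion, output, and monitoring stages around this core.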
Template-Based Extraction
```python
import re

import pdfplumber


class InvoiceExtractor:
    """Extract structured data from invoices using layout analysis."""

    def extract(self, pdf_path: str) -> dict:
        with pdfplumber.open(pdf_path) as pdf:
            first_page = pdf.pages[0]
            text = first_page.extract_text() or ''
            tables = first_page.extract_tables()
            data = {
                'invoice_number': self._find_pattern(text, r'Invoice\s*#?\s*[:.]?\s*(\S+)'),
                'date': self._find_pattern(text, r'Date\s*[:.]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'),
                'total': self._find_pattern(text, r'Total\s*[:.]?\s*\$?([\d,]+\.?\d*)'),
                'line_items': self._extract_line_items(tables),
            }
            return data

    def _find_pattern(self, text: str, pattern: str) -> str | None:
        match = re.search(pattern, text, re.IGNORECASE)
        return match.group(1) if match else None

    def _extract_line_items(self, tables: list) -> list[dict]:
        items = []
        for table in tables:
            if not table or len(table) < 2:
                continue
            headers = [h.lower() if h else '' for h in table[0]]
            if any('description' in h or 'item' in h for h in headers):
                for row in table[1:]:
                    if any(cell for cell in row):
                        items.append(dict(zip(headers, row)))
        return items
```
Error Recovery and Monitoring
```python
import time
from collections import Counter


class ProcessingMonitor:
    def __init__(self):
        self.stats = Counter()
        self.errors = []
        self.start_time = time.time()

    def record_success(self, doc_type: str):
        self.stats['total'] += 1
        self.stats['success'] += 1
        self.stats[f'type:{doc_type}'] += 1

    def record_error(self, file_path: str, error: str):
        self.stats['total'] += 1
        self.stats['errors'] += 1
        self.errors.append({'file': file_path, 'error': error, 'time': time.time()})

    def report(self) -> dict:
        elapsed = time.time() - self.start_time
        return {
            'total_processed': self.stats['total'],
            'successful': self.stats['success'],
            'failed': self.stats['errors'],
            'success_rate': self.stats['success'] / max(self.stats['total'], 1),
            'elapsed_seconds': round(elapsed, 2),
            'docs_per_second': round(self.stats['total'] / max(elapsed, 1), 2),
            'errors': self.errors[-10:],  # Last 10 errors
        }
```
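The success-rate arithmetic in `report()` can be checked on its own. This sketch replays a small batch through the same `Counter` bookkeeping:

```python
from collections import Counter

stats = Counter()
for outcome in ['ok', 'ok', 'ok', 'error']:
    stats['total'] += 1
    stats['success' if outcome == 'ok' else 'errors'] += 1

# Same computation as ProcessingMonitor.report(); max(..., 1) avoids division by zero
success_rate = stats['success'] / max(stats['total'], 1)
print(success_rate)  # 0.75
```

The `max(..., 1)` guard matters: calling `report()` before any document is processed would otherwise divide by zero.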
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| classificationMethod | string | 'keyword' | Method: keyword, layout, or ml-model |
| extractorMap | object | {} | Document type → extractor class mapping |
| ocrEnabled | boolean | true | Enable OCR for image-based pages |
| validationStrict | boolean | false | Fail on validation errors vs warn |
| batchSize | number | 50 | Documents per processing batch |
| retryOnError | number | 2 | Retry attempts for failed documents |
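Assuming the parameters above are supplied as a plain mapping (the listing does not specify the configuration mechanism), a settings block might look like this, using the documented defaults with one mapping filled in:

```python
# Illustrative configuration using the documented defaults
config = {
    'classificationMethod': 'keyword',                # or 'layout', 'ml-model'
    'extractorMap': {'invoice': 'InvoiceExtractor'},  # document type -> extractor class
    'ocrEnabled': True,
    'validationStrict': False,                        # warn rather than fail on validation errors
    'batchSize': 50,
    'retryOnError': 2,
}
print(config['batchSize'])  # 50
```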
Best Practices
- Classify documents before extracting data — Different document types require different extraction strategies. Classify first (invoice vs contract vs report), then apply the appropriate extractor for that document type.
- Use regex patterns for structured fields, AI for unstructured content — Invoice numbers, dates, and amounts follow patterns that regex captures reliably. Summaries, descriptions, and context require AI analysis. Use the right tool for each field type.
- Build monitoring into the pipeline from day one — Track processing rate, success rate, error frequency, and processing time per document. Alerts on declining success rates catch issues before they affect downstream systems.
- Implement a dead letter queue for failed documents — Documents that fail processing after retries should be saved to a separate queue for manual review, not silently dropped. Include the error message and the stage at which processing failed.
- Version your extraction templates — Invoice formats change when vendors update their systems. Version extraction patterns and maintain multiple versions for the same document source. Log which template version successfully extracted each document.
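The dead letter queue practice can be as simple as writing one JSON record per failed document. This is a minimal sketch; the directory layout, helper name, and record fields are assumptions, not part of the toolkit:

```python
import json
import tempfile
import time
from pathlib import Path

def send_to_dead_letter(dlq_dir: str, file_path: str, stage: str, error: str) -> Path:
    """Persist failure metadata for manual review instead of silently dropping the document."""
    dlq = Path(dlq_dir)
    dlq.mkdir(parents=True, exist_ok=True)
    # Record both the error message and the pipeline stage that failed
    record = {'file': file_path, 'stage': stage, 'error': error, 'time': time.time()}
    out = dlq / (Path(file_path).stem + '.dlq.json')
    out.write_text(json.dumps(record, indent=2))
    return out

out = send_to_dead_letter(tempfile.mkdtemp(), 'scans/inv_991.pdf', 'extraction', 'no text layer')
print(out.name)  # inv_991.dlq.json
```

In production you would likely use a queue service rather than a directory, but the principle is the same: failures are durable, inspectable, and replayable.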
Common Issues
Extraction accuracy drops when vendor changes invoice layout — Create multiple extraction patterns per vendor and try them in order. Track extraction confidence scores and alert when confidence drops below threshold for a specific vendor.
OCR introduces errors in numeric fields — Common OCR errors: 0 ↔ O, 1 ↔ l, 5 ↔ S. Post-process OCR output with domain-specific validation: amounts should be valid numbers, dates should parse correctly, totals should match line item sums.
Processing pipeline bottlenecked on I/O — PDF reading is I/O-bound while extraction is CPU-bound. Use async I/O for file reading and process pools for CPU-heavy extraction. Pipeline the stages so I/O and CPU work overlap.
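A minimal shape for that overlap uses a thread pool for the I/O stage; for genuinely CPU-heavy extraction you would move the second stage into a `ProcessPoolExecutor`. The file contents here are faked so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def read_pdf_bytes(path: str) -> bytes:
    # I/O-bound stage: in a real pipeline this reads from disk or object storage
    return f'contents of {path}'.encode()

def extract(data: bytes) -> dict:
    # CPU-bound stage: for heavy parsing, run this in a ProcessPoolExecutor instead
    return {'length': len(data)}

paths = [f'doc_{i}.pdf' for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    blobs = list(pool.map(read_pdf_bytes, paths))  # reads overlap in threads
results = [extract(b) for b in blobs]
print(len(results))  # 4
```

Threads suit the read stage because Python releases the GIL during file I/O; the extraction stage only benefits from processes once per-document CPU time dominates.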