PDF Processing Pro Toolkit
A production-ready PDF processing skill with pre-built scripts, comprehensive error handling, and support for complex document workflows including OCR, template matching, document classification, and automated data extraction.
When to Use This Skill
Choose this skill when:
- Building production document processing systems
- Classifying and routing different document types automatically
- Extracting data from invoices, receipts, and forms at scale
- Implementing OCR with post-processing and validation
- Creating document processing pipelines with monitoring
Consider alternatives when:
- Simple one-off text extraction → use a basic PDF skill
- Creating PDFs → use a PDF creation skill
- Working with Word documents → use a DOCX skill
- Need browser-based PDF viewing → use PDF.js
Quick Start
```python
# Production PDF processing pipeline
from dataclasses import dataclass
from enum import Enum
import logging

import pdfplumber

logger = logging.getLogger(__name__)


class DocumentType(Enum):
    INVOICE = 'invoice'
    RECEIPT = 'receipt'
    CONTRACT = 'contract'
    REPORT = 'report'
    UNKNOWN = 'unknown'


@dataclass
class ProcessingResult:
    file_path: str
    doc_type: DocumentType
    pages: int
    extracted_data: dict
    confidence: float
    errors: list[str]


def classify_document(text: str) -> tuple[DocumentType, float]:
    """Classify document type from extracted text."""
    keywords = {
        DocumentType.INVOICE: ['invoice', 'bill to', 'due date', 'total amount'],
        DocumentType.RECEIPT: ['receipt', 'payment received', 'transaction'],
        DocumentType.CONTRACT: ['agreement', 'parties', 'terms and conditions'],
        DocumentType.REPORT: ['report', 'executive summary', 'findings'],
    }
    scores = {}
    text_lower = text.lower()
    for doc_type, words in keywords.items():
        score = sum(1 for w in words if w in text_lower) / len(words)
        scores[doc_type] = score
    best_type = max(scores, key=scores.get)
    confidence = scores[best_type]
    if confidence < 0.3:
        return DocumentType.UNKNOWN, confidence
    return best_type, confidence
```
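To see the keyword-scoring idea in isolation, here is a condensed, self-contained sketch of the invoice branch; the sample text and keyword list are illustrative, not part of the toolkit:

```python
# Condensed version of the keyword scoring used by classify_document (illustrative)
sample = "INVOICE #2024-117\nBill To: Acme Corp\nDue Date: 2024-07-01\nTotal Amount: $1,250.00"
invoice_keywords = ['invoice', 'bill to', 'due date', 'total amount']

# Score is the fraction of keywords present in the lowercased text
score = sum(1 for w in invoice_keywords if w in sample.lower()) / len(invoice_keywords)
print(score)  # 1.0
```

Because all four keywords appear, the score is 1.0; a document matching fewer than 30% of any type's keywords would fall through to UNKNOWN.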
Core Concepts
Document Processing Pipeline
| Stage | Component | Output |
|---|---|---|
| Ingestion | File reader + validation | Validated PDF file |
| Classification | Keyword + layout analysis | Document type + confidence |
| Extraction | Type-specific extractor | Structured data dictionary |
| Validation | Schema + business rules | Validated data + error list |
| Output | Formatter + writer | JSON, CSV, or database record |
| Monitoring | Logger + metrics | Processing stats + alerts |
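The stages above can be wired together in a single driver function. The sketch below glues classification, extraction, and validation with stand-in implementations; the function and field names are hypothetical, not part of the toolkit's API:

```python
# Minimal pipeline glue: classification -> extraction -> validation (hypothetical names)
def run_pipeline(file_path: str, text: str, extractors: dict) -> dict:
    # Classification stage: keyword scoring, as in the Quick Start
    keywords = {
        'invoice': ['invoice', 'bill to', 'total amount'],
        'receipt': ['receipt', 'transaction'],
    }
    scores = {t: sum(w in text.lower() for w in ws) / len(ws) for t, ws in keywords.items()}
    doc_type = max(scores, key=scores.get)

    # Extraction stage: dispatch to a type-specific extractor
    extract = extractors.get(doc_type, lambda _: {})
    data = extract(text)

    # Validation stage: here, just flag fields the extractor could not fill
    errors = [field for field, value in data.items() if value is None]
    return {'file': file_path, 'type': doc_type, 'data': data, 'errors': errors}

result = run_pipeline(
    'inv.pdf',
    'Invoice\nBill To: Acme\nTotal Amount: $10',
    {'invoice': lambda t: {'total': '$10' if 'Total' in t else None}},
)
print(result['type'], result['errors'])  # invoice []
```

A production version would add the ingestion, output, and monitoring stages around this core.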
Template-Based Extraction
```python
import re

import pdfplumber


class InvoiceExtractor:
    """Extract structured data from invoices using layout analysis."""

    def extract(self, pdf_path: str) -> dict:
        with pdfplumber.open(pdf_path) as pdf:
            first_page = pdf.pages[0]
            text = first_page.extract_text() or ''
            tables = first_page.extract_tables()
            data = {
                'invoice_number': self._find_pattern(text, r'Invoice\s*#?\s*[:.]?\s*(\S+)'),
                'date': self._find_pattern(text, r'Date\s*[:.]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'),
                'total': self._find_pattern(text, r'Total\s*[:.]?\s*\$?([\d,]+\.?\d*)'),
                'line_items': self._extract_line_items(tables),
            }
            return data

    def _find_pattern(self, text: str, pattern: str) -> str | None:
        match = re.search(pattern, text, re.IGNORECASE)
        return match.group(1) if match else None

    def _extract_line_items(self, tables: list) -> list[dict]:
        items = []
        for table in tables:
            if not table or len(table) < 2:
                continue
            headers = [h.lower() if h else '' for h in table[0]]
            if any('description' in h or 'item' in h for h in headers):
                for row in table[1:]:
                    if any(cell for cell in row):
                        items.append(dict(zip(headers, row)))
        return items
```
Error Recovery and Monitoring
```python
import time
from collections import Counter


class ProcessingMonitor:
    def __init__(self):
        self.stats = Counter()
        self.errors = []
        self.start_time = time.time()

    def record_success(self, doc_type: str):
        self.stats['total'] += 1
        self.stats['success'] += 1
        self.stats[f'type:{doc_type}'] += 1

    def record_error(self, file_path: str, error: str):
        self.stats['total'] += 1
        self.stats['errors'] += 1
        self.errors.append({'file': file_path, 'error': error, 'time': time.time()})

    def report(self) -> dict:
        elapsed = time.time() - self.start_time
        return {
            'total_processed': self.stats['total'],
            'successful': self.stats['success'],
            'failed': self.stats['errors'],
            'success_rate': self.stats['success'] / max(self.stats['total'], 1),
            'elapsed_seconds': round(elapsed, 2),
            'docs_per_second': round(self.stats['total'] / max(elapsed, 1), 2),
            'errors': self.errors[-10:],  # Last 10 errors
        }
```
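The success-rate arithmetic in `report()` can be checked on its own. This sketch replays a small batch through the same `Counter` bookkeeping:

```python
from collections import Counter

stats = Counter()
for outcome in ['ok', 'ok', 'ok', 'error']:
    stats['total'] += 1
    stats['success' if outcome == 'ok' else 'errors'] += 1

# Same computation as ProcessingMonitor.report(); max(..., 1) avoids division by zero
success_rate = stats['success'] / max(stats['total'], 1)
print(success_rate)  # 0.75
```

The `max(..., 1)` guard matters: calling `report()` before any document is processed would otherwise divide by zero.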
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| classificationMethod | string | 'keyword' | Method: keyword, layout, or ml-model |
| extractorMap | object | {} | Document type → extractor class mapping |
| ocrEnabled | boolean | true | Enable OCR for image-based pages |
| validationStrict | boolean | false | Fail on validation errors vs warn |
| batchSize | number | 50 | Documents per processing batch |
| retryOnError | number | 2 | Retry attempts for failed documents |
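Assuming the parameters above are supplied as a plain mapping (the listing does not specify the configuration mechanism), a settings block might look like this, using the documented defaults with one mapping filled in:

```python
# Illustrative configuration using the documented defaults
config = {
    'classificationMethod': 'keyword',                # or 'layout', 'ml-model'
    'extractorMap': {'invoice': 'InvoiceExtractor'},  # document type -> extractor class
    'ocrEnabled': True,
    'validationStrict': False,                        # warn rather than fail on validation errors
    'batchSize': 50,
    'retryOnError': 2,
}
print(config['batchSize'])  # 50
```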
Best Practices
- Classify documents before extracting data — Different document types require different extraction strategies. Classify first (invoice vs contract vs report), then apply the appropriate extractor for that document type.
- Use regex patterns for structured fields, AI for unstructured content — Invoice numbers, dates, and amounts follow patterns that regex captures reliably. Summaries, descriptions, and context require AI analysis. Use the right tool for each field type.
- Build monitoring into the pipeline from day one — Track processing rate, success rate, error frequency, and processing time per document. Alerts on declining success rates catch issues before they affect downstream systems.
- Implement a dead letter queue for failed documents — Documents that fail processing after retries should be saved to a separate queue for manual review, not silently dropped. Include the error message and the stage at which processing failed.
- Version your extraction templates — Invoice formats change when vendors update their systems. Version extraction patterns and maintain multiple versions for the same document source. Log which template version successfully extracted each document.
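The dead letter queue practice can be as simple as writing one JSON record per failed document. This is a minimal sketch; the directory layout, helper name, and record fields are assumptions, not part of the toolkit:

```python
import json
import tempfile
import time
from pathlib import Path

def send_to_dead_letter(dlq_dir: str, file_path: str, stage: str, error: str) -> Path:
    """Persist failure metadata for manual review instead of silently dropping the document."""
    dlq = Path(dlq_dir)
    dlq.mkdir(parents=True, exist_ok=True)
    # Record both the error message and the pipeline stage that failed
    record = {'file': file_path, 'stage': stage, 'error': error, 'time': time.time()}
    out = dlq / (Path(file_path).stem + '.dlq.json')
    out.write_text(json.dumps(record, indent=2))
    return out

out = send_to_dead_letter(tempfile.mkdtemp(), 'scans/inv_991.pdf', 'extraction', 'no text layer')
print(out.name)  # inv_991.dlq.json
```

In production you would likely use a queue service rather than a directory, but the principle is the same: failures are durable, inspectable, and replayable.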
Common Issues
Extraction accuracy drops when vendor changes invoice layout — Create multiple extraction patterns per vendor and try them in order. Track extraction confidence scores and alert when confidence drops below threshold for a specific vendor.
OCR introduces errors in numeric fields — Common OCR errors: 0 ↔ O, 1 ↔ l, 5 ↔ S. Post-process OCR output with domain-specific validation: amounts should be valid numbers, dates should parse correctly, totals should match line item sums.
Processing pipeline bottlenecked on I/O — PDF reading is I/O-bound while extraction is CPU-bound. Use async I/O for file reading and process pools for CPU-heavy extraction. Pipeline the stages so I/O and CPU work overlap.
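A minimal shape for that overlap uses a thread pool for the I/O stage; for genuinely CPU-heavy extraction you would move the second stage into a `ProcessPoolExecutor`. The file contents here are faked so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def read_pdf_bytes(path: str) -> bytes:
    # I/O-bound stage: in a real pipeline this reads from disk or object storage
    return f'contents of {path}'.encode()

def extract(data: bytes) -> dict:
    # CPU-bound stage: for heavy parsing, run this in a ProcessPoolExecutor instead
    return {'length': len(data)}

paths = [f'doc_{i}.pdf' for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    blobs = list(pool.map(read_pdf_bytes, paths))  # reads overlap in threads
results = [extract(b) for b in blobs]
print(len(results))  # 4
```

Threads suit the read stage because Python releases the GIL during file I/O; the extraction stage only benefits from processes once per-document CPU time dominates.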