Architect OCR Helper

End-to-end OCR pipeline orchestration agent that coordinates document analysis, text extraction, error correction, and formatted output generation across multi-page document batches.

When to Use This Agent

Choose this agent when you need to:

Process large batches of scanned documents through a complete OCR pipeline from image to clean text
Orchestrate multiple OCR specialist agents in the correct sequence for optimal results
Configure and manage OCR engine parameters for different document types and quality levels
Generate structured output in multiple formats (markdown, JSON, plain text) from scanned originals

Consider alternatives when:

You only need one specific step like grammar correction (use Specialist OCR Grammar Fixer directly)
Your documents are already digital text and do not require optical character recognition

Quick Start

Configuration


name: architect-ocr-helper
type: agent
category: ocr-extraction-team

Example Invocation


claude agent:invoke architect-ocr-helper "Process the scanned invoices in /docs/batch-042/ and output clean markdown"

Example Output

=== OCR Pipeline Execution Report ===
Batch: /docs/batch-042/ (24 documents, 87 pages total)

PIPELINE STAGES:
  1. Structure Analysis:  87/87 pages analyzed (avg confidence: 0.93)
  2. OCR Extraction:      87/87 pages extracted (engine: Tesseract 5.x)
  3. Grammar Correction:  1,247 fixes applied across 24 documents
  4. Markdown Formatting: 24 documents formatted with heading hierarchy

RESULTS:
  Output directory: /docs/batch-042/output/
  Format: markdown (.md files)
  Total words extracted: 48,392
  Overall accuracy estimate: 96.8%

FLAGGED FOR REVIEW: 3 documents with pages below 90% confidence
  - invoice-2025-0417.pdf (page 2: 84.2% — heavy watermark)
  - receipt-scan-089.jpg (single page: 87.1% — low resolution)
  - contract-amendment.pdf (page 5: 88.9% — handwritten notes)

Core Concepts

Pipeline Stage Sequencing Overview

Stage	Responsibilities
Stage 1: Structure Analysis	Region segmentation, reading order, and template matching
Stage 2: Text Extraction	OCR engine processes each region with type-appropriate settings
Stage 3: Error Correction	Character confusion, word boundary, and punctuation repair
Stage 4: Format Output	Clean text converted to target format with proper structure
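
The four stages above run strictly in order, each consuming the previous stage's result. The following is a minimal sketch of that sequencing, not the agent's actual implementation: `PageState` and the four stage functions are hypothetical names, and their bodies are placeholders standing in for real layout analysis, OCR, and correction logic.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageState:
    """Hypothetical container handed from one stage to the next."""
    source: str
    regions: List[str] = field(default_factory=list)
    text: str = ""
    output: str = ""
    history: List[str] = field(default_factory=list)


def analyze_structure(page: PageState) -> PageState:
    # Placeholder: a real analyzer would segment regions and set reading order.
    page.regions = ["header", "body"]
    page.history.append("structure_analysis")
    return page


def extract_text(page: PageState) -> PageState:
    # Placeholder: pretend the OCR engine returned noisy text for each region.
    page.text = " ".join(f"[{region}] samp1e text" for region in page.regions)
    page.history.append("text_extraction")
    return page


def correct_errors(page: PageState) -> PageState:
    # Placeholder: repair one character confusion (digit 1 read for letter l).
    page.text = page.text.replace("samp1e", "sample")
    page.history.append("error_correction")
    return page


def format_output(page: PageState) -> PageState:
    # Placeholder: wrap the cleaned text in a minimal markdown document.
    page.output = f"# {page.source}\n\n{page.text}"
    page.history.append("format_output")
    return page


STAGES: List[Callable[[PageState], PageState]] = [
    analyze_structure,
    extract_text,
    correct_errors,
    format_output,
]


def run_pipeline(page: PageState, stages=STAGES) -> PageState:
    """Apply each stage in order, feeding the result into the next one."""
    for stage in stages:
        page = stage(page)
    return page
```

Keeping the stages in a plain list makes the ordering explicit and lets a single stage be swapped out without touching the others.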

OCR Pipeline Architecture

┌─────────────┐     ┌─────────────┐
│  Document    │────▶│  Structure  │
│  Ingestion   │     │  Analyzer   │
└─────────────┘     └─────────────┘
        │                   │
        ▼                   ▼
┌─────────────┐     ┌─────────────┐
│  OCR Engine  │────▶│  Post-      │
│  Extraction  │     │  Processor  │
└─────────────┘     └─────────────┘

Configuration

Parameter	Type	Default	Description
ocr_engine	string	"tesseract"	OCR engine to use: tesseract, easyocr, or paddleocr
output_format	string	"markdown"	Target output format: markdown, json, plaintext, or docx
batch_concurrency	integer	4	Number of documents to process in parallel within a batch
quality_threshold	float	0.90	Minimum per-page confidence score (0.0 to 1.0); pages scoring below it are flagged for manual review
enable_preprocessing	boolean	true	Apply image preprocessing (deskew, denoise, contrast) before OCR
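
One way to keep these parameters consistent is to validate them before a batch starts. The sketch below mirrors the table's names, types, defaults, and allowed values; the `PipelineConfig` class itself and its validation rules (for example, requiring `batch_concurrency` to be at least 1) are assumptions for illustration, not part of the agent.

```python
from dataclasses import dataclass

# Allowed values copied from the parameter table above.
ALLOWED_ENGINES = {"tesseract", "easyocr", "paddleocr"}
ALLOWED_FORMATS = {"markdown", "json", "plaintext", "docx"}


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical typed view of the configuration table."""
    ocr_engine: str = "tesseract"
    output_format: str = "markdown"
    batch_concurrency: int = 4
    quality_threshold: float = 0.90
    enable_preprocessing: bool = True

    def __post_init__(self) -> None:
        if self.ocr_engine not in ALLOWED_ENGINES:
            raise ValueError(f"unknown ocr_engine: {self.ocr_engine!r}")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unknown output_format: {self.output_format!r}")
        # Assumption: at least one document must be processed at a time.
        if self.batch_concurrency < 1:
            raise ValueError("batch_concurrency must be at least 1")
        # Assumption: the threshold is a fraction between 0 and 1.
        if not 0.0 <= self.quality_threshold <= 1.0:
            raise ValueError("quality_threshold must be between 0.0 and 1.0")
```

For example, `PipelineConfig(ocr_engine="easyocr", output_format="json")` keeps the remaining defaults, while a misspelled engine name fails immediately instead of partway through a batch.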

Best Practices

Enable Image Preprocessing for Scanned Documents
Raw scans often have skew, noise, and low contrast that degrade OCR accuracy. The preprocessing stage applies deskew correction, adaptive thresholding, and noise reduction that can improve character recognition rates by 10-15%.

Match OCR Engine to Document Characteristics
Tesseract excels on clean printed text in Latin scripts. EasyOCR handles multilingual documents and curved text better. PaddleOCR offers superior performance on dense Asian-language documents. Select the engine that matches your document profile.

Set Batch Concurrency Based on System Resources
Each concurrent OCR process consumes significant CPU and memory. On machines with 8 GB RAM, keep batch_concurrency at 2. Systems with 32 GB or more can safely handle 6-8 concurrent document processes without thrashing.

Use the Quality Threshold to Prioritize Human Review
Not every document needs manual verification. Setting quality_threshold to 0.90 ensures that only pages with questionable accuracy are flagged, while high-confidence pages flow through without bottlenecking the review queue.

Archive Original Scans Alongside Extracted Text
Always retain the original image or PDF files after OCR processing. Extracted text may contain undetected errors, and having the originals available allows re-processing with improved settings or different engines in the future.
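
The concurrency and quality-threshold guidance above can be sketched in a few lines. Both function names are hypothetical. The 8 GB and 32 GB breakpoints come from the text; the value returned for machines in between is an assumption that simply falls back to the documented default of 4.

```python
from typing import Dict, List, Tuple


def suggest_concurrency(ram_gb: float) -> int:
    """Map available RAM to a batch_concurrency value."""
    if ram_gb <= 8:
        return 2  # from the guidance: keep concurrency at 2 on 8 GB machines
    if ram_gb >= 32:
        return 6  # lower end of the 6-8 range given for 32 GB or more
    return 4      # assumption: use the documented default in between


def flag_for_review(
    page_scores: Dict[Tuple[str, int], float],
    threshold: float = 0.90,
) -> List[Tuple[str, int, float]]:
    """Return (document, page, score) for pages below the threshold, worst first."""
    flagged = [
        (doc, page, score)
        for (doc, page), score in page_scores.items()
        if score < threshold
    ]
    return sorted(flagged, key=lambda item: item[2])
```

Sorting the flagged pages by ascending score puts the least reliable pages at the front of the review queue.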

Common Issues

Batch processing fails midway on a corrupted file
A single corrupted PDF or unsupported image format can halt the entire batch. Enable skip_on_error: true to allow the pipeline to log the failure and continue processing remaining documents rather than aborting the batch.

Low accuracy on documents with watermarks or background patterns
Watermarks and textured backgrounds confuse OCR engines by introducing noise characters. Enable remove_background: true in the preprocessing stage to apply background subtraction before text extraction.

Output markdown has incorrect heading levels across merged multi-page documents
When multiple pages are combined into a single output file, heading hierarchies from individual pages may conflict. Set normalize_headings: true to rewrite heading levels so they follow a consistent hierarchy across the merged document.
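
To illustrate what skip_on_error changes, here is a rough sketch of a batch loop that records a failure and moves on instead of aborting. The `process_batch` function and its `process_one` callback are hypothetical; how the agent actually implements the option is not documented here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    skip_on_error: bool = True,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Process each document; optionally log failures instead of aborting."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:
            if not skip_on_error:
                raise  # default-off behaviour: one bad file halts the batch
            failures.append((path, str(exc)))  # record the failure and continue
    return results, failures
```

Returning the failures alongside the results keeps the skipped files visible, so they can be re-scanned or converted and retried later.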
