Architect Ocr Helper

Agent for OCR image preprocessing and optimization. Includes structured workflows, validation checks, and reusable patterns for the OCR extraction team.

Agent | Cliptics | ocr extraction team | v1.0.0 | MIT

End-to-end OCR pipeline orchestration agent that coordinates document analysis, text extraction, error correction, and formatted output generation across multi-page document batches.

When to Use This Agent

Choose this agent when you need to:

  • Process large batches of scanned documents through a complete OCR pipeline from image to clean text
  • Orchestrate multiple OCR specialist agents in the correct sequence for optimal results
  • Configure and manage OCR engine parameters for different document types and quality levels
  • Generate structured output in multiple formats (markdown, JSON, plain text) from scanned originals

Consider alternatives when:

  • You only need one specific step like grammar correction (use Specialist OCR Grammar Fixer directly)
  • Your documents are already digital text and do not require optical character recognition

Quick Start

Configuration

name: architect-ocr-helper
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke architect-ocr-helper "Process the scanned invoices in /docs/batch-042/ and output clean markdown"

Example Output

=== OCR Pipeline Execution Report ===
Batch: /docs/batch-042/ (24 documents, 87 pages total)

PIPELINE STAGES:
  1. Structure Analysis:  87/87 pages analyzed (avg confidence: 0.93)
  2. OCR Extraction:      87/87 pages extracted (engine: Tesseract 5.x)
  3. Grammar Correction:  1,247 fixes applied across 24 documents
  4. Markdown Formatting: 24 documents formatted with heading hierarchy

RESULTS:
  Output directory: /docs/batch-042/output/
  Format: markdown (.md files)
  Total words extracted: 48,392
  Overall accuracy estimate: 96.8%

FLAGGED FOR REVIEW: 3 documents with pages below 90% confidence
  - invoice-2025-0417.pdf (page 2: 84.2%, heavy watermark)
  - receipt-scan-089.jpg (single page: 87.1%, low resolution)
  - contract-amendment.pdf (page 5: 88.9%, handwritten notes)

Core Concepts

Pipeline Stage Sequencing Overview

| Stage | Details |
|---|---|
| Stage 1: Structure Analysis | Region segmentation, reading order, and template matching |
| Stage 2: Text Extraction | OCR engine processes each region with type-appropriate settings |
| Stage 3: Error Correction | Character confusion, word boundary, and punctuation repair |
| Stage 4: Format Output | Clean text converted to target format with proper structure |
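
The four stages above run strictly in order, each consuming the previous stage's result. A minimal Python sketch of that sequencing follows. Every name in it (`PageState`, `run_pipeline`, the stage functions) is hypothetical, and the stage bodies are placeholders rather than the agent's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class PageState:
    """Hypothetical container handed from one stage to the next."""
    source: str                                   # identifier of the scanned page
    regions: List[str] = field(default_factory=list)
    text: str = ""
    output: str = ""
    history: List[str] = field(default_factory=list)


def analyze_structure(page: PageState) -> PageState:
    # Stage 1 placeholder: a real analyzer would segment regions and reading order.
    page.regions = ["body"]
    page.history.append("structure_analysis")
    return page


def extract_text(page: PageState) -> PageState:
    # Stage 2 placeholder: stands in for the OCR engine call on each region.
    page.text = "  Invoice   total  "
    page.history.append("text_extraction")
    return page


def correct_errors(page: PageState) -> PageState:
    # Stage 3 placeholder: collapses stray whitespace as a stand-in for real repairs.
    page.text = " ".join(page.text.split())
    page.history.append("error_correction")
    return page


def format_output(page: PageState) -> PageState:
    # Stage 4 placeholder: wraps the cleaned text in a trivial markdown layout.
    page.output = f"# {page.source}\n\n{page.text}"
    page.history.append("format_output")
    return page


Stage = Callable[[PageState], PageState]
STAGES: List[Stage] = [analyze_structure, extract_text, correct_errors, format_output]


def run_pipeline(source: str, stages: Sequence[Stage] = STAGES) -> PageState:
    """Apply each stage in order, passing the same state object along."""
    page = PageState(source=source)
    for stage in stages:
        page = stage(page)
    return page


page = run_pipeline("page-001.png")
print(page.history)
```

Keeping the stages in a plain list makes the order explicit and lets a caller drop or swap a stage without touching the others.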

OCR Pipeline Architecture

┌─────────────┐     ┌─────────────┐
│  Document   │────▶│  Structure  │
│  Ingestion  │     │  Analyzer   │
└─────────────┘     └─────────────┘
        │                   │
        ▼                   ▼
┌─────────────┐     ┌─────────────┐
│  OCR Engine │────▶│  Post-      │
│  Extraction │     │  Processor  │
└─────────────┘     └─────────────┘

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| ocr_engine | string | "tesseract" | OCR engine to use: tesseract, easyocr, or paddleocr |
| output_format | string | "markdown" | Target output format: markdown, json, plaintext, or docx |
| batch_concurrency | integer | 4 | Number of documents to process in parallel within a batch |
| quality_threshold | float | 0.90 | Minimum per-page accuracy score before flagging for manual review |
| enable_preprocessing | boolean | true | Apply image preprocessing (deskew, denoise, contrast) before OCR |
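
One way to keep these parameters consistent is to validate them before a batch starts. The sketch below mirrors the table's names and defaults in a Python dataclass. The `OcrConfig` class and its validation rules are assumptions made for illustration; the agent's real configuration loader is not documented here.

```python
from dataclasses import dataclass

ALLOWED_ENGINES = ("tesseract", "easyocr", "paddleocr")
ALLOWED_FORMATS = ("markdown", "json", "plaintext", "docx")


@dataclass(frozen=True)
class OcrConfig:
    """Hypothetical typed view of the parameters in the table above."""
    ocr_engine: str = "tesseract"
    output_format: str = "markdown"
    batch_concurrency: int = 4
    quality_threshold: float = 0.90
    enable_preprocessing: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the options listed in the table.
        if self.ocr_engine not in ALLOWED_ENGINES:
            raise ValueError(f"unknown ocr_engine: {self.ocr_engine!r}")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unknown output_format: {self.output_format!r}")
        # Assumed sanity bounds; the source does not state them explicitly.
        if self.batch_concurrency < 1:
            raise ValueError("batch_concurrency must be at least 1")
        if not 0.0 <= self.quality_threshold <= 1.0:
            raise ValueError("quality_threshold must be between 0.0 and 1.0")


config = OcrConfig(ocr_engine="easyocr", batch_concurrency=2)
print(config)
```

Failing fast on a misspelled engine name or an out-of-range threshold is cheaper than discovering the problem halfway through a large batch.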

Best Practices

  1. Enable Image Preprocessing for Scanned Documents: Raw scans often have skew, noise, and low contrast that degrade OCR accuracy. The preprocessing stage applies deskew correction, adaptive thresholding, and noise reduction, which can improve character recognition rates by 10-15%.

  2. Match OCR Engine to Document Characteristics: Tesseract excels on clean printed text in Latin scripts. EasyOCR handles multilingual documents and curved text better. PaddleOCR performs better on dense Asian-language documents. Select the engine that matches your document profile.

  3. Set Batch Concurrency Based on System Resources: Each concurrent OCR process consumes significant CPU and memory. On machines with 8 GB RAM, keep batch_concurrency at 2. Systems with 32 GB or more can safely handle 6-8 concurrent document processes without thrashing.

  4. Use the Quality Threshold to Prioritize Human Review: Not every document needs manual verification. With quality_threshold set to 0.90, only pages scoring below 90% are flagged, while higher-confidence pages flow through without bottlenecking the review queue.

  5. Archive Original Scans Alongside Extracted Text: Always retain the original image or PDF files after OCR processing. Extracted text may contain undetected errors, and keeping the originals allows re-processing later with improved settings or a different engine.
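
Practices 3 and 4 reduce to two small rules, sketched below in Python. Both helper names are hypothetical. The 8 GB and 32 GB breakpoints come from the text above; the value returned for machines in between is an assumption (it reuses the default of 4 from the configuration table), and the page scores in the usage lines are illustrative only.

```python
from typing import Dict, List, Tuple


def suggest_batch_concurrency(ram_gb: float) -> int:
    """Rule-of-thumb concurrency from installed RAM (hypothetical helper)."""
    if ram_gb <= 8:
        return 2      # from the text: keep batch_concurrency at 2 on 8 GB machines
    if ram_gb >= 32:
        return 6      # from the text: 6-8 is workable; the low end is chosen here
    return 4          # assumption: fall back to the documented default in between


def pages_needing_review(
    page_scores: Dict[str, List[float]],
    quality_threshold: float = 0.90,
) -> List[Tuple[str, int, float]]:
    """Return (document, page number, score) for pages below the threshold."""
    flagged = []
    for document, scores in page_scores.items():
        for page_number, score in enumerate(scores, start=1):
            if score < quality_threshold:
                flagged.append((document, page_number, score))
    return flagged


# Illustrative scores only; a real run would take them from the OCR stage.
scores = {"a.pdf": [0.95, 0.842], "b.jpg": [0.871], "c.pdf": [0.90]}
print(suggest_batch_concurrency(16))
print(pages_needing_review(scores))
```

Note that a page scoring exactly 0.90 is not flagged in this sketch, because the comparison is strictly "below the threshold"; whether the agent treats the boundary the same way is not stated in the source.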

Common Issues

  1. Batch processing fails midway on a corrupted file: A single corrupted PDF or unsupported image format can halt the entire batch. Enable skip_on_error: true to allow the pipeline to log the failure and continue processing the remaining documents rather than aborting the batch.

  2. Low accuracy on documents with watermarks or background patterns: Watermarks and textured backgrounds confuse OCR engines by introducing noise characters. Enable remove_background: true in the preprocessing stage to apply background subtraction before text extraction.

  3. Output markdown has incorrect heading levels across merged multi-page documents: When multiple pages are combined into a single output file, heading hierarchies from individual pages may conflict. Set normalize_headings: true to rewrite heading levels so they follow a consistent hierarchy across the merged document.
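
The behavior described for skip_on_error in issue 1 can be pictured as a loop that records a failure and moves on instead of letting the exception end the batch. The Python sketch below illustrates that idea only; `process_batch` and the `process_one` callback are hypothetical names, not the agent's internals.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    skip_on_error: bool = True,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run process_one on every path, collecting results and failures separately."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:
            if not skip_on_error:
                raise  # abort the whole batch, matching the default failure mode
            failures.append((path, str(exc)))  # log the failure and keep going
    return results, failures


def fake_ocr(path: str) -> str:
    # Stand-in for real extraction; treats ".bad" files as corrupted.
    if path.endswith(".bad"):
        raise ValueError("corrupted file")
    return path.upper()


results, failures = process_batch(["a.pdf", "b.bad", "c.pdf"], fake_ocr)
print(results)
print(failures)
```

Returning the failures alongside the results keeps the corrupted files visible, so they can be re-scanned or converted rather than silently dropped.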
