Specialist OCR Grammar Fixer

Post-processing agent that detects and corrects character-level OCR artifacts, word boundary errors, and punctuation displacement in extracted text.

When to Use This Agent

Choose this agent when you need to:

Clean up garbled OCR output where characters like "rn" are misread as "m" or "cl" as "d"
Fix word boundary errors including missing spaces, merged words, and incorrectly split tokens
Restore proper capitalization and punctuation displaced by optical recognition failures
Process business and marketing documents where industry terminology must be accurately restored

Consider alternatives when:

Your text is already clean and you need markdown formatting (use Markdown Syntax Strategist)
You need to analyze page layout before extraction (use Document Structure Analyzer Companion)

Quick Start

Configuration


name: specialist-ocr-grammar-fixer
type: agent
category: ocr-extraction-team

Example Invocation


claude agent:invoke specialist-ocr-grammar-fixer "Fix OCR errors in scanned-contract-page3.txt"

Example Output

=== OCR Grammar Correction Report ===
Input: scanned-contract-page3.txt (1,847 words)

CORRECTIONS APPLIED: 43 total
  Character confusions: 18
    - "irnportant" → "important" (rn→m)
    - "Iicense" → "license" (I→l)
    - "c0ntract" → "contract" (0→o)

  Word boundary fixes: 12
    - "ofthe" → "of the" (missing space)
    - "re ceive" → "receive" (extra space)
    - "in voicing" → "invoicing" (incorrect split)

  Punctuation repairs: 8
    - ".The" → ". The" (missing space after period)
    - "amount,," → "amount," (duplicate comma)

  Capitalization fixes: 5
    - "monday" → "Monday"
    - "jANuary" → "January"

CONFIDENCE: 94.7% (41 of 43 corrections high-confidence)
LOW-CONFIDENCE FLAGS: 2 corrections marked for human review

Core Concepts

OCR Error Categories Overview

Aspect	Details
Character Confusion	Visually similar characters swapped: rn/m, l/I/1, 0/O, cl/d
Word Boundaries	Missing or extra spaces causing merged or split words
Punctuation Displacement	Periods, commas, and quotes shifted, duplicated, or dropped
Case Corruption	Random capitalization changes from recognition uncertainty
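The character-confusion category can be illustrated with a small sketch. It assumes a hand-written confusion map and a plain set of valid words; `CONFUSIONS`, `candidates`, and `correct` are hypothetical names for illustration, not the agent's actual internals.

```python
# Hypothetical confusion map (not the agent's real table):
# what the OCR engine emitted -> what was probably printed.
CONFUSIONS = {"rn": "m", "cl": "d", "0": "o", "1": "l", "I": "l"}


def candidates(word, max_swaps=2):
    """Return every string reachable from `word` in at most `max_swaps` single swaps."""
    seen = {word}
    frontier = {word}
    for _ in range(max_swaps):
        reached = set()
        for current in frontier:
            for wrong, right in CONFUSIONS.items():
                start = current.find(wrong)
                while start != -1:
                    # Swap one occurrence at a time so each swap counts separately.
                    reached.add(current[:start] + right + current[start + len(wrong):])
                    start = current.find(wrong, start + 1)
        frontier = reached - seen
        seen |= reached
    return seen


def correct(word, dictionary, max_swaps=2):
    """Return a dictionary word reachable by confusion swaps, else the word unchanged."""
    if word.lower() in dictionary:
        return word  # already valid: "modern" must not be rewritten to "modem"
    matches = sorted(c for c in candidates(word, max_swaps) if c.lower() in dictionary)
    return matches[0] if matches else word


WORDS = {"important", "license", "contract", "modern"}
fixed = [correct(w, WORDS) for w in ["irnportant", "Iicense", "c0ntract", "modern"]]
```

Checking the dictionary before generating candidates is the key design choice: a word that is already valid is left alone even when it contains a confusable sequence such as "rn".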

Correction Pipeline Architecture

┌──────────────┐     ┌─────────────┐
│  Raw OCR     │────▶│  Character  │
│  Text Input  │     │  Analyzer   │
└──────────────┘     └─────────────┘
        │                   │
        ▼                   ▼
┌──────────────┐     ┌─────────────┐
│  Context     │────▶│  Grammar    │
│  Engine      │     │  Restorer   │
└──────────────┘     └─────────────┘
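One way to read the diagram is as three cooperating functions, sketched below. This is an interpretation under stated assumptions, not the agent's implementation: the function names mirror the boxes, the confusion pairs are hypothetical, and the context engine is reduced to a vocabulary lookup.

```python
import re

# Hypothetical confusion pairs; the agent's real table is not documented here.
CONFUSIONS = [("rn", "m"), ("0", "o")]


def character_analyzer(text):
    """Map each suspicious token to its alternative readings."""
    proposals = {}
    for token in set(re.findall(r"[A-Za-z0-9]+", text)):
        alternatives = {token.replace(wrong, right) for wrong, right in CONFUSIONS}
        alternatives.discard(token)
        if alternatives:
            proposals[token] = alternatives
    return proposals


def context_engine(vocabulary):
    """Return a predicate that says whether a word is valid in this domain."""
    known = {word.lower() for word in vocabulary}
    return lambda word: word.lower() in known


def grammar_restorer(text, proposals, is_known):
    """Apply proposals the context accepts, then repair spacing after periods."""
    for token, alternatives in proposals.items():
        accepted = sorted(alt for alt in alternatives if is_known(alt))
        if accepted and not is_known(token):
            text = re.sub(rf"\b{re.escape(token)}\b", accepted[0], text)
    # Naive rule: a period directly followed by a capital letter starts a sentence.
    return re.sub(r"\.(?=[A-Z])", ". ", text)


def run_pipeline(text, vocabulary):
    """Raw text -> character proposals + context -> restored text."""
    return grammar_restorer(text, character_analyzer(text), context_engine(vocabulary))


vocab = {"the", "contract", "is", "important", "end"}
result = run_pipeline("The c0ntract is irnportant.The end", vocab)
```

The period rule is deliberately naive and would mis-handle abbreviations such as "U.S."; a real restorer would need the context engine's help there as well.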

Configuration

Parameter	Type	Default	Description
domain_dictionary	string	"business"	Industry terminology dictionary: business, legal, medical, or technical
confidence_threshold	float	0.85	Minimum confidence score for auto-applying a correction
flag_low_confidence	boolean	true	Mark corrections below threshold for human review instead of applying
preserve_formatting	boolean	true	Maintain original line breaks, indentation, and whitespace structure
max_corrections_per_word	integer	2	Maximum character-level corrections applied to a single word
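The interaction between these parameters can be sketched as a small decision function. The `Settings` class and `decide` function are hypothetical illustrations; in particular, the behavior when flag_low_confidence is false is not specified above, so the sketch assumes such corrections are silently skipped.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Mirror of the parameter table, using the documented defaults."""
    domain_dictionary: str = "business"
    confidence_threshold: float = 0.85
    flag_low_confidence: bool = True
    preserve_formatting: bool = True
    max_corrections_per_word: int = 2


def decide(confidence, edits_in_word, settings):
    """Return 'apply', 'flag', or 'skip' for one proposed correction."""
    if edits_in_word > settings.max_corrections_per_word:
        return "skip"  # too many character edits in one word: likely a cascade
    if confidence >= settings.confidence_threshold:
        return "apply"
    # ASSUMPTION: with flagging disabled, low-confidence corrections are dropped.
    return "flag" if settings.flag_low_confidence else "skip"


defaults = Settings()
outcomes = [decide(0.90, 1, defaults), decide(0.60, 1, defaults), decide(0.95, 3, defaults)]
```

With the defaults, a single-edit correction at 0.90 confidence is applied, the same correction at 0.60 is flagged for review, and any correction needing three edits in one word is skipped regardless of confidence.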

Best Practices

Select the Correct Domain Dictionary: OCR correction relies heavily on context to disambiguate visually similar characters, and the domain dictionary determines which candidate words count as valid. A legal dictionary resolves "ternns" to "terms", while a medical dictionary might resolve "rnedical" to "medical". Choosing the wrong domain produces miscorrections.
Review Low-Confidence Corrections Before Accepting: Corrections below the confidence threshold are flagged rather than applied automatically. These often involve ambiguous cases where multiple valid words could match the garbled input. Human judgment is essential for these edge cases.
Run Grammar Fixing Before Markdown Formatting: Character-level errors distort heading detection and list identification in downstream formatters. Cleaning OCR artifacts first ensures that the Markdown Syntax Strategist receives text that accurately represents the document's intended words and structure.
Process Documents Page by Page for Long Files: Very long documents can overwhelm the context window needed for accurate correction. Processing one page at a time keeps the surrounding context for each correction within that window and produces more reliable results.
Build Custom Dictionary Extensions for Specialized Vocabulary: If your documents contain proprietary product names, acronyms, or jargon not in the standard dictionaries, add a custom terminology file. This prevents the agent from "correcting" legitimate specialized terms into common English words.
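Two of these practices, page-by-page processing and custom terminology, lend themselves to a short sketch. The file format (one term per line, "#" for comments) and the form-feed page separator are assumptions for illustration; the agent's actual terminology format is not documented here.

```python
def load_terms(lines):
    """Parse a hypothetical terminology file: one term per line, '#' starts a comment."""
    terms = set()
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            terms.add(stripped.lower())
    return terms


def fix_by_page(text, fix_page, page_break="\f"):
    """Apply `fix_page` to each page separately and rejoin with the same separator.

    ASSUMPTION: pages are separated by a form feed, as some text extractors emit.
    """
    return page_break.join(fix_page(page) for page in text.split(page_break))


custom = load_terms(["# product names", "Acmeflow", "", "  KPI  "])
paged = fix_by_page("page one\fpage two", str.upper)
```

In practice `fix_page` would be whatever call sends a single page to the agent; `str.upper` stands in here only to show that each page is handled independently and the page breaks survive.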

Common Issues

Agent over-corrects intentional abbreviations and acronyms: Short abbreviations like "mgmt" or "dept" may be mistaken for OCR errors. Add frequently used abbreviations to a whitelist file referenced by the agent to prevent false corrections on intentional shorthand.
Character confusion corrections cascade incorrectly: Fixing one character can make adjacent characters appear wrong, triggering a chain of incorrect corrections. The max_corrections_per_word parameter limits cascading, but severely garbled words may need manual intervention.
Mixed-language documents produce erratic corrections: The agent assumes a single language per document, so passages in other languages confuse the context engine. Split multilingual documents into single-language sections before processing.
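The whitelist remedy for over-corrected abbreviations amounts to a guard in front of the corrector. The sketch below is a minimal illustration; `protect` and the stand-in corrector are hypothetical and do not reflect how the agent reads its whitelist file.

```python
def protect(tokens, whitelist, corrector):
    """Run `corrector` on each token unless the token is whitelisted (case-insensitive)."""
    allowed = {term.lower() for term in whitelist}
    return [tok if tok.lower() in allowed else corrector(tok) for tok in tokens]


# Stand-in corrector that wrongly "fixes" an abbreviation, to show the guard working.
REWRITES = {"mgmt": "might", "irnportant": "important"}


def overeager(token):
    return REWRITES.get(token, token)


guarded = protect(["mgmt", "irnportant", "Dept"], {"mgmt", "dept"}, overeager)
```

Whitelisted tokens pass through untouched while genuine OCR errors still reach the corrector.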
