Specialist Ocr Grammar Fixer

Agent for correcting OCR-extracted text. Includes structured workflows, validation checks, and reusable patterns for the OCR extraction team.

Agent · Cliptics · ocr extraction team · v1.0.0 · MIT

Post-processing agent that detects and corrects character-level OCR artifacts, word boundary errors, and punctuation displacement in extracted text.

When to Use This Agent

Choose this agent when you need to:

  • Clean up garbled OCR output where characters like "rn" are misread as "m" or "cl" as "d"
  • Fix word boundary errors including missing spaces, merged words, and incorrectly split tokens
  • Restore proper capitalization and punctuation displaced by optical recognition failures
  • Process business and marketing documents where industry terminology must be accurately restored

Consider alternatives when:

  • Your text is already clean and you need markdown formatting (use Markdown Syntax Strategist)
  • You need to analyze page layout before extraction (use Document Structure Analyzer Companion)

Quick Start

Configuration

name: specialist-ocr-grammar-fixer
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke specialist-ocr-grammar-fixer "Fix OCR errors in scanned-contract-page3.txt"

Example Output

=== OCR Grammar Correction Report ===
Input: scanned-contract-page3.txt (1,847 words)

CORRECTIONS APPLIED: 43 total
  Character confusions: 18
    - "irnportant" → "important" (rn→m)
    - "Iicense" → "license" (I→l)
    - "c0ntract" → "contract" (0→o)

  Word boundary fixes: 12
    - "ofthe" → "of the" (missing space)
    - "re ceive" → "receive" (extra space)
    - "in voicing" → "invoicing" (incorrect split)

  Punctuation repairs: 8
    - ".The" → ". The" (missing space after period)
    - "amount,," → "amount," (duplicate comma)

  Capitalization fixes: 5
    - "monday" → "Monday"
    - "jANuary" → "January"

CONFIDENCE: 94.7% (41 of 43 corrections high-confidence)
LOW-CONFIDENCE FLAGS: 2 corrections marked for human review

Core Concepts

OCR Error Categories Overview

| Aspect | Details |
|---|---|
| Character Confusion | Visually similar characters swapped: rn/m, l/I/1, 0/O, cl/d |
| Word Boundaries | Missing or extra spaces causing merged or split words |
| Punctuation Displacement | Periods, commas, and quotes shifted, duplicated, or dropped |
| Case Corruption | Random capitalization changes from recognition uncertainty |
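The first two categories can be illustrated with a short Python sketch. It is a minimal illustration, not the agent's actual implementation: the `CONFUSIONS` table, the toy `VOCAB` set, and the function names are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical confusion table: fragment as it appears in OCR output -> intended fragment.
CONFUSIONS = {"rn": "m", "cl": "d", "0": "o", "1": "l", "I": "l"}

# Toy vocabulary standing in for a real domain dictionary (assumption).
VOCAB = {"important", "license", "contract", "of", "the", "receive", "invoicing"}


def fix_characters(word: str) -> Optional[str]:
    """Try each single confusion substitution; return the first result found in VOCAB."""
    for bad, good in CONFUSIONS.items():
        start = word.find(bad)
        while start != -1:
            candidate = word[:start] + good + word[start + len(bad):]
            if candidate.lower() in VOCAB:
                return candidate
            start = word.find(bad, start + 1)
    return None  # no dictionary-backed correction found


def split_merged(word: str) -> Optional[str]:
    """Insert one space if both halves are known words ("ofthe" -> "of the")."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in VOCAB and right in VOCAB:
            return f"{left} {right}"
    return None


def join_split(left: str, right: str) -> Optional[str]:
    """Merge two tokens if their concatenation is a known word ("re ceive" -> "receive")."""
    merged = left + right
    return merged if merged in VOCAB else None
```

Checking every candidate against a vocabulary is what keeps a valid word such as "corner" from being rewritten just because it contains "rn".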

Correction Pipeline Architecture

┌─────────────┐     ┌─────────────┐
│  Raw OCR    │────▶│  Character  │
│  Text Input │     │  Analyzer   │
└─────────────┘     └─────────────┘
        │                   │
        ▼                   ▼
┌─────────────┐     ┌─────────────┐
│  Context    │────▶│  Grammar    │
│  Engine     │     │  Restorer   │
└─────────────┘     └─────────────┘
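One way to read the diagram is as a chain of text-to-text stages. The sketch below assumes that reading. The stage names mirror the boxes, but what each stage does internally, and the `PROPER_NOUNS` table, are illustrative guesses rather than documented behavior.

```python
import re

# Hypothetical lookup used by the context stage (assumption, not from the source).
PROPER_NOUNS = {"monday": "Monday", "january": "January"}


def character_analyzer(text: str) -> str:
    """Assumed stage 1: replace a digit zero sandwiched between letters with 'o'."""
    return re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "o", text)


def context_engine(text: str) -> str:
    """Assumed stage 2: restore capitalization of known day and month names."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        return PROPER_NOUNS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", fix, text)


def grammar_restorer(text: str) -> str:
    """Assumed stage 3: collapse duplicated punctuation and add a space after a period."""
    text = re.sub(r"([,;:])\1+", r"\1", text)
    # Only fires between a lowercase letter and a capitalized word, so "U.S." is left alone.
    return re.sub(r"(?<=[a-z])\.(?=[A-Z][a-z])", ". ", text)


def run_pipeline(text: str) -> str:
    """Pass the raw OCR text through each stage in order."""
    for stage in (character_analyzer, context_engine, grammar_restorer):
        text = stage(text)
    return text
```

Keeping each stage as a plain function makes it easy to test stages in isolation or to reorder them.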

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| domain_dictionary | string | "business" | Industry terminology dictionary: business, legal, medical, or technical |
| confidence_threshold | float | 0.85 | Minimum confidence score for auto-applying a correction |
| flag_low_confidence | boolean | true | Mark corrections below threshold for human review instead of applying |
| preserve_formatting | boolean | true | Maintain original line breaks, indentation, and whitespace structure |
| max_corrections_per_word | integer | 2 | Maximum character-level corrections applied to a single word |
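The interaction between confidence_threshold and flag_low_confidence can be sketched as follows. The field names and defaults come from the table. The `FixerConfig` and `Correction` classes and the `apply_or_flag` function are hypothetical. The behavior when flag_low_confidence is false is also an assumption: this sketch skips such corrections, since the source does not say what happens to them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FixerConfig:
    """Defaults mirror the parameter table; the class itself is illustrative."""
    domain_dictionary: str = "business"
    confidence_threshold: float = 0.85
    flag_low_confidence: bool = True
    preserve_formatting: bool = True
    max_corrections_per_word: int = 2


@dataclass
class Correction:
    original: str
    suggestion: str
    confidence: float


def apply_or_flag(
    corrections: List[Correction], config: FixerConfig
) -> Tuple[List[Correction], List[Correction]]:
    """Split corrections into (auto-applied, flagged for human review)."""
    applied, flagged = [], []
    for c in corrections:
        if c.confidence >= config.confidence_threshold:
            applied.append(c)
        elif config.flag_low_confidence:
            flagged.append(c)
        # Assumption: with flagging off, low-confidence corrections are skipped.
    return applied, flagged
```

With the defaults, a correction scored 0.97 would be applied and one scored 0.60 would be queued for review.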

Best Practices

  1. Select the Correct Domain Dictionary: OCR correction relies heavily on context to disambiguate visually similar characters. A legal dictionary resolves "ternns" to "terms" while a medical dictionary might resolve "rnedical" to "medical." Choosing the wrong domain leads to incorrect corrections.

  2. Review Low-Confidence Corrections Before Accepting: Corrections below the confidence threshold are flagged rather than applied automatically. These often involve ambiguous cases where multiple valid words could match the garbled input. Human judgment is essential for these edge cases.

  3. Run Grammar Fixing Before Markdown Formatting: Character-level errors distort heading detection and list identification in downstream formatters. Cleaning OCR artifacts first ensures that the markdown strategist receives text that accurately represents the document's intended words and structure.

  4. Process Documents Page by Page for Long Files: Very long documents can overwhelm the context window needed for accurate correction. Processing one page at a time provides better surrounding context for each correction and produces more reliable results.

  5. Build Custom Dictionary Extensions for Specialized Vocabulary: If your documents contain proprietary product names, acronyms, or jargon not in the standard dictionaries, add a custom terminology file. This prevents the agent from "correcting" legitimate specialized terms into common English words.
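Practices 4 and 5 can be combined in a small pre-processing wrapper. The sketch below assumes pages are separated by form feed characters (as `pdftotext` output is) and that the custom terminology file holds one term per line with `#` comments. That file format, and the names `load_terms`, `correct_pages`, and `toy_fix`, are hypothetical.

```python
import re
from typing import Callable, Iterable, Iterator, Set


def load_terms(lines: Iterable[str]) -> Set[str]:
    """Read protected terms, one per line; blank lines and '#' comments are ignored."""
    return {
        ln.strip().lower()
        for ln in lines
        if ln.strip() and not ln.lstrip().startswith("#")
    }


def correct_pages(
    text: str, fix_word: Callable[[str], str], protected: Set[str]
) -> Iterator[str]:
    """Yield corrected pages one at a time, leaving protected terms untouched."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return word if word.lower() in protected else fix_word(word)

    for page in text.split("\f"):  # assumption: form feed marks a page break
        # Substituting word by word keeps line breaks and spacing intact.
        yield re.sub(r"[A-Za-z0-9]+", repl, page)


def toy_fix(word: str) -> str:
    """Deliberately naive stand-in for the real corrector: every 'rn' becomes 'm'."""
    return word.replace("rn", "m")
```

With "Hornet" in the protected set, the naive fixer no longer turns the product name into "Homet", while "irnportant" is still corrected.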

Common Issues

  1. Agent over-corrects intentional abbreviations and acronyms: Short abbreviations like "mgmt" or "dept" may be flagged as OCR errors. Add frequently used abbreviations to a whitelist file referenced by the agent to prevent false corrections on intentional shorthand.

  2. Character confusion corrections cascade incorrectly: Fixing one character can make adjacent characters appear wrong, triggering a chain of incorrect corrections. The max_corrections_per_word parameter limits cascading, but severely garbled words may need manual intervention.

  3. Mixed-language documents produce erratic corrections: The agent assumes a single language per document. Documents containing passages in multiple languages confuse the context engine. Split multilingual documents into language-homogeneous sections before processing.
