

Text Comparison Assistant

Line-by-line comparison specialist that identifies discrepancies between extracted OCR text and reference documents with categorized severity reporting.

When to Use This Agent

Choose this agent when you need to:

  • Perform systematic diff analysis between raw OCR output and a corrected or reference markdown file
  • Categorize errors by type (spelling, missing content, formatting) and severity (critical, major, minor)
  • Generate an actionable comparison report with line-number references and suggested fixes
  • Establish an accuracy percentage baseline for OCR pipeline quality tracking

Consider alternatives when:

  • You need to extract text from images initially (use the Visual Analysis Consultant)
  • Your comparison involves non-textual content such as charts, diagrams, or photographs

Quick Start

Configuration

name: text-comparison-assistant
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke text-comparison-assistant "Compare raw OCR output in ocr-raw.txt against verified reference doc reference.md"

Example Output

Comparison Summary: ocr-raw.txt vs reference.md
Overall Accuracy: 96.3%
Total Discrepancies: 27

Critical (4): Missing paragraphs on lines 88-91, 204-210
Major (9): Spelling errors: "recieve" → "receive" (line 34), "seperate" → "separate" (line 72) ...
Minor (14): Formatting: bullet marker "•" vs "-" (lines 15, 23, 41), heading level mismatch (line 1)

Detailed Breakdown:
  [CRITICAL] Lines 88-91: Entire paragraph absent from extracted text
    Reference: "The fiscal quarter ended with a 12% increase..."
    Extracted: (missing)
    Likely cause: Multi-column layout caused paragraph skip

Core Concepts

Comparison Methodology Overview

  • Alignment Strategy: Longest common subsequence (LCS) based line alignment with a fuzzy-matching fallback
  • Error Taxonomy: Content (missing/extra/modified), Spelling (substitution/transposition), Formatting (markers/levels), Structural (merge/split)
  • Severity Levels: Critical (content loss), Major (multiple errors per section), Minor (cosmetic formatting)
  • Accuracy Metric: Character-level accuracy weighted by severity, reported as a single percentage
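A minimal sketch of the alignment strategy, using Python's difflib as a stand-in LCS engine and the 0.85 fuzzy threshold from the configuration section (function names are illustrative, not the agent's actual API):

```python
import difflib
from itertools import zip_longest

FUZZY_THRESHOLD = 0.85  # mirrors the fuzzy_threshold default

def align_lines(reference, extracted, threshold=FUZZY_THRESHOLD):
    """Return (ref_line, ext_line, matched) triples; missing sides are None."""
    triples = []
    matcher = difflib.SequenceMatcher(a=reference, b=extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            triples += [(r, e, True)
                        for r, e in zip(reference[i1:i2], extracted[j1:j2])]
        else:
            # Pair leftover lines positionally; fuzzy-match replaced lines.
            for r, e in zip_longest(reference[i1:i2], extracted[j1:j2]):
                ok = (r is not None and e is not None and
                      difflib.SequenceMatcher(a=r, b=e).ratio() >= threshold)
                triples.append((r, e, ok))
    return triples
```

With this scheme, a near-match like "beta" vs "betta" (similarity ≈ 0.89) stays paired as a spelling discrepancy instead of being reported as one deletion plus one insertion.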

Diff Pipeline Architecture

┌──────────────────┐     ┌──────────────────┐
│  Reference Doc   │────▶│  Line Alignment  │
│  (ground truth)  │     │  Engine          │
└──────────────────┘     └──────────────────┘
        │                        │
        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐
│  Extracted Text  │────▶│  Token-Level     │
│  (OCR output)    │     │  Diff Analyzer   │
└──────────────────┘     └──────────────────┘
        │                        │
        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐
│  Error Classifier│────▶│  Severity Report │
│  & Categorizer   │     │  Generator       │
└──────────────────┘     └──────────────────┘

Configuration

  • fuzzy_threshold (float, default 0.85): Similarity ratio above which two lines are considered matching despite minor differences
  • severity_weights (object, default {critical: 10, major: 3, minor: 1}): Weights used to compute the weighted accuracy score
  • ignore_whitespace (boolean, default false): When true, collapse all whitespace before comparison to focus on content
  • case_sensitive (boolean, default true): Whether character comparisons are case-sensitive
  • context_lines (integer, default 2): Number of surrounding lines included in discrepancy reports for context
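The severity_weights parameter feeds the weighted accuracy metric. The exact formula is not documented here; one plausible reading, sketched in Python with hypothetical names, scales each discrepancy's affected character span by its severity weight:

```python
# Hypothetical sketch only: the agent's actual metric may differ.
SEVERITY_WEIGHTS = {"critical": 10, "major": 3, "minor": 1}

def weighted_accuracy(total_chars, discrepancies, weights=SEVERITY_WEIGHTS):
    """discrepancies: iterable of (severity, affected_char_count) pairs."""
    penalty = sum(weights[sev] * chars for sev, chars in discrepancies)
    # Clamp so a few heavily weighted errors cannot push the score below zero.
    return max(0.0, 1.0 - penalty / total_chars)
```

For example, a single 10-character minor error in a 1000-character document costs 1% under these weights, while the same span flagged critical costs 10%.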

Best Practices

  1. Normalize Line Endings Before Comparison
     Mixed line endings (CRLF vs LF) create phantom discrepancies that obscure real errors. Normalize both documents to the same line-ending format before running the comparison to ensure clean, meaningful diffs.

  2. Separate Content Errors from Formatting Errors
     Mixing content and formatting discrepancies in a single list makes triage difficult. Presenting them in distinct sections allows content reviewers to focus on accuracy while a separate formatting pass handles cosmetic issues, keeping each review focused and efficient.

  3. Quote Both Versions in Every Discrepancy
     Always include the exact text from both the reference and extracted documents side by side. This eliminates ambiguity about what changed and allows reviewers to make corrections without switching between files, which significantly speeds up the review cycle.

  4. Track Accuracy Over Time
     Logging the accuracy percentage from each comparison run creates a historical trend that reveals whether your OCR pipeline is improving or degrading. Sudden drops often correlate with source-material changes (new scanner, different font, lower DPI).

  5. Handle Structural Differences Explicitly
     Merged or split paragraphs are among the hardest discrepancies to detect with simple line-based diffs. Identify structural shifts at the block level first, before drilling into character-level differences, so you do not misclassify a split paragraph as missing content.
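Practices 1 and 5 can be sketched in a few lines of Python (names are illustrative, and difflib stands in for the agent's internal matcher):

```python
import difflib

def normalize_line_endings(text):
    # Practice 1: convert CRLF and lone CR to LF before diffing.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def looks_like_split_paragraph(ref_paragraph, ext_fragments, threshold=0.85):
    # Practice 5: a reference paragraph was likely split (not lost) if
    # rejoining the extracted fragments reproduces it closely.
    joined = " ".join(fragment.strip() for fragment in ext_fragments)
    return difflib.SequenceMatcher(a=ref_paragraph, b=joined).ratio() >= threshold
```

Running the split-paragraph check before reporting missing content keeps a two-line OCR wrap from being escalated to a critical content-loss finding.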

Common Issues

  1. False Positives from Unicode Normalization Differences
     Characters like curly quotes vs straight quotes, em-dashes vs hyphens, or non-breaking spaces vs regular spaces look identical to humans but differ at the byte level. Apply Unicode NFC normalization to both inputs before comparison, or configure the agent to treat typographic variants as equivalent.

  2. Line Alignment Failure on Reordered Sections
     When OCR engines process multi-column or non-linear layouts, entire sections may appear in a different order than in the reference. The alignment engine can misinterpret this as massive insertions and deletions. Use section-header anchoring to realign blocks before performing line-level diffs.

  3. Inflated Error Counts from Repeated Boilerplate
     Headers, footers, and page numbers that repeat across pages can multiply a single OCR error into dozens of reported discrepancies. Deduplicate boilerplate regions before comparison, or group identical repeated errors into a single entry with a count annotation.
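Each of the three issues admits a small preprocessing step. A hedged Python sketch (the variant table, header predicate, and record shapes are assumptions, not the agent's actual interface):

```python
import unicodedata
from collections import Counter

# Issue 1: normalize Unicode and fold common typographic variants.
VARIANTS = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2014": "-", "\u2013": "-",   # em dash / en dash
    "\u00a0": " ",                  # non-breaking space
})

def normalize_for_diff(text):
    return unicodedata.normalize("NFC", text).translate(VARIANTS)

# Issue 2: anchor on section headers so reordered blocks can be realigned.
def split_on_headers(lines, is_header):
    """Map each header line to the block of lines that follows it."""
    sections, current = {}, None
    for line in lines:
        if is_header(line):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

# Issue 3: collapse identical repeated errors into one entry with a count.
def group_repeated_errors(discrepancies):
    """discrepancies: iterable of (reference, extracted) pairs."""
    counts = Counter(discrepancies)
    return [(ref, ext, n) for (ref, ext), n in counts.items()]
```

Comparing section dictionaries keyed by header text sidesteps the reordering problem entirely, and the grouped error tuples carry a count so one bad footer reads as a single annotated entry rather than one entry per page.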
