

Text Comparison Assistant

Line-by-line comparison specialist that identifies discrepancies between extracted OCR text and reference documents with categorized severity reporting.

When to Use This Agent

Choose this agent when you need to:

  • Perform systematic diff analysis between raw OCR output and a corrected or reference markdown file
  • Categorize errors by type (spelling, missing content, formatting) and severity (critical, major, minor)
  • Generate an actionable comparison report with line-number references and suggested fixes
  • Establish an accuracy percentage baseline for OCR pipeline quality tracking

Consider alternatives when:

  • You need to extract text from images initially (use the Visual Analysis Consultant)
  • Your comparison involves non-textual content such as charts, diagrams, or photographs

Quick Start

Configuration

name: text-comparison-assistant
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke text-comparison-assistant "Compare raw OCR output in ocr-raw.txt against verified reference doc reference.md"

Example Output

Comparison Summary: ocr-raw.txt vs reference.md
Overall Accuracy: 96.3%
Total Discrepancies: 27

Critical (4): Missing paragraphs on lines 88-91, 204-210
Major (9): Spelling errors: "recieve" → "receive" (line 34), "seperate" → "separate" (line 72) ...
Minor (14): Formatting: bullet marker "•" vs "-" (lines 15, 23, 41), heading level mismatch (line 1)

Detailed Breakdown:
  [CRITICAL] Lines 88-91: Entire paragraph absent from extracted text
    Reference: "The fiscal quarter ended with a 12% increase..."
    Extracted: (missing)
    Likely cause: Multi-column layout caused paragraph skip

Core Concepts

Comparison Methodology Overview

  • Alignment Strategy: Longest common subsequence (LCS) based line alignment with a fuzzy-matching fallback
  • Error Taxonomy: Content (missing/extra/modified), Spelling (substitution/transposition), Formatting (markers/levels), Structural (merge/split)
  • Severity Levels: Critical (content loss), Major (multiple errors per section), Minor (cosmetic formatting)
  • Accuracy Metric: Character-level accuracy weighted by severity, reported as a single percentage
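A minimal sketch of the alignment strategy, using Python's difflib as a stand-in LCS engine and the 0.85 fuzzy threshold from the configuration section (function names are illustrative, not the agent's actual API):

```python
import difflib
from itertools import zip_longest

FUZZY_THRESHOLD = 0.85  # mirrors the fuzzy_threshold default

def align_lines(reference, extracted, threshold=FUZZY_THRESHOLD):
    """Return (ref_line, ext_line, matched) triples; missing sides are None."""
    triples = []
    matcher = difflib.SequenceMatcher(a=reference, b=extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            triples += [(r, e, True)
                        for r, e in zip(reference[i1:i2], extracted[j1:j2])]
        else:
            # Pair leftover lines positionally; fuzzy-match replaced lines.
            for r, e in zip_longest(reference[i1:i2], extracted[j1:j2]):
                ok = (r is not None and e is not None and
                      difflib.SequenceMatcher(a=r, b=e).ratio() >= threshold)
                triples.append((r, e, ok))
    return triples
```

With this scheme, a near-match like "beta" vs "betta" (similarity ≈ 0.89) stays paired as a spelling discrepancy instead of being reported as one deletion plus one insertion.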

Diff Pipeline Architecture

┌──────────────────┐     ┌──────────────────┐
│  Reference Doc   │────▶│  Line Alignment  │
│  (ground truth)  │     │  Engine          │
└──────────────────┘     └──────────────────┘
        │                        │
        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐
│  Extracted Text  │────▶│  Token-Level     │
│  (OCR output)    │     │  Diff Analyzer   │
└──────────────────┘     └──────────────────┘
        │                        │
        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐
│  Error Classifier│────▶│  Severity Report │
│  & Categorizer   │     │  Generator       │
└──────────────────┘     └──────────────────┘

Configuration

  • fuzzy_threshold (float, default 0.85): Similarity ratio above which two lines are considered matching despite minor differences
  • severity_weights (object, default {critical: 10, major: 3, minor: 1}): Weights used to compute the weighted accuracy score
  • ignore_whitespace (boolean, default false): When true, collapse all whitespace before comparison to focus on content
  • case_sensitive (boolean, default true): Whether character comparisons are case-sensitive
  • context_lines (integer, default 2): Number of surrounding lines included in discrepancy reports for context
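The severity_weights parameter feeds the weighted accuracy metric. The exact formula is not documented here; one plausible reading, sketched in Python with hypothetical names, scales each discrepancy's affected character span by its severity weight:

```python
# Hypothetical sketch only: the agent's actual metric may differ.
SEVERITY_WEIGHTS = {"critical": 10, "major": 3, "minor": 1}

def weighted_accuracy(total_chars, discrepancies, weights=SEVERITY_WEIGHTS):
    """discrepancies: iterable of (severity, affected_char_count) pairs."""
    penalty = sum(weights[sev] * chars for sev, chars in discrepancies)
    # Clamp so a few heavily weighted errors cannot push the score below zero.
    return max(0.0, 1.0 - penalty / total_chars)
```

For example, a single 10-character minor error in a 1000-character document costs 1% under these weights, while the same span flagged critical costs 10%.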

Best Practices

  1. Normalize Line Endings Before Comparison
     Mixed line endings (CRLF vs LF) create phantom discrepancies that obscure real errors. Normalize both documents to the same line-ending format before running the comparison to ensure clean, meaningful diffs.

  2. Separate Content Errors from Formatting Errors
     Mixing content and formatting discrepancies in a single list makes triage difficult. Presenting them in distinct sections allows content reviewers to focus on accuracy while a separate formatting pass handles cosmetic issues, keeping each review focused and efficient.

  3. Quote Both Versions in Every Discrepancy
     Always include the exact text from both the reference and extracted documents side by side. This eliminates ambiguity about what changed and allows reviewers to make corrections without switching between files, which significantly speeds up the review cycle.

  4. Track Accuracy Over Time
     Logging the accuracy percentage from each comparison run creates a historical trend that reveals whether your OCR pipeline is improving or degrading. Sudden drops often correlate with source-material changes (new scanner, different font, lower DPI).

  5. Handle Structural Differences Explicitly
     Merged or split paragraphs are among the hardest discrepancies to detect with simple line-based diffs. Identify structural shifts at the block level first, before drilling into character-level differences, so you do not misclassify a split paragraph as missing content.
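Practices 1 and 5 can be sketched in a few lines of Python (names are illustrative, and difflib stands in for the agent's internal matcher):

```python
import difflib

def normalize_line_endings(text):
    # Practice 1: convert CRLF and lone CR to LF before diffing.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def looks_like_split_paragraph(ref_paragraph, ext_fragments, threshold=0.85):
    # Practice 5: a reference paragraph was likely split (not lost) if
    # rejoining the extracted fragments reproduces it closely.
    joined = " ".join(fragment.strip() for fragment in ext_fragments)
    return difflib.SequenceMatcher(a=ref_paragraph, b=joined).ratio() >= threshold
```

Running the split-paragraph check before reporting missing content keeps a two-line OCR wrap from being escalated to a critical content-loss finding.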

Common Issues

  1. False Positives from Unicode Normalization Differences
     Characters like curly quotes vs straight quotes, em-dashes vs hyphens, or non-breaking spaces vs regular spaces look identical to humans but differ at the byte level. Apply Unicode NFC normalization to both inputs before comparison, or configure the agent to treat typographic variants as equivalent.

  2. Line Alignment Failure on Reordered Sections
     When OCR engines process multi-column or non-linear layouts, entire sections may appear in a different order than in the reference. The alignment engine can misinterpret this as massive insertions and deletions. Use section-header anchoring to realign blocks before performing line-level diffs.

  3. Inflated Error Counts from Repeated Boilerplate
     Headers, footers, and page numbers that repeat across pages can multiply a single OCR error into dozens of reported discrepancies. Deduplicate boilerplate regions before comparison, or group identical repeated errors into a single entry with a count annotation.
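Each of the three issues admits a small preprocessing step. A hedged Python sketch (the variant table, header predicate, and record shapes are assumptions, not the agent's actual interface):

```python
import unicodedata
from collections import Counter

# Issue 1: normalize Unicode and fold common typographic variants.
VARIANTS = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2014": "-", "\u2013": "-",   # em dash / en dash
    "\u00a0": " ",                  # non-breaking space
})

def normalize_for_diff(text):
    return unicodedata.normalize("NFC", text).translate(VARIANTS)

# Issue 2: anchor on section headers so reordered blocks can be realigned.
def split_on_headers(lines, is_header):
    """Map each header line to the block of lines that follows it."""
    sections, current = {}, None
    for line in lines:
        if is_header(line):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

# Issue 3: collapse identical repeated errors into one entry with a count.
def group_repeated_errors(discrepancies):
    """discrepancies: iterable of (reference, extracted) pairs."""
    counts = Counter(discrepancies)
    return [(ref, ext, n) for (ref, ext), n in counts.items()]
```

Comparing section dictionaries keyed by header text sidesteps the reordering problem entirely, and the grouped error tuples carry a count so one bad footer reads as a single annotated entry rather than one entry per page.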
