Text Comparison Assistant
Line-by-line comparison specialist that identifies discrepancies between extracted OCR text and reference documents with categorized severity reporting.
When to Use This Agent
Choose this agent when you need to:
- Perform systematic diff analysis between raw OCR output and a corrected or reference markdown file
- Categorize errors by type (spelling, missing content, formatting) and severity (critical, major, minor)
- Generate an actionable comparison report with line-number references and suggested fixes
- Establish an accuracy percentage baseline for OCR pipeline quality tracking
Consider alternatives when:
- You need to extract text from images initially (use the Visual Analysis Consultant)
- Your comparison involves non-textual content such as charts, diagrams, or photographs
Quick Start
Configuration
name: text-comparison-assistant
type: agent
category: ocr-extraction-team
Example Invocation
claude agent:invoke text-comparison-assistant "Compare raw OCR output in ocr-raw.txt against verified reference doc reference.md"
Example Output
Comparison Summary: ocr-raw.txt vs reference.md
Overall Accuracy: 96.3%
Total Discrepancies: 27
Critical (4): Missing paragraphs on lines 88-91, 204-210
Major (9): Spelling errors: "recieve" → "receive" (line 34), "seperate" → "separate" (line 72) ...
Minor (14): Formatting: bullet marker "•" vs "-" (lines 15, 23, 41), heading level mismatch (line 1)
Detailed Breakdown:
[CRITICAL] Lines 88-91: Entire paragraph absent from extracted text
Reference: "The fiscal quarter ended with a 12% increase..."
Extracted: (missing)
Likely cause: Multi-column layout caused paragraph skip
Core Concepts
Comparison Methodology Overview
| Aspect | Details |
|---|---|
| Alignment Strategy | Longest common subsequence (LCS) based line alignment with fuzzy matching fallback |
| Error Taxonomy | Content (missing/extra/modified), Spelling (substitution/transposition), Formatting (markers/levels), Structural (merge/split) |
| Severity Levels | Critical (content loss), Major (multiple errors per section), Minor (cosmetic formatting) |
| Accuracy Metric | Character-level accuracy weighted by severity, reported as a single percentage |
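The LCS-based line alignment with a fuzzy fallback described above can be sketched with Python's `difflib`. The `align_lines` helper and its tuple output are illustrative, not the agent's actual API; unmatched reference lines pair with `None` to flag missing content:

```python
import difflib

def align_lines(reference, extracted, fuzzy_threshold=0.85):
    """Pair reference lines with extracted lines.

    Exact runs come from difflib's LCS-style opcodes; 'replace' runs get a
    fuzzy second chance so lightly garbled lines still pair up. Unmatched
    reference lines pair with None (missing content) and vice versa.
    """
    pairs = []
    matcher = difflib.SequenceMatcher(None, reference, extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(reference[i1:i2], extracted[j1:j2]))
        elif tag == "replace":
            # Fuzzy fallback: lines similar enough count as modified matches.
            for ref, ext in zip(reference[i1:i2], extracted[j1:j2]):
                ratio = difflib.SequenceMatcher(None, ref, ext).ratio()
                pairs.append((ref, ext) if ratio >= fuzzy_threshold else (ref, None))
        elif tag == "delete":
            pairs.extend((ref, None) for ref in reference[i1:i2])
        elif tag == "insert":
            pairs.extend((None, ext) for ext in extracted[j1:j2])
    return pairs
```

A line with a small OCR typo ("betta") still aligns with its reference ("beta") because its similarity ratio clears the 0.85 threshold, while a completely different line is reported as missing.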
Diff Pipeline Architecture
┌─────────────────┐     ┌─────────────────┐
│ Reference Doc   │────▶│ Line Alignment  │
│ (ground truth)  │     │ Engine          │
└─────────────────┘     └─────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│ Extracted Text  │────▶│ Token-Level     │
│ (OCR output)    │     │ Diff Analyzer   │
└─────────────────┘     └─────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│ Error Classifier│────▶│ Severity Report │
│ & Categorizer   │     │ Generator       │
└─────────────────┘     └─────────────────┘
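The Error Classifier stage can be sketched as a small heuristic over aligned line pairs. The `classify` helper and its category strings below are hypothetical simplifications; the agent's real taxonomy (see the table above) is richer:

```python
import re

# Hypothetical heuristic classifier; list markers cover -, *, bullets, #, >.
_MARKERS = re.compile(r"^[\s\-\*\u2022#>]+")

def classify(ref_line, ext_line):
    """Map one aligned line pair to (error_type, severity), or (None, None)."""
    if ext_line is None:
        return ("content/missing", "critical")
    if ref_line == ext_line:
        return (None, None)
    # Formatting-only: identical once list markers and edge whitespace go.
    if _MARKERS.sub("", ref_line).strip() == _MARKERS.sub("", ext_line).strip():
        return ("formatting/markers", "minor")
    # Equal word counts with differing characters suggest spelling errors.
    if len(ref_line.split()) == len(ext_line.split()):
        return ("spelling/substitution", "major")
    return ("content/modified", "major")
```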
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| fuzzy_threshold | float | 0.85 | Similarity ratio above which two lines are considered matching despite minor differences |
| severity_weights | object | {critical:10, major:3, minor:1} | Weights used to compute the weighted accuracy score |
| ignore_whitespace | boolean | false | When true, collapse all whitespace before comparison to focus on content |
| case_sensitive | boolean | true | Whether character comparisons are case-sensitive |
| context_lines | integer | 2 | Number of surrounding lines included in discrepancy reports for context |
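One plausible way to fold `severity_weights` into the single weighted accuracy percentage. The normalization against an all-critical worst case is an assumption for illustration, not the agent's documented formula:

```python
def weighted_accuracy(total_chars, discrepancies, weights=None):
    """Severity-weighted accuracy sketch.

    Each discrepancy is a (severity, char_count) pair. Erroneous characters
    cost their severity weight; the score normalizes against the worst case
    in which every character were a critical error.
    """
    weights = weights or {"critical": 10, "major": 3, "minor": 1}
    penalty = sum(weights[severity] * chars for severity, chars in discrepancies)
    worst_case = total_chars * weights["critical"]
    return max(0.0, 100.0 * (1 - penalty / worst_case))
```

With this choice of normalization, 20 critical, 10 major, and 5 minor erroneous characters in a 1000-character document score roughly 97.7%.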
Best Practices
- **Normalize Line Endings Before Comparison.** Mixed line endings (CRLF vs LF) create phantom discrepancies that obscure real errors. Normalize both documents to the same line-ending format before running the comparison to ensure clean, meaningful diffs.
- **Separate Content Errors from Formatting Errors.** Mixing content and formatting discrepancies in a single list makes triage difficult. Presenting them in distinct sections allows content reviewers to focus on accuracy while a separate formatting pass handles cosmetic issues, keeping each review focused and efficient.
- **Quote Both Versions in Every Discrepancy.** Always include the exact text from both the reference and extracted documents side by side. This eliminates ambiguity about what changed and allows reviewers to make corrections without switching between files, which significantly speeds up the review cycle.
- **Track Accuracy Over Time.** Logging the accuracy percentage from each comparison run creates a historical trend that reveals whether your OCR pipeline is improving or degrading. Sudden drops often correlate with source material changes (new scanner, different font, lower DPI).
- **Handle Structural Differences Explicitly.** Merged or split paragraphs are among the hardest discrepancies to detect with simple line-based diffs. Identify structural shifts first at the block level before drilling into character-level differences, so you do not misclassify a split paragraph as missing content.
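The line-ending normalization practice above is a one-liner in Python; applying it to both inputs before diffing removes CRLF/LF phantom discrepancies entirely:

```python
def normalize_endings(text):
    """Collapse CRLF and lone CR to LF so diffs compare content, not endings."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```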
Common Issues
- **False positives from Unicode normalization differences.** Characters like curly quotes vs straight quotes, em-dashes vs hyphens, or non-breaking spaces vs regular spaces look identical to humans but differ at the byte level. Apply Unicode NFC normalization to both inputs before comparison, or configure the agent to treat typographic variants as equivalent.
- **Line alignment failure on reordered sections.** When OCR engines process multi-column or non-linear layouts, entire sections may appear in a different order than the reference. The alignment engine can misinterpret this as massive insertions and deletions. Use section-header anchoring to realign blocks before performing line-level diffs.
- **Inflated error counts from repeated boilerplate.** Headers, footers, and page numbers that repeat across pages can multiply a single OCR error into dozens of reported discrepancies. Deduplicate boilerplate regions before comparison or group identical repeated errors into a single entry with a count annotation.
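The Unicode issue above can be addressed with `unicodedata.normalize` plus a small variant-folding table. The `TYPOGRAPHIC` mapping below is an illustrative subset, not an exhaustive list of typographic variants:

```python
import unicodedata

# Illustrative subset of typographic variants folded to ASCII equivalents.
TYPOGRAPHIC = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # non-breaking space
})

def fold(text):
    """NFC-normalize, then map typographic variants to their plain forms."""
    return unicodedata.normalize("NFC", text).translate(TYPOGRAPHIC)
```

Run `fold` on both the reference and the extracted text so that, for example, a curly-quoted phrase compares equal to its straight-quoted OCR rendering.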