Visual Analysis Consultant

Boost productivity with this proactive visual analysis specialist. Includes structured workflows, validation checks, and reusable patterns for the OCR extraction team.

Agent · Cliptics · ocr extraction team · v1.0.0 · MIT

Expert image analysis and OCR extraction agent that converts scanned documents into structured markdown while preserving visual hierarchy, formatting, and semantic meaning.

When to Use This Agent

Choose this agent when you need to:

  • Extract text from scanned documents, photographs of text, or screenshots with high accuracy
  • Convert visually structured content (headings, lists, tables, callouts) into clean markdown
  • Analyze complex document layouts including multi-column pages, nested lists, and embedded code blocks
  • Handle edge cases like rotated text, watermarks, faded ink, or low-resolution scans

Consider alternatives when:

  • You already have digital text and need to compare two versions (use the Text Comparison Assistant)
  • Your primary goal is final quality validation rather than initial extraction (use the OCR Quality Assurance Agent)

Quick Start

Configuration

name: visual-analysis-consultant
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke visual-analysis-consultant "Extract and structure text from scanned-contract-page4.png preserving all headings and table formatting"

Example Output

## Section 4: Payment Terms

| Term | Duration | Rate |
|------|----------|------|
| Initial | 12 months | $2,400/mo |
| Renewal | 6 months | $2,160/mo |

Payment is due on the **first business day** of each calendar month.
Late payments incur a *1.5% monthly* penalty fee.

[LOW CONFIDENCE] Footnote text partially obscured: "Subject to §12.3..."

Core Concepts

Document Analysis Pipeline Overview

| Aspect | Details |
|--------|---------|
| Input Formats | PNG, JPEG, TIFF, PDF page rasters, screenshots |
| Layout Detection | Multi-column recognition, reading-order inference, region segmentation |
| Semantic Mapping | Font size to heading level, weight to emphasis, indentation to list nesting |
| Confidence Reporting | Per-region confidence scores with LOW/MEDIUM/HIGH annotations on uncertain segments |
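
The confidence reporting described above could be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the 0.80 cutoff mirrors the min_confidence default listed under Configuration, while the 0.95 cutoff for HIGH and the function names confidence_label and annotate_region are assumptions introduced here.

```python
def confidence_label(score: float, low: float = 0.80, high: float = 0.95) -> str:
    """Bucket a per-region confidence score into LOW / MEDIUM / HIGH.

    The `low` and `high` cutoffs are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < low:
        return "LOW"
    if score < high:
        return "MEDIUM"
    return "HIGH"


def annotate_region(text: str, score: float) -> str:
    """Prefix uncertain regions with a marker; leave HIGH-confidence text as-is."""
    label = confidence_label(score)
    if label == "HIGH":
        return text
    return f"[{label} CONFIDENCE] {text}"
```

For example, annotate_region("Footnote text partially obscured", 0.42) yields a line prefixed with [LOW CONFIDENCE], matching the marker style in the example output.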

Extraction Architecture

┌─────────────────┐     ┌─────────────────┐
│  Source Image   │────▶│  Layout Region  │
│  Input          │     │  Segmentation   │
└─────────────────┘     └─────────────────┘
        │                       │
        ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│  Reading Order  │────▶│  OCR Character  │
│  Inference      │     │  Recognition    │
└─────────────────┘     └─────────────────┘
        │                       │
        ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│  Semantic Style │────▶│  Markdown       │
│  Classification │     │  Renderer       │
└─────────────────┘     └─────────────────┘

Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| heading_detection | string | font-size | Strategy for heading inference: font-size, weight, or combined |
| min_confidence | float | 0.80 | Characters below this threshold are wrapped in uncertainty markers |
| table_detection | boolean | true | Enable automatic table structure recognition from grid lines or alignment |
| preserve_emphasis | boolean | true | Map bold/italic visual styles to markdown emphasis markers |
| multi_column_mode | string | auto | Column handling: auto (detect), single, dual, or triple |
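
One way to keep these parameters honest in calling code is to validate them before invoking the agent. The sketch below mirrors the names, types, and defaults from the table; the ExtractionConfig class itself is a hypothetical helper, not part of the agent.

```python
from dataclasses import dataclass

# Allowed values taken from the parameter descriptions above.
HEADING_STRATEGIES = {"font-size", "weight", "combined"}
COLUMN_MODES = {"auto", "single", "dual", "triple"}


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical container for the documented parameters and defaults."""

    heading_detection: str = "font-size"
    min_confidence: float = 0.80
    table_detection: bool = True
    preserve_emphasis: bool = True
    multi_column_mode: str = "auto"

    def __post_init__(self) -> None:
        # Reject values the table does not list.
        if self.heading_detection not in HEADING_STRATEGIES:
            raise ValueError(f"unknown heading_detection: {self.heading_detection!r}")
        if self.multi_column_mode not in COLUMN_MODES:
            raise ValueError(f"unknown multi_column_mode: {self.multi_column_mode!r}")
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0 and 1")
```

Constructing ExtractionConfig(multi_column_mode="dual", min_confidence=0.9) succeeds, while a misspelled mode fails early with a ValueError instead of silently falling back to a default.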

Best Practices

  1. Perform a Full-Page Scan Before Extracting Details: Begin with a high-level structural pass to identify all regions (headers, body, sidebars, footnotes) before diving into character-level extraction. This prevents missing entire sections that fall outside the primary reading flow, such as margin notes or pull quotes.

  2. Infer Reading Order from Visual Layout, Not File Order: Rasterized images have no inherent text order. Use column boundaries, vertical positions, and indentation cues to reconstruct the intended reading sequence. Processing the left column fully before the right column is essential for two-column academic papers and newsletters.

  3. Map Visual Hierarchy to Markdown Semantics Consistently: Establish a deterministic mapping between font sizes and heading levels (e.g., 24pt maps to H1, 18pt to H2, 14pt to H3) and apply it uniformly across the document. Inconsistent mapping produces confusing output that undermines downstream processing.

  4. Flag Uncertain Regions Instead of Guessing: When a character or word falls below the confidence threshold, annotate it with a marker rather than silently inserting your best guess. Downstream agents and human reviewers depend on these annotations to prioritize their review effort.

  5. Handle Non-Text Elements Descriptively: Images, diagrams, and decorative elements cannot be extracted as text, but their presence and relationship to surrounding text should be documented. Use descriptive placeholders like [Figure: Bar chart showing Q3 revenue by region] to maintain document completeness.
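
The reading-order rule from practice 2 can be sketched as a sort over region bounding boxes. This is a simplified illustration that assumes equal-width columns and regions that each sit inside one column; the Region type and reading_order function are hypothetical names, and full-width elements such as page titles would need separate handling.

```python
from dataclasses import dataclass


@dataclass
class Region:
    text: str
    x: float      # left edge, in pixels
    y: float      # top edge, in pixels
    width: float  # region width, in pixels


def reading_order(regions: list[Region], page_width: float, columns: int = 2) -> list[Region]:
    """Order regions column by column (left to right), top to bottom within a column."""
    if columns < 1 or page_width <= 0:
        raise ValueError("columns must be >= 1 and page_width must be positive")
    column_width = page_width / columns

    def sort_key(region: Region) -> tuple[int, float, float]:
        # Assign the region to a column by its horizontal center,
        # clamped so slightly out-of-bounds boxes still land in a valid column.
        center = region.x + region.width / 2
        column = max(0, min(int(center // column_width), columns - 1))
        return (column, region.y, region.x)

    return sorted(regions, key=sort_key)


page = [
    Region("R1", x=320, y=50, width=250),
    Region("L1", x=30, y=50, width=250),
    Region("L2", x=30, y=120, width=250),
    Region("R2", x=320, y=120, width=250),
]
print([r.text for r in reading_order(page, page_width=600)])
# prints ['L1', 'L2', 'R1', 'R2']
```

Sorting by vertical position alone would interleave the two columns (L1, R1, L2, R2), which is the failure this practice is meant to avoid.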

Common Issues

  1. Multi-column text merges into a single stream: Without explicit column detection, adjacent columns can interleave line by line, producing garbled output. Leave multi_column_mode on auto or set it explicitly (single, dual, or triple), then verify that the agent correctly identifies column gutters. For documents with irregular column widths, manual region hints may be needed.

  2. Table structure lost when grid lines are faint or absent: Many scanned documents use alignment rather than visible borders to define tables. When table_detection fails on borderless tables, provide a hint about the expected column count or switch to whitespace-based column detection, which uses character spacing patterns to infer boundaries.

  3. Heading levels inconsistent across pages: If the source document uses slightly different font sizes on different pages (common with photocopied or re-scanned materials), the heading level mapping can drift. Calibrate heading detection per page or define explicit pixel-range thresholds to maintain consistency throughout the extraction.
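
Per-page calibration for issue 3 could look like the sketch below, which ranks heading-sized fonts within each page instead of relying on absolute sizes. It assumes the most frequent font size on a page is body text and that larger sizes are headings; the function name calibrate_heading_levels, the tolerance default, and the three-level cap are illustrative assumptions.

```python
from collections import Counter


def calibrate_heading_levels(
    font_sizes: list[float], max_level: int = 3, tolerance: float = 0.5
) -> dict[float, int]:
    """Map each heading-sized font on one page to a heading level by rank."""
    if not font_sizes:
        return {}
    # Assumption: the most common size on the page is body text.
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Sizes clearly larger than body text are heading candidates, largest first.
    candidates = sorted({s for s in font_sizes if s > body_size + tolerance}, reverse=True)

    levels: dict[float, int] = {}
    level = 0
    previous = None
    for size in candidates:
        # Sizes within `tolerance` of the previous one share its level.
        if previous is None or previous - size > tolerance:
            level = min(level + 1, max_level)
        levels[size] = level
        previous = size
    return levels


original_page = [24, 18, 14, 11, 11, 11, 11]
rescanned_page = [23.2, 17.4, 13.5, 10.6, 10.6, 10.6]
print(calibrate_heading_levels(original_page))   # {24: 1, 18: 2, 14: 3}
print(calibrate_heading_levels(rescanned_page))  # {23.2: 1, 17.4: 2, 13.5: 3}
```

Rank-based calibration has a known weakness: a page with no top-level heading would promote its largest heading to level 1. Anchoring on the ratio to body size, or on explicit pixel-range thresholds, avoids that at the cost of more tuning.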
