Document Structure Analyzer Companion

Boost productivity with this document structure analysis specialist. Includes structured workflows, validation checks, and reusable patterns for the OCR extraction team.

Agent · Cliptics · ocr extraction team · v1.0.0 · MIT

Layout analysis and semantic mapping agent that deconstructs document structures into labeled regions, reading orders, and hierarchical content schemas for OCR preprocessing.

When to Use This Agent

Choose this agent when you need to:

  • Analyze complex multi-column layouts before running OCR to improve extraction accuracy
  • Map document hierarchies (headers, subheaders, body text, captions) for structured output
  • Identify and classify visual elements like tables, forms, figures, and sidebars
  • Determine correct reading order for documents with non-linear content flow

Consider alternatives when:

  • Your documents are simple single-column text with no complex layout elements
  • You need post-OCR grammar correction rather than pre-OCR structural analysis (use Specialist OCR Grammar Fixer)

Quick Start

Configuration

name: document-structure-analyzer-companion
type: agent
category: ocr-extraction-team

Example Invocation

claude agent:invoke document-structure-analyzer-companion "Analyze the structure of invoice-batch-042.pdf"

Example Output

=== Document Structure Analysis ===
File: invoice-batch-042.pdf (3 pages)

PAGE 1 REGIONS:
  [HEADER]     Logo + Company Name (confidence: 0.97)
  [TABLE]      Line items table, 5 columns x 12 rows (confidence: 0.94)
  [SIDEBAR]    Payment terms block, right margin (confidence: 0.89)
  [FOOTER]     Page number + legal disclaimer (confidence: 0.96)

READING ORDER: Header → Table → Sidebar → Footer
HIERARCHY: H1(Invoice #) → H2(Bill To, Ship To) → Body(line items)
TEMPLATE MATCH: Standard commercial invoice (92% match)

OCR RECOMMENDATIONS:
  - Process table region with grid-aware extraction
  - Treat sidebar as independent text block
  - Apply deskew correction (2.1° detected)
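
If a downstream script needs the region list from a plain-text report like the one above, a small parser may be enough. The following is a minimal sketch that assumes the `[LABEL]  description (confidence: 0.97)` line shape shown in the example; the `Region` type and `parse_regions` function are hypothetical names, not part of the agent. With `output_format` set to `json`, a JSON parser would be the more natural choice.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    label: str
    description: str
    confidence: float


# Assumes the "[LABEL]  description (confidence: 0.97)" shape from the example.
REGION_LINE = re.compile(
    r"^\s*\[(?P<label>[A-Z_]+)\]\s+(?P<desc>.+?)\s+"
    r"\(confidence:\s*(?P<conf>[01](?:\.\d+)?)\)\s*$"
)


def parse_regions(report: str) -> list[Region]:
    """Collect every region line from a plain-text structure report."""
    regions = []
    for line in report.splitlines():
        match = REGION_LINE.match(line)
        if match:
            regions.append(
                Region(match["label"], match["desc"], float(match["conf"]))
            )
    return regions


sample = """\
PAGE 1 REGIONS:
  [HEADER]     Logo + Company Name (confidence: 0.97)
  [TABLE]      Line items table, 5 columns x 12 rows (confidence: 0.94)
  [SIDEBAR]    Payment terms block, right margin (confidence: 0.89)
  [FOOTER]     Page number + legal disclaimer (confidence: 0.96)
"""

for region in parse_regions(sample):
    print(f"{region.label:<8} {region.confidence:.2f}  {region.description}")
```

Lines that do not match the pattern (section titles, recommendations) are skipped rather than treated as errors, which keeps the parser tolerant of extra report sections.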

Core Concepts

Document Region Types Overview

| Region Type | Details |
| --- | --- |
| Content Blocks | Continuous text regions like paragraphs, headings, and captions |
| Tabular Regions | Structured grid areas including tables, forms, and ledgers |
| Visual Elements | Non-text regions such as images, charts, logos, and diagrams |
| Navigation Markers | Page numbers, headers, footers, and section dividers |

Structure Analysis Pipeline Architecture

┌──────────────┐     ┌──────────────┐
│  Page Image  │────▶│  Region      │
│  Ingestion   │     │  Segmenter   │
└──────────────┘     └──────────────┘
        │                    │
        ▼                    ▼
┌──────────────┐     ┌──────────────┐
│  Reading     │────▶│  Hierarchy   │
│  Order Engine│     │  Mapper      │
└──────────────┘     └──────────────┘
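
The agent's internal algorithms are not documented here, so the following is only a sketch of what a Hierarchy Mapper stage could do: nest a flat, reading-ordered list of headings into a tree using a stack. The `Node` type, the `build_hierarchy` and `render` functions, and the numeric level convention (1 for H1, 2 for H2, 3 for body) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int  # assumed convention: 1 = H1, 2 = H2, larger numbers nest deeper
    title: str
    children: list["Node"] = field(default_factory=list)


def build_hierarchy(items: list[tuple[int, str]]) -> list[Node]:
    """Nest (level, title) pairs, given in reading order, into a tree."""
    roots: list[Node] = []
    stack: list[Node] = []  # path from the current root to the latest node
    for level, title in items:
        node = Node(level, title)
        # Pop until the top of the stack is a strictly shallower ancestor.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


def render(nodes: list[Node], depth: int = 0) -> list[str]:
    """Return one indented line per node, depth first."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node.title)
        lines.extend(render(node.children, depth + 1))
    return lines


# Headings taken from the HIERARCHY line of the example output.
items = [(1, "Invoice #"), (2, "Bill To"), (2, "Ship To"), (3, "line items")]
print("\n".join(render(build_hierarchy(items))))
```

Because each element attaches to the nearest preceding shallower heading, body content lands under the last heading read before it, which is why a correct reading order matters before hierarchy mapping.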

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_confidence | float | 0.80 | Minimum confidence score for a region classification to be included |
| deskew_correction | boolean | true | Automatically detect and correct page rotation before analysis |
| table_detection_mode | string | "grid-aware" | Table detection strategy: grid-aware, line-based, or whitespace |
| max_pages | integer | 50 | Maximum number of pages to analyze in a single invocation |
| output_format | string | "json" | Output format for structure maps: json, yaml, or markdown |
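
When these parameters are assembled in a script before invoking the agent, validating them up front can catch typos early. The sketch below mirrors the documented names, defaults, and allowed values; the `AnalyzerConfig` class itself is hypothetical, and the 0 to 1 range for `min_confidence` is an assumption based on the confidence values in the example output.

```python
from dataclasses import dataclass

TABLE_MODES = {"grid-aware", "line-based", "whitespace"}
OUTPUT_FORMATS = {"json", "yaml", "markdown"}


@dataclass(frozen=True)
class AnalyzerConfig:
    """Hypothetical holder for the documented parameters and their defaults."""

    min_confidence: float = 0.80
    deskew_correction: bool = True
    table_detection_mode: str = "grid-aware"
    max_pages: int = 50
    output_format: str = "json"

    def __post_init__(self) -> None:
        # Assumed range: confidence scores in the example output lie in [0, 1].
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0 and 1")
        if self.table_detection_mode not in TABLE_MODES:
            raise ValueError(
                f"unknown table_detection_mode: {self.table_detection_mode!r}"
            )
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        if self.output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output_format: {self.output_format!r}")


# Override only what differs from the defaults, e.g. for borderless tables.
config = AnalyzerConfig(table_detection_mode="whitespace")
print(config)
```

Making the dataclass frozen keeps a validated configuration from being changed into an invalid one later in the script.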

Best Practices

  1. Run Structure Analysis Before OCR Extraction: Feeding region boundaries and reading order to the OCR engine improves extraction accuracy on complex layouts. Without structure analysis, OCR processes text linearly and mangles multi-column layouts, tables, and sidebars.

  2. Calibrate Confidence Thresholds Per Document Type: Scanned handwritten documents produce lower confidence scores than clean digital PDFs. Lower min_confidence from the default 0.80 to 0.65 for handwritten or degraded inputs to avoid discarding valid but uncertain region detections.

  3. Use Template Matching for Recurring Document Types: If you process the same form or invoice layout repeatedly, save the structure analysis as a template. Future documents that match the template skip the full analysis pipeline and process faster.

  4. Verify Reading Order on Complex Layouts: Multi-column academic papers, magazine spreads, and brochures have non-obvious reading orders. Always review the suggested reading order for these layouts, because incorrect ordering produces incoherent OCR output.

  5. Separate Table Regions for Dedicated Processing: Tables require grid-aware extraction, which differs fundamentally from paragraph text OCR. The structure analyzer marks table boundaries so downstream processors can apply specialized table extraction algorithms.
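
The threshold calibration in practice 2 can be sketched as a lookup from document type to `min_confidence`. In this minimal example the document-type keys, the `filter_regions` helper, and the sample regions are made up for illustration; only the 0.80 default and the 0.65 value for handwritten or degraded inputs come from this page.

```python
DEFAULT_MIN_CONFIDENCE = 0.80  # documented default

# Hypothetical document-type keys; 0.65 is the value suggested above for
# handwritten or degraded inputs.
MIN_CONFIDENCE_BY_TYPE = {
    "handwritten": 0.65,
    "degraded_scan": 0.65,
}


def filter_regions(regions, doc_type):
    """Split regions into (kept, dropped) using the threshold for doc_type."""
    threshold = MIN_CONFIDENCE_BY_TYPE.get(doc_type, DEFAULT_MIN_CONFIDENCE)
    kept = [r for r in regions if r["confidence"] >= threshold]
    dropped = [r for r in regions if r["confidence"] < threshold]
    return kept, dropped


# Made-up sample detections, not real agent output.
regions = [
    {"label": "TABLE", "confidence": 0.91},
    {"label": "SIDEBAR", "confidence": 0.72},
    {"label": "FIGURE", "confidence": 0.58},
]

for doc_type in ("digital_pdf", "handwritten"):
    kept, dropped = filter_regions(regions, doc_type)
    print(doc_type, [r["label"] for r in kept], [r["label"] for r in dropped])
```

Returning the dropped regions alongside the kept ones makes it easier to review what a given threshold discards before committing to it for a document type.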

Common Issues

  1. Sidebar text merged with main body content in reading order: Sidebars positioned close to the main text column may be incorrectly merged with it. Because column_gap_threshold sets the whitespace gap required before adjacent regions are treated as separate columns, lower it so that the narrow gap beside the sidebar is enough to separate the two.

  2. Table detection fails on borderless tables: Tables without visible gridlines require whitespace-based detection. Switch table_detection_mode from "grid-aware" to "whitespace" for documents that use spacing rather than borders to delineate table cells.

  3. Rotated or skewed pages produce misaligned region boundaries: Even with deskew_correction enabled, pages rotated more than 5 degrees may not be fully corrected. Pre-process heavily skewed documents with a dedicated image rotation tool before submitting them for structure analysis.
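
To see how a gap threshold affects whether a sidebar is read as its own column, the sketch below groups bounding boxes by horizontal whitespace. It assumes that `column_gap_threshold` is the minimum gap, in pixels, needed to start a new column; the `Box` type, the `group_columns` function, and the coordinates are hypothetical and do not describe the agent's actual implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    name: str
    left: float
    top: float
    right: float
    bottom: float


def group_columns(boxes: list[Box], column_gap_threshold: float) -> list[list[Box]]:
    """Group boxes into columns separated by a wide enough horizontal gap."""
    columns: list[list[Box]] = []
    column_right = 0.0  # right edge of the column currently being built
    for box in sorted(boxes, key=lambda b: b.left):
        if columns and box.left - column_right < column_gap_threshold:
            columns[-1].append(box)  # gap too narrow: same column
            column_right = max(column_right, box.right)
        else:
            columns.append([box])  # first box, or a wide enough gap
            column_right = box.right
    # Reading order: columns left to right, each column top to bottom.
    return [sorted(column, key=lambda b: b.top) for column in columns]


# Made-up layout: the sidebar sits 12 px to the right of the body column.
boxes = [
    Box("body", 50, 100, 400, 700),
    Box("sidebar", 412, 120, 560, 300),
    Box("heading", 50, 40, 400, 90),
]

for threshold in (20, 10):
    names = [[b.name for b in column] for column in group_columns(boxes, threshold)]
    print(threshold, names)
```

With a threshold of 20 the 12 px gap is too narrow and the sidebar joins the body column; with a threshold of 10 it becomes a separate column. This simple left-to-right sweep does not handle full-width elements that span several columns, which a real segmenter would need to treat separately.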
