

Boost productivity with this skill whenever document processing is needed. Includes structured workflows, validation checks, and reusable patterns for document processing.

SkillCliptics · document processing · v1.0.0 · MIT

Ultimate PDF Anthropic

An advanced skill for PDF processing with Claude and AI-assisted analysis. Covers PDF text extraction, AI-powered document understanding, structured data extraction from unstructured PDFs, and building document processing pipelines that combine OCR with language model analysis.

When to Use This Skill

Choose this skill when:

  • Extracting structured data from complex, unstructured PDF documents
  • Using AI to summarize, analyze, or answer questions about PDF content
  • Processing invoices, contracts, or reports with varying layouts
  • Building document understanding pipelines with OCR and LLM analysis
  • Converting PDF documents to structured formats (JSON, CSV, database records)

Consider alternatives when:

  • Simple text extraction from well-formatted PDFs → use a basic PDF skill
  • Creating PDFs from scratch → use a PDF creation/reportlab skill
  • Merging or splitting PDFs → use a PDF manipulation skill
  • Working with DOCX files → use a DOCX skill

Quick Start

```python
# Extract and analyze PDF with AI assistance
import pdfplumber
import anthropic

def extract_and_analyze(pdf_path: str, query: str) -> str:
    # Step 1: Extract text from all pages
    with pdfplumber.open(pdf_path) as pdf:
        full_text = '\n\n'.join(
            page.extract_text() or '' for page in pdf.pages
        )

    # Step 2: Analyze with Claude
    client = anthropic.Anthropic()
    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=4096,
        messages=[{
            'role': 'user',
            'content': f'Analyze this PDF content and {query}:\n\n{full_text[:50000]}'
        }],
    )
    return response.content[0].text
```

Core Concepts

PDF Processing Pipeline

| Stage | Tool | Purpose |
|---|---|---|
| Extraction | pdfplumber | Text, tables, and metadata |
| OCR Fallback | Tesseract/PaddleOCR | Scanned or image-based pages |
| Cleaning | Custom parser | Remove headers, footers, noise |
| Structuring | AI/LLM analysis | Extract fields, classify sections |
| Validation | Schema validation | Verify extracted data quality |
| Output | JSON/CSV/DB | Structured, queryable format |
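The stage sequence above can be sketched as a thin orchestration layer. The names below (`needs_ocr`, `clean_text`, `run_pipeline`, `min_chars`) are illustrative, not part of any library; in a real pipeline each stage would wrap the tool named in the table (pdfplumber, Tesseract, an LLM call, a schema validator).

```python
# Sketch of the pipeline control flow, using only the stdlib.
# Extraction and OCR themselves are assumed to happen elsewhere;
# this shows how per-page text moves through the stages.

def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """OCR-fallback stage: a page with almost no extractable text
    is likely scanned and should be routed to OCR."""
    return len(page_text.strip()) < min_chars

def clean_text(text: str) -> str:
    """Cleaning stage: drop blank lines and bare page-number lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return '\n'.join(ln for ln in lines if not ln.strip().isdigit())

def run_pipeline(page_texts: list[str]) -> dict:
    """Run Extraction output through OCR detection and Cleaning,
    returning the intermediate state for the later stages."""
    ocr_pages = [i for i, t in enumerate(page_texts) if needs_ocr(t)]
    cleaned = [clean_text(t) for t in page_texts]
    return {'ocr_pages': ocr_pages, 'cleaned': cleaned}
```

The point of keeping the stages separate is that each one can fail or be swapped independently; the OCR check, for instance, is just a length heuristic here and could be replaced with a per-page image-coverage test.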

Structured Data Extraction

```python
import json

import anthropic
from pydantic import BaseModel

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    line_items: list[dict]
    subtotal: float
    tax: float
    total: float

def extract_invoice(pdf_text: str) -> InvoiceData:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=2048,
        messages=[{
            'role': 'user',
            'content': f"""Extract invoice data from this text as JSON.

Fields: vendor_name, invoice_number, date (YYYY-MM-DD),
line_items (array of {{description, quantity, unit_price, total}}),
subtotal, tax, total

Text:
{pdf_text}"""
        }],
    )
    # Validate the model's JSON against the schema before returning
    data = json.loads(response.content[0].text)
    return InvoiceData(**data)
```

Table Extraction with Layout Analysis

```python
import pdfplumber

def extract_tables_with_context(pdf_path: str) -> list[dict]:
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            # Extract tables detected on this page
            tables = page.extract_tables()
            for table in tables:
                if not table or len(table) < 2:
                    continue
                # Get surrounding text for context
                text_above = page.extract_text()
                headers = table[0]
                rows = table[1:]
                results.append({
                    'page': i + 1,
                    'headers': headers,
                    'rows': rows,
                    'context': text_above[:200] if text_above else '',
                })
    return results
```

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| extractionEngine | string | 'pdfplumber' | Engine: pdfplumber, PyPDF2, or pymupdf |
| ocrEnabled | boolean | true | Enable OCR for image-based pages |
| ocrEngine | string | 'tesseract' | OCR: tesseract, paddleocr, or easyocr |
| aiModel | string | 'claude-sonnet-4-20250514' | AI model for content analysis |
| maxPages | number | 100 | Maximum pages to process |
| chunkSize | number | 50000 | Max characters per AI analysis request |

Best Practices

  1. Extract text first, then analyze with AI — Use pdfplumber for text extraction (fast, cheap) and send the extracted text to Claude for analysis (slower, costs tokens). Don't send PDF images to vision models unless text extraction fails.

  2. Chunk large documents for AI analysis — Claude has context limits. Split large PDFs into logical sections (chapters, pages) and analyze each chunk. Merge results with a final synthesis pass.

  3. Use structured output schemas with Pydantic — Define the expected output structure as a Pydantic model. This validates AI-extracted data and catches missing or malformed fields before they reach downstream systems.

  4. Implement OCR fallback for scanned documents — Check if extracted text is empty or very short relative to page count. If text extraction yields nothing, convert pages to images and run OCR.

  5. Cache extracted text for repeated analysis — PDF text extraction is deterministic. Cache the extracted text so multiple analysis queries don't re-process the same PDF. Store extracted text alongside the original file.
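Practice 2 (chunking) can be sketched as a paragraph-aware splitter. `chunk_text` is a hypothetical helper, and the 50,000-character default simply matches the `chunkSize` parameter above.

```python
def chunk_text(text: str, max_chars: int = 50000) -> list[str]:
    """Split text into chunks under max_chars, breaking on blank
    lines so paragraphs are never cut mid-way."""
    paragraphs = text.split('\n\n')
    chunks, current = [], ''
    for para in paragraphs:
        candidate = f'{current}\n\n{para}' if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph becomes its own chunk
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the model separately, with a final synthesis pass over the per-chunk answers.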

Common Issues

AI hallucinates values not present in the PDF — When the PDF is unclear or values are missing, AI may generate plausible but incorrect data. Always validate extracted data against the original text. Add confidence scores and flag low-confidence extractions for human review.
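One lightweight grounding check is to test whether each extracted string value literally appears in the source text. `grounded_fields` is an illustrative helper, not part of any library; a real pipeline would add fuzzier matching for reformatted numbers and dates.

```python
def grounded_fields(extracted: dict, source_text: str) -> dict[str, bool]:
    """Return, per field, whether the extracted value occurs verbatim
    (case-insensitively) in the source text. A False entry means the
    model may have hallucinated the value, so the record should be
    flagged for human review."""
    normalized = source_text.lower()
    return {
        key: str(value).lower() in normalized
        for key, value in extracted.items()
    }
```

A simple policy is to accept records only when every field is grounded, and queue the rest for review.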

Tables span multiple pages and extraction splits them — pdfplumber extracts tables per-page. Tables that span page breaks produce two incomplete tables. Detect continuation patterns (repeated headers, row numbering) and merge split tables programmatically.
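A sketch of the continuation-merge idea, assuming tables were extracted per page into dicts like those produced by `extract_tables_with_context` above. The repeated-header heuristic is intentionally simple; `merge_split_tables` is a hypothetical name.

```python
def merge_split_tables(tables: list[dict]) -> list[dict]:
    """Merge tables whose headers repeat on consecutive pages, a
    common sign that one table was split across a page break.
    Each input dict needs 'page', 'headers', and 'rows' keys."""
    merged: list[dict] = []
    for table in tables:
        prev = merged[-1] if merged else None
        if (prev is not None
                and table['headers'] == prev['headers']
                and table['page'] == prev['page'] + 1):
            # Continuation: append rows and advance the page marker
            prev['rows'].extend(table['rows'])
            prev['page'] = table['page']
        else:
            merged.append({**table, 'rows': list(table['rows'])})
    return merged
```

Row-numbering continuity (the last row number on one page followed by the next number on the following page) is another signal worth checking when headers are not repeated.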

PDF with mixed text and scanned pages — Some PDFs have machine-readable text on some pages and scanned images on others. Check text content per page and apply OCR selectively only to pages where text extraction returns empty results.
