PDF Official Toolkit

Battle-tested skill for comprehensive PDF manipulation and extraction. Includes structured workflows, validation checks, and reusable patterns for document processing.

Skill · Cliptics · document processing · v1.0.0 · MIT

A standardized skill for PDF processing following best practices. Covers the complete PDF lifecycle from reading and extraction through creation, manipulation, and format conversion with production-ready patterns.

When to Use This Skill

Choose this skill when:

  • Building production PDF processing pipelines
  • Extracting content with proper error handling and validation
  • Creating professional PDF reports and documents
  • Implementing PDF manipulation workflows (merge, split, watermark)
  • Converting between PDF and other document formats

Consider alternatives when:

  • Need AI-assisted PDF analysis → use a PDF Anthropic skill
  • Working with DOCX files → use a DOCX skill
  • Need web-based PDF rendering → use PDF.js
  • Creating spreadsheets → use an XLSX skill

Quick Start

    # Production PDF extraction that flags pages likely to need OCR
    from pathlib import Path

    import pdfplumber


    def robust_extract(pdf_path: str) -> dict:
        """Extract metadata, text, and tables; flag empty pages for OCR."""
        path = Path(pdf_path)
        if not path.exists():
            raise FileNotFoundError(f"PDF not found: {pdf_path}")

        result = {'pages': [], 'metadata': {}, 'tables': []}

        with pdfplumber.open(pdf_path) as pdf:
            result['metadata'] = pdf.metadata or {}

            for i, page in enumerate(pdf.pages):
                page_data = {
                    'number': i + 1,
                    'text': page.extract_text() or '',
                    'tables': page.extract_tables() or [],
                    'dimensions': {
                        'width': float(page.width),
                        'height': float(page.height),
                    },
                }

                # No text and no tables usually means a scanned, image-only page
                if not page_data['text'] and not page_data['tables']:
                    page_data['needs_ocr'] = True

                result['pages'].append(page_data)
                result['tables'].extend(page_data['tables'])

        return result
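The `needs_ocr` flag set by `robust_extract` can drive a follow-up step. The sketch below only inspects the returned dictionary; the helper name `pages_needing_ocr` is hypothetical and not part of the toolkit, and the sample data is hand-built to mirror the structure shown above.

```python
def pages_needing_ocr(result: dict) -> list[int]:
    """Return the 1-based page numbers that robust_extract flagged with needs_ocr."""
    return [
        page['number']
        for page in result.get('pages', [])
        if page.get('needs_ocr')
    ]


# Hand-built sample shaped like the dictionary robust_extract returns.
sample = {
    'pages': [
        {'number': 1, 'text': 'Invoice 42', 'tables': []},
        {'number': 2, 'text': '', 'tables': [], 'needs_ocr': True},
    ],
    'metadata': {},
    'tables': [],
}

print(pages_needing_ocr(sample))
```

Pages returned by this helper are candidates for an OCR pass with whichever OCR tool the pipeline uses.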

Core Concepts

PDF Processing Capabilities

| Capability | Library | Method |
| --- | --- | --- |
| Text extraction | pdfplumber | page.extract_text() |
| Table extraction | pdfplumber | page.extract_tables() |
| Image extraction | pdfplumber / pymupdf | page.images / get_pixmap() |
| Metadata reading | PyPDF2 | reader.metadata |
| Page merging | PyPDF2 | writer.add_page() |
| Page splitting | PyPDF2 | Iterate reader.pages |
| Watermarking | reportlab + PyPDF2 | Overlay watermark page |
| Form filling | PyPDF2 / pdfrw | Update form field values |
| Encryption | PyPDF2 | writer.encrypt() |
| PDF creation | reportlab | SimpleDocTemplate |
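As an illustration of the "Page splitting" row, here is a minimal sketch that splits a PDF into fixed-size chunks by iterating `reader.pages`. The function names `chunk_ranges` and `split_pdf`, the `chunk_size` parameter, and the output file naming are assumptions made for this example. PyPDF2 is imported lazily so the page-range helper can be used and tested without it.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[range]:
    """Return 0-based page index ranges covering total_pages in chunks."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    return [
        range(start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(input_path: str, output_dir: str, chunk_size: int = 10) -> list[str]:
    """Write one PDF per chunk of pages and return the output paths."""
    from PyPDF2 import PdfReader, PdfWriter  # imported lazily on purpose

    reader = PdfReader(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pages in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <stem>_<first>-<last>.pdf (1-based pages)
        out_path = out_dir / f'{Path(input_path).stem}_{pages.start + 1}-{pages.stop}.pdf'
        with open(out_path, 'wb') as f:
            writer.write(f)
        written.append(str(out_path))
    return written
```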

Watermarking Pattern

    from io import BytesIO

    from PyPDF2 import PdfReader, PdfWriter
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas


    def add_watermark(input_path: str, output_path: str, text: str = 'CONFIDENTIAL'):
        # Create watermark
        buffer = BytesIO()
        c = canvas.Canvas(buffer, pagesize=A4)
        c.setFont('Helvetica', 50)
        c.setFillAlpha(0.15)
        c.setFillColorRGB(0.5, 0.5, 0.5)
        c.translate(A4[0]/2, A4[1]/2)
        c.rotate(45)
        c.drawCentredString(0, 0, text)
        c.save()
        buffer.seek(0)

        watermark = PdfReader(buffer)
        reader = PdfReader(input_path)
        writer = PdfWriter()

        for page in reader.pages:
            page.merge_page(watermark.pages[0])
            writer.add_page(page)

        with open(output_path, 'wb') as f:
            writer.write(f)
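The watermark pattern assumes an A4 canvas and a fixed 45-degree rotation, so on pages of other sizes the text may not sit on the diagonal. One possible adjustment, sketched here as an assumption rather than as part of the toolkit, is to derive the centre point and angle from the page dimensions. The helper name `diagonal_placement` is hypothetical.

```python
import math


def diagonal_placement(page_width: float, page_height: float) -> tuple[float, float, float]:
    """Return (center_x, center_y, angle_degrees) for text along the page diagonal."""
    angle = math.degrees(math.atan2(page_height, page_width))
    return page_width / 2, page_height / 2, angle


# A4 in PDF points (595.28 x 841.89): the diagonal is steeper than 45 degrees.
cx, cy, angle = diagonal_placement(595.28, 841.89)
print(round(cx, 2), round(cy, 2), round(angle, 1))
```

In the pattern above, these values would take the place of `A4[0]/2`, `A4[1]/2`, and `45`, with the width and height read from each input page (for example from its media box) and a watermark canvas created per distinct page size.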

Batch Processing Pipeline

    import json
    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    from pathlib import Path


    def process_one(pdf_path: Path, output_dir: str, operations: list[str]) -> dict:
        """Process a single PDF and report its status.

        Defined at module level because ProcessPoolExecutor must pickle the
        callable; a function nested inside another function cannot be pickled.
        """
        try:
            result = robust_extract(str(pdf_path))  # defined in Quick Start
            out_dir = Path(output_dir)

            if 'extract_text' in operations:
                text = '\n'.join(p['text'] for p in result['pages'])
                (out_dir / f'{pdf_path.stem}.txt').write_text(text, encoding='utf-8')

            if 'extract_tables' in operations:
                tables_path = out_dir / f'{pdf_path.stem}_tables.json'
                tables_path.write_text(json.dumps(result['tables'], indent=2), encoding='utf-8')

            return {'file': pdf_path.name, 'status': 'success', 'pages': len(result['pages'])}
        except Exception as e:
            # Record the failure and let the remaining files continue
            return {'file': pdf_path.name, 'status': 'error', 'error': str(e)}


    def batch_process_pdfs(input_dir: str, output_dir: str, operations: list[str]):
        """Process multiple PDFs with specified operations."""
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        pdf_files = list(Path(input_dir).glob('*.pdf'))

        # Bind the shared arguments so executor.map only varies the PDF path
        worker = partial(process_one, output_dir=output_dir, operations=operations)

        with ProcessPoolExecutor(max_workers=4) as executor:
            results = list(executor.map(worker, pdf_files))

        return results
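The list returned by `batch_process_pdfs` holds one status dictionary per file. A small reporting helper, sketched below, can condense it for logging. The name `summarize_results` and the shape of the summary are assumptions for illustration; only the `file` and `status` keys come from the pipeline above.

```python
from collections import Counter


def summarize_results(results: list[dict]) -> dict:
    """Count successes and errors and list the files that failed."""
    counts = Counter(r['status'] for r in results)
    return {
        'total': len(results),
        'success': counts.get('success', 0),
        'error': counts.get('error', 0),
        'failed_files': [r['file'] for r in results if r['status'] == 'error'],
    }


# Hand-written sample in the shape returned by batch_process_pdfs.
sample_results = [
    {'file': 'a.pdf', 'status': 'success', 'pages': 3},
    {'file': 'b.pdf', 'status': 'error', 'error': 'corrupt file'},
    {'file': 'c.pdf', 'status': 'success', 'pages': 12},
]
print(summarize_results(sample_results))
```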

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extractionLib | string | 'pdfplumber' | Text extraction: pdfplumber or pymupdf |
| manipulationLib | string | 'PyPDF2' | PDF manipulation: PyPDF2 or pypdf |
| creationLib | string | 'reportlab' | PDF creation: reportlab or fpdf2 |
| ocrFallback | boolean | true | Enable OCR for image-only pages |
| parallelWorkers | number | 4 | Concurrent workers for batch processing |
| maxFileSize | number | 100 | Max PDF size in MB |
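How the skill loads these settings is not specified here. As one possible representation, the sketch below models the table as a frozen dataclass with the same parameter names and defaults, plus basic validation. The class name `ToolkitConfig` and the `allows_size` method are hypothetical.

```python
from dataclasses import dataclass

# Allowed values as listed in the configuration table.
_ALLOWED = {
    'extractionLib': {'pdfplumber', 'pymupdf'},
    'manipulationLib': {'PyPDF2', 'pypdf'},
    'creationLib': {'reportlab', 'fpdf2'},
}


@dataclass(frozen=True)
class ToolkitConfig:
    extractionLib: str = 'pdfplumber'
    manipulationLib: str = 'PyPDF2'
    creationLib: str = 'reportlab'
    ocrFallback: bool = True
    parallelWorkers: int = 4
    maxFileSize: int = 100  # megabytes

    def __post_init__(self):
        # Reject library names that the table does not list.
        for name, allowed in _ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f'{name} must be one of {sorted(allowed)}, got {value!r}')
        if self.parallelWorkers < 1:
            raise ValueError('parallelWorkers must be at least 1')
        if self.maxFileSize <= 0:
            raise ValueError('maxFileSize must be positive')

    def allows_size(self, size_bytes: int) -> bool:
        """Return True if a file of size_bytes is within maxFileSize."""
        return size_bytes <= self.maxFileSize * 1024 * 1024
```

A pipeline could call `allows_size(path.stat().st_size)` before opening a file and skip or report anything over the limit.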

Best Practices

  1. Use pdfplumber for extraction, PyPDF2 for manipulation, reportlab for creation — Each library excels at its specialty. Using the right tool for each operation produces the best results and simplest code.

  2. Implement a processing pipeline with error recovery — Production PDF processing encounters corrupt files, password-protected PDFs, and encoding issues. Wrap each step in try/except, log failures, and continue processing remaining files.

  3. Validate extracted data before downstream use — PDF text extraction is lossy. Verify that extracted amounts, dates, and identifiers match expected patterns. Flag anomalies for manual review rather than processing incorrect data silently.

  4. Process pages as streams for large PDFs — Loading all pages of a 500-page PDF into memory simultaneously exhausts RAM. Process pages one at a time and write output incrementally.

  5. Test with diverse PDF sources — PDFs created by different applications (Word, LaTeX, web browsers, scanners) have different internal structures. Test your extraction code with PDFs from all expected sources.
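To make practice 3 concrete, the following sketch checks extracted fields against expected patterns and returns the names of fields that should go to manual review. The field names `amount` and `date`, the accepted formats (two-decimal amounts, ISO dates), and the function names are assumptions chosen for illustration; real documents will need their own rules.

```python
import re
from datetime import datetime

# Amounts such as "1,234.50" or "1234.50" (assumed format for this example).
AMOUNT_RE = re.compile(r'(?:\d{1,3}(?:,\d{3})*|\d+)\.\d{2}')


def is_valid_amount(text: str) -> bool:
    return bool(AMOUNT_RE.fullmatch(text.strip()))


def is_valid_iso_date(text: str) -> bool:
    """Accept only real calendar dates written as YYYY-MM-DD."""
    try:
        datetime.strptime(text.strip(), '%Y-%m-%d')
        return True
    except ValueError:
        return False


def flag_anomalies(record: dict) -> list[str]:
    """Return the names of fields that fail validation."""
    checks = {'amount': is_valid_amount, 'date': is_valid_iso_date}
    return [
        field
        for field, check in checks.items()
        if not check(str(record.get(field, '')))
    ]


print(flag_anomalies({'amount': '1,234.50', 'date': '2024-02-30'}))
```

Records with a non-empty result would be queued for review rather than passed downstream.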

Common Issues

Password-protected PDFs fail with an error: PyPDF2 raises PdfReadError when pages of an encrypted PDF are accessed before decryption. Check reader.is_encrypted first and attempt decryption with a provided password before processing.
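A minimal sketch of that check follows. It relies only on the `is_encrypted` attribute and the `decrypt()` method of a PyPDF2 `PdfReader`, and it assumes `decrypt()` returns a falsy value when the password is wrong. The function name `ensure_decrypted` is hypothetical.

```python
def ensure_decrypted(reader, password=None):
    """Return the reader ready for use, or raise ValueError if it stays locked.

    `reader` is expected to be a PyPDF2 PdfReader (or any object exposing
    is_encrypted and decrypt()).
    """
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise ValueError('PDF is encrypted and no password was provided')
    # Assumption: decrypt() returns a falsy value (0) on a wrong password.
    if not reader.decrypt(password):
        raise ValueError('Incorrect password for encrypted PDF')
    return reader
```

Typical use would be `reader = ensure_decrypted(PdfReader(path), password)` before any page access, with the `ValueError` caught and logged by the surrounding pipeline.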

Text extraction returns text in the wrong order: PDFs are not required to store text in reading order. pdfplumber uses spatial analysis but can fail with complex multi-column layouts. Sort extracted text blocks by position (top-to-bottom, left-to-right) for more natural ordering.
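The positional sort can be sketched as follows. It works on word dictionaries with `top`, `x0`, and `text` keys, which is the shape pdfplumber's `page.extract_words()` produces. The `line_tolerance` parameter and the function name are assumptions, and this simple version does not handle multi-column layouts, which would need column detection first.

```python
def sort_reading_order(words: list[dict], line_tolerance: float = 3.0) -> list[dict]:
    """Sort words top-to-bottom, then left-to-right within each line.

    Words whose `top` is within line_tolerance of the first word on the
    current line are treated as belonging to that line.
    """
    lines = []  # each entry: (top of first word on the line, [words])
    for word in sorted(words, key=lambda w: w['top']):
        if lines and abs(word['top'] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word['top'], [word]))

    ordered = []
    for _, line_words in lines:
        ordered.extend(sorted(line_words, key=lambda w: w['x0']))
    return ordered


# Hand-built sample: "world" sits slightly lower than "Hello" on the same line.
words = [
    {'text': 'world', 'x0': 50, 'top': 10.5},
    {'text': 'Next', 'x0': 10, 'top': 30},
    {'text': 'Hello', 'x0': 10, 'top': 10},
]
print(' '.join(w['text'] for w in sort_reading_order(words)))
```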

Generated PDFs are very large: embedded images without compression inflate file size. Compress images before embedding, using JPEG for photos and PNG for diagrams. Set image resolution to 150 DPI for screen viewing and 300 DPI for print.
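The DPI guidance translates into pixel dimensions with simple arithmetic: pixels equal inches on the page multiplied by DPI. The helpers below are an illustrative sketch with hypothetical names. They compute the target size and the scale factor to apply before embedding, and they never upscale.

```python
def target_pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions needed to show an image at the given size and DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def downscale_factor(current_px: tuple[int, int], target_px: tuple[int, int]) -> float:
    """Scale factor that fits current_px inside target_px without upscaling."""
    return min(1.0, target_px[0] / current_px[0], target_px[1] / current_px[1])


# A photo placed at 6 x 4 inches for screen viewing (150 DPI).
target = target_pixel_size(6, 4, 150)
print(target, downscale_factor((3600, 2400), target))
```

The resulting factor can be passed to whichever image library performs the resize before the image is embedded.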
