PDF System

A battle-tested skill for comprehensive PDF manipulation and text extraction. Includes structured workflows, validation checks, and reusable patterns for document processing.

Skill · Cliptics · document processing · v1.0.0 · MIT

A practical skill for PDF processing including creation, text extraction, merging, splitting, form filling, and conversion. Covers both Python and command-line approaches for common PDF operations.

When to Use This Skill

Choose this skill when:

  • Extracting text and metadata from PDF documents
  • Creating PDF reports with tables, charts, and formatting
  • Merging, splitting, or rotating PDF files
  • Filling PDF forms programmatically
  • Converting between PDF and other formats (images, HTML, DOCX)

Consider alternatives when:

  • Working with DOCX files → use a DOCX skill
  • Creating presentations → use a PPTX skill
  • Building a web-based PDF viewer → use a PDF.js skill
  • Running OCR on scanned documents → use an OCR skill

Quick Start

```bash
# Install PDF tools
pip install PyPDF2 reportlab pdfplumber

# CLI tools
brew install poppler  # pdftotext, pdfimages
brew install qpdf     # merge, split, encrypt
```

```python
# Extract text from PDF
import pdfplumber


def extract_text(pdf_path: str) -> str:
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text() or ''
            text += '\n\n'
        return text.strip()


# Extract tables from PDF
def extract_tables(pdf_path: str) -> list:
    with pdfplumber.open(pdf_path) as pdf:
        all_tables = []
        for page in pdf.pages:
            tables = page.extract_tables()
            all_tables.extend(tables)
        return all_tables
```

Core Concepts

PDF Operations Matrix

| Operation | Python Library | CLI Tool |
|---|---|---|
| Read text | pdfplumber, PyPDF2 | pdftotext (poppler) |
| Read tables | pdfplumber, camelot | tabula-java |
| Create PDF | reportlab, fpdf2 | wkhtmltopdf |
| Merge PDFs | PyPDF2 | qpdf, pdfunite |
| Split PDF | PyPDF2 | qpdf |
| Form filling | PyPDF2, pdfrw | pdftk |
| Encrypt | PyPDF2 | qpdf |
| Images to PDF | Pillow + reportlab | img2pdf |
| PDF to images | pdf2image | pdftoppm |
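
The CLI column can also be driven from Python. The sketch below wraps poppler's `pdftotext` with `subprocess`, assuming the binary is on `PATH`. The helper names `build_pdftotext_cmd` and `pdftotext_extract` are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None,
                        layout: bool = True) -> List[str]:
    """Build the argument list for poppler's pdftotext, writing to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # preserve the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout instead of a file
    return cmd


def pdftotext_extract(pdf_path: str, **options) -> str:
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Keeping the command construction separate from the call makes the flags easy to inspect or log before anything is executed.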

PDF Report Generation

```python
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer


def create_report(output_path: str, title: str, data: list[list]):
    doc = SimpleDocTemplate(output_path, pagesize=A4)
    styles = getSampleStyleSheet()
    elements = []

    # Title
    elements.append(Paragraph(title, styles['Title']))
    elements.append(Spacer(1, 20))

    # Table
    table = Table(data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTSIZE', (0, 0), (-1, 0), 12),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
    ]))
    elements.append(table)

    doc.build(elements)


# Usage
create_report('report.pdf', 'Sales Report Q1', [
    ['Product', 'Units', 'Revenue'],
    ['Widget A', '1,200', '$36,000'],
    ['Widget B', '800', '$24,000'],
])
```

PDF Manipulation

```python
from pathlib import Path

from PyPDF2 import PdfReader, PdfWriter


def merge_pdfs(paths: list[str], output: str):
    """Concatenate the pages of every input PDF, in order, into one file."""
    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    with open(output, 'wb') as f:
        writer.write(f)


def split_pdf(path: str, output_dir: str):
    """Write each page to its own file: page_1.pdf, page_2.pdf, ..."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f'{output_dir}/page_{i+1}.pdf', 'wb') as f:
            writer.write(f)


def encrypt_pdf(path: str, password: str, output: str):
    """Copy every page into a new PDF protected by a user password."""
    reader = PdfReader(path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(password)
    with open(output, 'wb') as f:
        writer.write(f)
```

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| extractionEngine | string | 'pdfplumber' | Text extraction: pdfplumber, PyPDF2, or poppler |
| creationEngine | string | 'reportlab' | PDF creation: reportlab or fpdf2 |
| pageSize | string | 'A4' | Default page size: A4, Letter, or Legal |
| ocrFallback | boolean | false | Use OCR when text extraction fails |
| imageFormat | string | 'png' | Format for PDF-to-image conversion |
| imageDPI | number | 300 | DPI for image conversion |
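
The source does not say how these parameters are loaded, so the following is only a sketch of one way to merge user overrides with the defaults from the table and reject unsupported values. `DEFAULTS`, `ALLOWED_VALUES`, and `load_config` are hypothetical names, not part of the skill.

```python
from typing import Optional

# Defaults copied from the configuration table.
DEFAULTS = {
    "extractionEngine": "pdfplumber",
    "creationEngine": "reportlab",
    "pageSize": "A4",
    "ocrFallback": False,
    "imageFormat": "png",
    "imageDPI": 300,
}

# Only the parameters whose allowed values the table enumerates.
ALLOWED_VALUES = {
    "extractionEngine": {"pdfplumber", "PyPDF2", "poppler"},
    "creationEngine": {"reportlab", "fpdf2"},
    "pageSize": {"A4", "Letter", "Legal"},
}


def load_config(overrides: Optional[dict] = None) -> dict:
    """Return the defaults merged with overrides, validating each value."""
    config = {**DEFAULTS, **(overrides or {})}
    unknown = set(config) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    for key, allowed in ALLOWED_VALUES.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    if not isinstance(config["ocrFallback"], bool):
        raise TypeError("ocrFallback must be a boolean")
    dpi = config["imageDPI"]
    # bool is a subclass of int, so it is rejected explicitly
    if isinstance(dpi, bool) or not isinstance(dpi, (int, float)) or dpi <= 0:
        raise ValueError("imageDPI must be a positive number")
    return config
```

For example, `load_config({"pageSize": "Letter", "imageDPI": 150})` keeps every other default, while `load_config({"pageSize": "A3"})` raises `ValueError`.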

Best Practices

  1. Use pdfplumber for text extraction, not PyPDF2 — pdfplumber handles complex layouts, multi-column text, and tables significantly better than PyPDF2's basic text extraction. PyPDF2 is better for structural manipulation (merge, split, encrypt).

  2. Handle scanned PDFs with OCR fallback — Many PDFs are scanned images with no extractable text. Check if extracted text is empty and fall back to Tesseract OCR: pytesseract.image_to_string(page_image).

  3. Use reportlab for complex PDF creation, fpdf2 for simple ones — reportlab offers full layout control with tables, charts, and precise positioning. fpdf2 is simpler for text-heavy documents. Choose based on complexity requirements.

  4. Process large PDFs page by page, not all at once — Loading a 1000-page PDF entirely into memory can exhaust RAM. Process pages iteratively using generators and write output incrementally.

  5. Validate PDF output by opening in multiple viewers — PDFs that render correctly in one viewer may break in another. Test generated PDFs in Adobe Reader, Chrome's built-in viewer, and Preview (macOS) to ensure compatibility.
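
Practices 2 and 4 combine naturally: a generator can yield one page at a time and fall back to OCR only for pages with no usable text. The sketch below assumes pdfplumber and pytesseract are installed, along with the Tesseract binary. The 20-character threshold and the names `needs_ocr`, `ocr_page`, and `iter_page_text` are illustrative assumptions.

```python
from typing import Iterator, Tuple


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scan."""
    return len(text.strip()) < min_chars


def ocr_page(page, resolution: int = 300) -> str:
    """Render a pdfplumber page to an image and run Tesseract on it."""
    import pytesseract  # lazy import; also requires the Tesseract binary

    image = page.to_image(resolution=resolution).original  # PIL image
    return pytesseract.image_to_string(image)


def iter_page_text(pdf_path: str, ocr_fallback: bool = False) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) one page at a time."""
    import pdfplumber  # lazy import so needs_ocr() works without it

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if ocr_fallback and needs_ocr(text):
                text = ocr_page(page)
            yield number, text
```

A caller can stream the result to disk with `for number, text in iter_page_text("scan.pdf", ocr_fallback=True): out.write(text)`, so the full document text never has to be held in memory at once.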

Common Issues

Text extraction returns garbled characters — The PDF may use custom font encoding or non-standard character mappings. Try different extraction libraries (pdfplumber vs poppler's pdftotext). Some PDFs require OCR even though they appear to contain text.
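
One way to decide when to switch extractors is to measure how much of the output looks like a decoding failure. The sketch below counts `(cid:N)` placeholders (which pdfminer-based extractors emit for glyphs they cannot map), Unicode replacement characters, and private-use or control code points. The 10% threshold and the function names are assumptions to tune per corpus.

```python
import re
import unicodedata

CID_PATTERN = re.compile(r"\(cid:\d+\)")
SUSPECT_CATEGORIES = {"Co", "Cn", "Cc"}  # private use, unassigned, control


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like decoding failures."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(text))
    suspect = sum(
        1 for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPECT_CATEGORIES
    )
    return (cid_chars + suspect) / len(visible)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """True when enough of the text is suspect to justify another extractor or OCR."""
    return garbled_ratio(text) > threshold
```

When `looks_garbled(text)` is true, retrying with `pdftotext` or falling back to OCR is usually more productive than trying to repair the text afterwards.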

Merged PDF has incorrect page orientation — Source PDFs may have different page sizes or orientations. Check and normalize page dimensions before merging, or rotate pages to match a target orientation.
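
A minimal sketch of the rotate-to-match approach, assuming PyPDF2 3.x (where pages expose `mediabox`, `rotation`, and `rotate()`) and a portrait target orientation. `is_landscape` and `merge_as_portrait` are hypothetical helper names. Page sizes are left as they are; only the orientation is normalized.

```python
def is_landscape(width: float, height: float, rotation: int = 0) -> bool:
    """True if the page displays wider than tall once its rotation is applied."""
    if rotation % 180 == 90:  # 90 or 270 degrees swaps the visible sides
        width, height = height, width
    return width > height


def merge_as_portrait(paths: list, output: str) -> None:
    """Merge PDFs, rotating any landscape page by 90 degrees first."""
    from PyPDF2 import PdfReader, PdfWriter  # lazy import; assumes PyPDF2 3.x

    writer = PdfWriter()
    for path in paths:
        for page in PdfReader(path).pages:
            box = page.mediabox
            if is_landscape(float(box.width), float(box.height), page.rotation):
                page.rotate(90)  # clockwise; adds to the existing rotation
            writer.add_page(page)
    with open(output, "wb") as f:
        writer.write(f)
```

Rotating changes only how the page is displayed, so content that was laid out for landscape will appear sideways. Whether that is acceptable depends on the documents being merged.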

Table extraction misaligns columns — PDF tables don't have true table structure — they're just positioned text. pdfplumber's extract_tables() uses heuristics that can fail on complex layouts. Specify explicit table boundaries with table_settings for better accuracy.
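
As a sketch of what explicit boundaries can look like, the helper below crops the page to the table's bounding box and passes `table_settings` to pdfplumber. The coordinates are supplied by the caller, for example from a visual inspection of the page, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def build_table_settings(column_edges: Optional[Sequence[float]] = None) -> dict:
    """Return pdfplumber table_settings, using explicit column edges if given."""
    if column_edges:
        return {
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": sorted(column_edges),  # x positions in points
            "horizontal_strategy": "text",
        }
    # No ruling lines known: infer both axes from text alignment.
    return {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_table_in_region(pdf_path: str, page_number: int, bbox: tuple,
                            column_edges: Optional[Sequence[float]] = None):
    """Extract the largest table inside bbox = (x0, top, x1, bottom)."""
    import pdfplumber  # lazy import so build_table_settings() works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # page_number is 1-based
        region = page.crop(bbox)
        return region.extract_table(table_settings=build_table_settings(column_edges))
```

Cropping first keeps headers, footers, and neighbouring text out of the heuristics, which often matters more than the choice of strategy.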
