Dynamic PDF Toolkit

A comprehensive skill for extracting text and tables from PDF files and for merging and annotating them. Built for Claude Code with best practices and real-world patterns.

Skill · Community · development · v1.0.0 · MIT

PDF Toolkit

A document processing skill for generating, manipulating, merging, and extracting data from PDF files using libraries like pdf-lib, PDFKit, PyPDF2, pdfplumber, and ReportLab.

When to Use

Choose PDF Toolkit when:

  • Generating PDF reports, invoices, and certificates programmatically
  • Merging, splitting, and manipulating existing PDF documents
  • Extracting text, tables, and metadata from PDF files
  • Adding watermarks, headers, footers, and page numbers to PDFs

Consider alternatives when:

  • Creating editable documents — use Word or Google Docs format
  • Simple text output — use HTML or Markdown
  • Interactive forms — use web forms instead of PDF forms

Quick Start

# Python
pip install reportlab PyPDF2 pdfplumber

# Node.js
npm install pdf-lib pdfkit

import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';

// Shape of the invoice data used below (assumed; the original snippet did not define it).
interface InvoiceItem {
  name: string;
  quantity: number;
  price: number;
}

interface InvoiceData {
  invoiceNumber: string;
  date: string;
  items: InvoiceItem[];
  total: number;
}

async function createInvoice(data: InvoiceData): Promise<Uint8Array> {
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage([595, 842]); // A4 size in points
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const boldFont = await pdfDoc.embedFont(StandardFonts.HelveticaBold);
  const { height } = page.getSize();

  // Header (y is measured from the bottom of the page)
  page.drawText('INVOICE', { x: 50, y: height - 50, size: 28, font: boldFont, color: rgb(0.2, 0.2, 0.8) });
  page.drawText(`Invoice #: ${data.invoiceNumber}`, { x: 50, y: height - 90, size: 12, font });
  page.drawText(`Date: ${data.date}`, { x: 50, y: height - 110, size: 12, font });

  // Line items table: column headers followed by a rule
  let yPos = height - 180;
  const headers = ['Item', 'Qty', 'Price', 'Total'];
  const colX = [50, 300, 400, 480];
  headers.forEach((header, i) => {
    page.drawText(header, { x: colX[i], y: yPos, size: 11, font: boldFont });
  });
  yPos -= 5;
  page.drawLine({ start: { x: 50, y: yPos }, end: { x: 545, y: yPos }, thickness: 1 });
  yPos -= 20;

  // One row per item, moving down the page by 20 points each time
  for (const item of data.items) {
    page.drawText(item.name, { x: colX[0], y: yPos, size: 10, font });
    page.drawText(String(item.quantity), { x: colX[1], y: yPos, size: 10, font });
    page.drawText(`$${item.price.toFixed(2)}`, { x: colX[2], y: yPos, size: 10, font });
    page.drawText(`$${(item.quantity * item.price).toFixed(2)}`, { x: colX[3], y: yPos, size: 10, font });
    yPos -= 20;
  }

  // Total
  yPos -= 10;
  page.drawLine({ start: { x: 400, y: yPos }, end: { x: 545, y: yPos }, thickness: 1 });
  yPos -= 20;
  page.drawText(`Total: $${data.total.toFixed(2)}`, { x: 400, y: yPos, size: 14, font: boldFont });

  // save() resolves to the serialized PDF bytes
  return await pdfDoc.save();
}

Core Concepts

PDF Library Comparison

| Library | Language | Best For | Features |
|---|---|---|---|
| pdf-lib | JS/TS | Creating and modifying PDFs | Create, modify, merge, form fill |
| PDFKit | Node.js | Generating complex PDFs | Drawing API, vector graphics |
| ReportLab | Python | Enterprise PDF generation | Complex layouts, charts |
| PyPDF2 | Python | Manipulating existing PDFs | Merge, split, rotate, encrypt |
| pdfplumber | Python | Text and table extraction | Table detection, text position |
| Puppeteer | Node.js | HTML to PDF conversion | Browser rendering |

PDF Manipulation Operations

import os

import pdfplumber
from PyPDF2 import PdfReader, PdfWriter, PdfMerger


class PDFProcessor:
    def merge_pdfs(self, input_paths, output_path):
        """Concatenate the input PDFs, in order, into one file."""
        merger = PdfMerger()
        for path in input_paths:
            merger.append(path)
        merger.write(output_path)
        merger.close()

    def split_pdf(self, input_path, output_dir):
        """Write each page of the input to its own single-page PDF."""
        os.makedirs(output_dir, exist_ok=True)
        reader = PdfReader(input_path)
        for i, page in enumerate(reader.pages):
            writer = PdfWriter()
            writer.add_page(page)
            with open(os.path.join(output_dir, f"page_{i+1}.pdf"), 'wb') as f:
                writer.write(f)

    def add_watermark(self, input_path, watermark_path, output_path):
        """Overlay the first page of the watermark PDF onto every page."""
        reader = PdfReader(input_path)
        watermark = PdfReader(watermark_path).pages[0]
        writer = PdfWriter()
        for page in reader.pages:
            page.merge_page(watermark)
            writer.add_page(page)
        with open(output_path, 'wb') as f:
            writer.write(f)

    def extract_text(self, input_path):
        """Return each page's text followed by its tables as pipe-separated rows."""
        text = ''
        with pdfplumber.open(input_path) as pdf:
            for page in pdf.pages:
                # Guard against a missing result on pages with no extractable text
                text += (page.extract_text() or '') + '\n'
                for table in page.extract_tables():
                    for row in table:
                        # Empty table cells come back as None
                        text += ' | '.join('' if cell is None else str(cell) for cell in row) + '\n'
        return text

Configuration

| Option | Description | Default |
|---|---|---|
| page_size | Page dimensions: A4, letter, legal | "A4" |
| margins | Page margins in points | { top: 72, right: 72, bottom: 72, left: 72 } |
| default_font | Default font family | "Helvetica" |
| font_size | Default font size in points | 12 |
| compression | Enable PDF compression | true |
| metadata | PDF metadata (author, title, subject) | {} |
| encryption | PDF password protection | null |
| dpi | Image resolution for embedded images | 150 |
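
The options above could be modeled as a small settings object. The sketch below is one possible representation, not the skill's actual implementation: `PdfConfig`, `PAGE_SIZES`, and `content_box` are hypothetical names introduced for illustration. Page sizes are expressed in PDF points (1 point = 1/72 inch), matching the unit used for margins.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Standard page sizes in PDF points (width, height); 1 point = 1/72 inch.
PAGE_SIZES: Dict[str, Tuple[float, float]] = {
    "A4": (595.28, 841.89),
    "letter": (612.0, 792.0),
    "legal": (612.0, 1008.0),
}


def _default_margins() -> Dict[str, float]:
    # 72 points = 1 inch on every side, as in the table above.
    return {"top": 72, "right": 72, "bottom": 72, "left": 72}


@dataclass
class PdfConfig:
    """Hypothetical container mirroring the documented options and defaults."""
    page_size: str = "A4"
    margins: Dict[str, float] = field(default_factory=_default_margins)
    default_font: str = "Helvetica"
    font_size: int = 12
    compression: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)
    encryption: Optional[str] = None  # password, or None for no protection
    dpi: int = 150

    def content_box(self) -> Tuple[float, float]:
        """Return (width, height) of the area inside the margins, in points."""
        width, height = PAGE_SIZES[self.page_size]
        m = self.margins
        return (width - m["left"] - m["right"], height - m["top"] - m["bottom"])


config = PdfConfig(page_size="letter")
print(config.content_box())
```

Using `default_factory` for the dictionary fields gives each instance its own `margins` and `metadata`, so changing one configuration does not affect another.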

Best Practices

  1. Use HTML-to-PDF conversion for complex, CSS-styled layouts rather than positioning every element manually with coordinates; tools such as Puppeteer or wkhtmltopdf render the HTML/CSS for you, which is usually faster to develop and maintain
  2. Embed fonts if using non-standard fonts to ensure the PDF renders correctly on any system; relying on system fonts causes display issues when recipients do not have the same fonts installed
  3. Set proper metadata (title, author, subject, keywords) in every generated PDF to improve accessibility, searchability, and organization
  4. Optimize file size by compressing images before embedding, using vector graphics where possible, and removing unused embedded resources from manipulated PDFs
  5. Test PDF output with multiple viewers (Adobe Reader, Chrome, Preview) because rendering differences between viewers can cause layout issues that only appear on specific platforms
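
For the image-size advice in item 4, it can help to work out how many pixels an image actually needs before embedding it. The following is a minimal sketch of that arithmetic, assuming the image is drawn at a known size in points and a target resolution such as the 150 DPI default; `target_pixels` and `downscale_factor` are hypothetical helper names, and the resizing itself is left to whichever image library is in use.

```python
import math
from typing import Tuple

POINTS_PER_INCH = 72


def target_pixels(width_pt: float, height_pt: float, dpi: int = 150) -> Tuple[int, int]:
    """Pixel dimensions needed to draw an image at the given size in points."""
    return (
        math.ceil(width_pt / POINTS_PER_INCH * dpi),
        math.ceil(height_pt / POINTS_PER_INCH * dpi),
    )


def downscale_factor(src_px: Tuple[int, int], width_pt: float, height_pt: float,
                     dpi: int = 150) -> float:
    """Scale to apply to the source image; never larger than 1.0 (no upscaling)."""
    target_w, target_h = target_pixels(width_pt, height_pt, dpi)
    return min(1.0, target_w / src_px[0], target_h / src_px[1])


# A 1200x600 px image drawn 2 inches wide and 1 inch tall at 150 DPI
print(target_pixels(144, 72))                  # (300, 150)
print(downscale_factor((1200, 600), 144, 72))  # 0.25
```

An image that is already smaller than the target returns a factor of 1.0, so it is embedded unchanged rather than enlarged.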

Common Issues

Text extraction returning garbled output: Some PDFs use non-standard font encodings or are scanned images without OCR text layers. Check if the PDF has a text layer with pdfplumber, and if not, run OCR with pytesseract on page images extracted from the PDF.
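
One way to act on this is to flag pages whose extracted text is missing or suspiciously short and send only those to OCR. The sketch below assumes the per-page strings were already collected, for example from pdfplumber's `page.extract_text()` as in `extract_text` above; `pages_needing_ocr` and the `min_chars` threshold of 20 are illustrative assumptions, not part of any library.

```python
from typing import Iterable, List, Optional


def pages_needing_ocr(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose text layer is missing or nearly empty."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Treat None (no result) the same as an empty string.
        stripped = (text or "").strip()
        if len(stripped) < min_chars:
            flagged.append(number)
    return flagged


texts = ["Invoice number 12345, dated 2024-01-01", None, "   ", "x" * 25]
print(pages_needing_ocr(texts))  # [2, 3]
```

A length check only detects missing text layers. Garbled output caused by non-standard font encodings still yields plenty of characters, so it needs a separate check or manual inspection.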

Coordinate system confusion: PDF coordinates start from the bottom-left corner with Y increasing upward, which is opposite to most screen coordinate systems. Use helper functions that convert from top-left coordinates to PDF coordinates: pdfY = pageHeight - screenY.
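
A minimal sketch of such a helper follows. The name `to_pdf_y` and the optional `element_height` parameter are assumptions added for illustration: with the default of 0 it is exactly the formula above, while for boxes such as images or rectangles, which are typically anchored at their bottom-left corner, the element's height also has to be subtracted.

```python
def to_pdf_y(top_y: float, page_height: float, element_height: float = 0.0) -> float:
    """Convert a distance from the top of the page into a PDF y coordinate.

    top_y          -- distance from the top edge to the element's top, in points
    page_height    -- full page height in points
    element_height -- height of the element when it is anchored at its bottom edge
    """
    return page_height - top_y - element_height


A4_HEIGHT = 842  # points, as used in the Quick Start example

print(to_pdf_y(50, A4_HEIGHT))       # 792: a point 50 pt below the top edge
print(to_pdf_y(50, A4_HEIGHT, 100))  # 692: bottom edge of a 100 pt tall box
```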

Large PDFs causing memory issues: Processing PDFs with hundreds of pages or high-resolution images consumes significant memory. Process pages one at a time instead of loading the entire document, stream output to disk rather than holding it in memory, and reduce image DPI for embedded images.
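
The page-at-a-time idea can be sketched as a generator that yields fixed-size ranges of page indices, so that only one batch needs to be held in memory at once. `page_batches` is a hypothetical helper; how each batch is read and written depends on the PDF library in use and is not shown here.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive ranges of 0-based page indices, at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


# A 5-page document processed two pages at a time
for batch in page_batches(5, 2):
    print(list(batch))
```

Because the generator is lazy, the caller can write each batch's output to disk and release it before the next batch is requested.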
