PDF Toolkit

A document processing skill for generating, manipulating, merging, and extracting data from PDF files using libraries like pdf-lib, PDFKit, PyPDF2, pdfplumber, and ReportLab.

When to Use

Choose PDF Toolkit when:

Generating PDF reports, invoices, and certificates programmatically
Merging, splitting, and manipulating existing PDF documents
Extracting text, tables, and metadata from PDF files
Adding watermarks, headers, footers, and page numbers to PDFs

Consider alternatives when:

Creating editable documents — use Word or Google Docs format
Simple text output — use HTML or Markdown
Interactive forms — use web forms instead of PDF forms

Quick Start


# Python
pip install reportlab PyPDF2 pdfplumber

# Node.js
npm install pdf-lib pdfkit


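import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';

// Input shape assumed by this example, inferred from the fields used below.
interface InvoiceItem {
  name: string;
  quantity: number;
  price: number;
}

interface InvoiceData {
  invoiceNumber: string;
  date: string;
  items: InvoiceItem[];
  total: number;
}

// Resolves to the serialized PDF bytes.
async function createInvoice(data: InvoiceData): Promise<Uint8Array> {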
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage([595, 842]); // A4 in points, rounded from 595.28 x 841.89
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const boldFont = await pdfDoc.embedFont(StandardFonts.HelveticaBold);

  const { width, height } = page.getSize();

  // Header
  page.drawText('INVOICE', {
    x: 50, y: height - 50,
    size: 28, font: boldFont, color: rgb(0.2, 0.2, 0.8)
  });

  page.drawText(`Invoice #: ${data.invoiceNumber}`, {
    x: 50, y: height - 90, size: 12, font
  });
  page.drawText(`Date: ${data.date}`, {
    x: 50, y: height - 110, size: 12, font
  });

  // Line items table
  let yPos = height - 180;
  const headers = ['Item', 'Qty', 'Price', 'Total'];
  const colX = [50, 300, 400, 480];

  headers.forEach((header, i) => {
    page.drawText(header, {
      x: colX[i], y: yPos, size: 11, font: boldFont
    });
  });

  yPos -= 5;
  page.drawLine({
    start: { x: 50, y: yPos },
    end: { x: 545, y: yPos },
    thickness: 1
  });

  yPos -= 20;
  for (const item of data.items) {
    page.drawText(item.name, { x: colX[0], y: yPos, size: 10, font });
    page.drawText(String(item.quantity), { x: colX[1], y: yPos, size: 10, font });
    page.drawText(`$${item.price.toFixed(2)}`, { x: colX[2], y: yPos, size: 10, font });
    page.drawText(`$${(item.quantity * item.price).toFixed(2)}`, {
      x: colX[3], y: yPos, size: 10, font
    });
    yPos -= 20;
  }

  // Total
  yPos -= 10;
  page.drawLine({ start: { x: 400, y: yPos }, end: { x: 545, y: yPos }, thickness: 1 });
  yPos -= 20;
  page.drawText(`Total: $${data.total.toFixed(2)}`, {
    x: 400, y: yPos, size: 14, font: boldFont
  });

  return await pdfDoc.save();
}

Core Concepts

PDF Library Comparison

Library	Language	Best For	Features
pdf-lib	JS/TS	Creating and modifying PDFs	Create, modify, merge, form fill
PDFKit	Node.js	Generating complex PDFs	Drawing API, vector graphics
ReportLab	Python	Enterprise PDF generation	Complex layouts, charts
PyPDF2	Python	Manipulating existing PDFs	Merge, split, rotate, encrypt
pdfplumber	Python	Text and table extraction	Table detection, text position
Puppeteer	Node.js	HTML to PDF conversion	Browser rendering

PDF Manipulation Operations


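import os

import pdfplumber
# PyPDF2 3.x API. The library continues as pypdf, where PdfWriter takes over PdfMerger's role.
from PyPDF2 import PdfReader, PdfWriter, PdfMerger


class PDFProcessor:
    def merge_pdfs(self, input_paths, output_path):
        """Concatenate the input PDFs, in order, into one file."""
        merger = PdfMerger()
        for path in input_paths:
            merger.append(path)
        merger.write(output_path)
        merger.close()

    def split_pdf(self, input_path, output_dir):
        """Write each page to its own file: page_1.pdf, page_2.pdf, ..."""
        os.makedirs(output_dir, exist_ok=True)
        reader = PdfReader(input_path)
        for i, page in enumerate(reader.pages, start=1):
            writer = PdfWriter()
            writer.add_page(page)
            with open(os.path.join(output_dir, f"page_{i}.pdf"), 'wb') as f:
                writer.write(f)

    def add_watermark(self, input_path, watermark_path, output_path):
        """Overlay the first page of watermark_path on every page of the input."""
        reader = PdfReader(input_path)
        watermark = PdfReader(watermark_path).pages[0]
        writer = PdfWriter()
        for page in reader.pages:
            page.merge_page(watermark)
            writer.add_page(page)
        with open(output_path, 'wb') as f:
            writer.write(f)

    def extract_text(self, input_path):
        """Return each page's text followed by its table rows, pipe-separated."""
        parts = []
        with pdfplumber.open(input_path) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for a page with no text layer
                parts.append(page.extract_text() or '')
                for table in page.extract_tables():
                    for row in table:
                        # Empty table cells come back as None
                        parts.append(' | '.join(cell or '' for cell in row))
        return ''.join(part + '\n' for part in parts)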

Configuration

Option	Description	Default
`page_size`	Page dimensions: A4, letter, legal	`"A4"`
`margins`	Page margins in points	`{ top: 72, right: 72, bottom: 72, left: 72 }`
`default_font`	Default font family	`"Helvetica"`
`font_size`	Default font size in points	`12`
`compression`	Enable PDF compression	`true`
`metadata`	PDF metadata (author, title, subject)	`{}`
`encryption`	PDF password protection	`null`
`dpi`	Image resolution for embedded images	`150`
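
As a rough illustration of how these options fit together, the sketch below mirrors the table in a small Python dataclass and derives the printable area from `page_size` and `margins`. The `PDFConfig` class, the `PAGE_SIZES` lookup, and the `content_box` helper are hypothetical names introduced for this example, and treating `encryption` as an optional password string is an assumption. The page dimensions are the standard sizes in points.

```python
from dataclasses import dataclass, field
from typing import Optional

# Standard page sizes in PDF points (1 pt = 1/72 inch), as (width, height).
PAGE_SIZES = {
    "A4": (595.28, 841.89),
    "letter": (612.0, 792.0),
    "legal": (612.0, 1008.0),
}


def _default_margins():
    # One inch (72 pt) on every side, matching the table's default.
    return {"top": 72, "right": 72, "bottom": 72, "left": 72}


@dataclass
class PDFConfig:
    """Hypothetical container mirroring the options table; not a library API."""
    page_size: str = "A4"
    margins: dict = field(default_factory=_default_margins)
    default_font: str = "Helvetica"
    font_size: int = 12
    compression: bool = True
    metadata: dict = field(default_factory=dict)
    encryption: Optional[str] = None  # assumed: a password, or None for no protection
    dpi: int = 150

    def content_box(self):
        """Return (x, y, width, height) of the printable area in points.

        (x, y) is the bottom-left corner, following PDF coordinates.
        """
        width, height = PAGE_SIZES[self.page_size]
        m = self.margins
        return (
            m["left"],
            m["bottom"],
            width - m["left"] - m["right"],
            height - m["top"] - m["bottom"],
        )
```

Calling `PDFConfig(page_size="letter").content_box()` gives the area left for content once the one-inch default margins are removed from a 612 x 792 point page.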

Best Practices

Use HTML-to-PDF conversion for complex layouts with CSS styling rather than positioning elements manually with coordinates: tools such as Puppeteer or wkhtmltopdf render the HTML and CSS for you, so layout changes are made in a stylesheet instead of by recalculating coordinates
Embed fonts if using non-standard fonts to ensure the PDF renders correctly on any system; relying on system fonts causes display issues when recipients do not have the same fonts installed
Set proper metadata (title, author, subject, keywords) in every generated PDF to improve accessibility, searchability, and organization
Optimize file size by compressing images before embedding, using vector graphics where possible, and removing unused embedded resources from manipulated PDFs
Test PDF output with multiple viewers (Adobe Reader, Chrome, Preview) because rendering differences between viewers can cause layout issues that only appear on specific platforms
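
The metadata practice above can be sketched as a small helper that builds a PDF document-information dictionary. The `/Title`, `/Author`, `/Subject`, and `/Keywords` keys are the standard information-dictionary entries. The function name `build_info_dict` and the rule that a title is mandatory are assumptions made for this example.

```python
def build_info_dict(title, author="", subject="", keywords=()):
    """Build a PDF document-information dictionary, skipping empty fields.

    Hypothetical helper: requiring a title is a policy chosen for this
    sketch, not something the PDF format enforces.
    """
    if not title:
        raise ValueError("title is required")
    fields = {
        "/Title": title,
        "/Author": author,
        "/Subject": subject,
        "/Keywords": ", ".join(keywords),
    }
    # Drop entries that were left empty so the PDF carries no blank fields.
    return {key: value for key, value in fields.items() if value}
```

A dictionary of this shape is what PyPDF2's `PdfWriter.add_metadata` accepts, so the result can be passed to it before the writer is saved.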

Common Issues

Text extraction returning garbled or empty output: Some PDFs use non-standard font encodings, which produces garbled text, and others are scanned images without an OCR text layer, which produces little or no text. Check whether the PDF has a text layer with pdfplumber, and if not, run OCR with pytesseract on page images extracted from the PDF.
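
One way to perform the text-layer check is sketched below. It accepts any iterable of page objects that expose an `extract_text()` method, such as the `pdf.pages` list from pdfplumber, and reports which pages look empty. The function name `pages_without_text` and the `min_chars` threshold are assumptions chosen for this example.

```python
def pages_without_text(pages, min_chars=20):
    """Return the 1-based numbers of pages whose extracted text is nearly empty.

    `pages` is any iterable of objects with an `extract_text()` method.
    `min_chars` is an arbitrary cut-off chosen for this sketch.
    """
    missing = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat None as "no text"
        if len(text.strip()) < min_chars:
            missing.append(number)
    return missing
```

Pages reported by this check are the candidates for the OCR pass; pages that do contain text can go through normal extraction.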

Coordinate system confusion: PDF coordinates start from the bottom-left corner with Y increasing upward, which is opposite to most screen coordinate systems. Use helper functions that convert from top-left coordinates to PDF coordinates: pdfY = pageHeight - screenY.
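
A minimal version of such a helper is sketched below. The name `to_pdf_y` and the optional `element_height` parameter are assumptions added for illustration. The extra parameter is there because drawing calls typically position an image or rectangle by its bottom edge, and text by its baseline.

```python
def to_pdf_y(top_y, page_height, element_height=0.0):
    """Convert a distance from the top edge into a PDF y coordinate.

    With element_height=0 this is the formula pdfY = pageHeight - screenY.
    Pass the element's height to keep its top edge at `top_y` when the
    drawing call positions it by its bottom edge.
    """
    return page_height - top_y - element_height
```

For an A4 page 842 points tall, `to_pdf_y(50, 842)` places a baseline 50 points below the top edge, matching the `height - 50` used for the invoice title in the Quick Start.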

Large PDFs causing memory issues: Processing PDFs with hundreds of pages or high-resolution images consumes significant memory. Process pages one at a time instead of loading the entire document, stream output to disk rather than holding it in memory, and reduce image DPI for embedded images.
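
The page-at-a-time approach can be sketched as follows, assuming page objects with an `extract_text()` method (for example pdfplumber pages) and any writable text stream. The function name `stream_text` and the form-feed page separator are choices made for this example.

```python
def stream_text(pages, out):
    """Write each page's text to `out` as soon as it is extracted.

    Only one page's text is held in memory at a time. Returns the number
    of pages written.
    """
    count = 0
    for page in pages:
        out.write(page.extract_text() or "")
        out.write("\f")  # form feed as a page separator (a choice for this sketch)
        count += 1
    return count
```

With pdfplumber this could be called as `stream_text(pdf.pages, f)` inside `with pdfplumber.open(path) as pdf, open("out.txt", "w") as f:`. The sketch only bounds the memory used by the extracted text; whether the library releases each page after use depends on the library and its version.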
