Doc Kit

A practical skill for creating, reading, and editing DOCX documents programmatically. Covers document creation with professional formatting, content extraction, template-based generation, and converting between formats.

When to Use This Skill

Choose this skill when:

Creating DOCX files with tables, headers, images, and formatting
Extracting text content from DOCX files for analysis
Generating documents from templates with dynamic data
Converting DOCX to/from other formats (PDF, HTML, Markdown)
Building automated report generation pipelines

Consider alternatives when:

Working with PDFs → use a PDF processing skill
Creating presentations → use a PPTX skill
Working with spreadsheets → use an XLSX/spreadsheet skill
Adding rich text editing to a web app → use a WYSIWYG editor

Quick Start


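# Python DOCX tools
pip install python-docx
# pandoc is a standalone command-line tool, not a pip dependency of python-docx;
# install it with your OS package manager (for example: apt install pandoc, brew install pandoc)

# Node.js DOCX tools
npm install docx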


# Create a professional DOCX document
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Title
title = doc.add_heading('Quarterly Report', level=0)
title.alignment = WD_ALIGN_PARAGRAPH.CENTER

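# Subtitle: size the run itself; changing subtitle.style.font would resize
# every paragraph that shares the same style
subtitle = doc.add_paragraph('Q1 2024 Performance Summary')
subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
subtitle.runs[0].font.size = Pt(14)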

# Add table
table = doc.add_table(rows=4, cols=3, style='Light Grid Accent 1')
headers = ['Metric', 'Target', 'Actual']
for i, header in enumerate(headers):
    table.rows[0].cells[i].text = header

data = [
    ['Revenue', '$1.2M', '$1.35M'],
    ['Users', '10,000', '12,500'],
    ['Churn', '< 5%', '3.2%'],
]
for row_idx, row_data in enumerate(data, 1):
    for col_idx, value in enumerate(row_data):
        table.rows[row_idx].cells[col_idx].text = value

doc.save('quarterly_report.docx')

Core Concepts

DOCX Operations Matrix

Operation	Python (python-docx)	CLI (pandoc)
Create document	`Document()`	`pandoc input.md -o file.docx`
Add text	`doc.add_paragraph()`	Markdown input
Add table	`doc.add_table()`	Pipe-delimited in Markdown
Add image	`doc.add_picture()`	`![](image.png)`
Add heading	`doc.add_heading()`	`# Heading` in Markdown
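Convert to PDF	Not supported by python-docx; shell out to `libreoffice --headless --convert-to pdf file.docx`	`pandoc file.docx -o file.pdf` (needs a PDF engine such as LaTeX)
Extract text	Read paragraphs/tables	`pandoc file.docx -t plain`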
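
The pandoc column of the matrix can also be driven from Python. The following is a minimal sketch, assuming pandoc is installed separately and available on PATH; `build_pandoc_command` and `convert` are hypothetical helper names, not part of any library.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pandoc_command(source: str, target: str,
                         media_dir: Optional[str] = None) -> List[str]:
    """Build a pandoc command line; pandoc infers formats from file extensions."""
    command = ['pandoc', source, '-o', target]
    if media_dir is not None:
        # --extract-media writes embedded images into media_dir
        command.append(f'--extract-media={media_dir}')
    return command


def convert(source: str, target: str, media_dir: Optional[str] = None) -> None:
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which('pandoc') is None:
        raise RuntimeError('pandoc executable not found on PATH')
    subprocess.run(build_pandoc_command(source, target, media_dir), check=True)


# Example (hypothetical file names):
# convert('report.docx', 'report.md', media_dir='media')
```

Keeping the command construction separate from the subprocess call makes the conversion step easy to log and to unit test without pandoc present.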

Template-Based Generation


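# Fill a template that contains {{placeholder}} markers
from docx import Document

def replace_in_paragraph(paragraph, placeholder: str, value: str):
    runs = paragraph.runs
    if placeholder not in ''.join(run.text for run in runs):
        return
    # Preferred path: the placeholder sits inside a single run, so formatting is kept
    for run in runs:
        if placeholder in run.text:
            run.text = run.text.replace(placeholder, value)
    # Fallback: Word split the placeholder across runs. Merge the text into the
    # first run (its formatting wins) and empty the remaining runs.
    merged = ''.join(run.text for run in runs)
    if placeholder in merged:
        runs[0].text = merged.replace(placeholder, value)
        for run in runs[1:]:
            run.text = ''

def fill_template(template_path: str, data: dict, output_path: str):
    doc = Document(template_path)

    # Body paragraphs first, then every paragraph inside table cells
    paragraphs = list(doc.paragraphs)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                paragraphs.extend(cell.paragraphs)

    for paragraph in paragraphs:
        for key, value in data.items():
            placeholder = f'{{{{{key}}}}}'  # {{key}}
            replace_in_paragraph(paragraph, placeholder, str(value))

    doc.save(output_path)

# Usage
fill_template('template.docx', {
    'company_name': 'Acme Corp',
    'date': '2024-03-15',
    'total': '$1,350,000',
}, 'output.docx')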
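
Before filling a template it can help to check that every placeholder has a value. The sketch below is one possible approach, assuming placeholders use the exact `{{key}}` form shown above with word characters only; `find_placeholders` and `missing_keys` are hypothetical helpers that work on plain text (for example, the concatenated paragraph text of a template).

```python
import re

# Assumption: placeholders look like {{company_name}}, with no inner spaces
PLACEHOLDER_RE = re.compile(r'\{\{(\w+)\}\}')


def find_placeholders(text: str) -> set:
    """Return the set of placeholder keys found in a piece of text."""
    return set(PLACEHOLDER_RE.findall(text))


def missing_keys(template_text: str, data: dict) -> set:
    """Return placeholder keys present in the template but absent from data."""
    return find_placeholders(template_text) - set(data)


sample = 'Invoice for {{company_name}} dated {{date}}. Total: {{total}}'
print(sorted(missing_keys(sample, {'company_name': 'Acme Corp', 'date': '2024-03-15'})))
# prints ['total']
```

Running the same check on the saved output is a cheap way to catch placeholders that were never replaced.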

Content Extraction


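import re
from docx import Document

def extract_docx_content(path: str) -> dict:
    doc = Document(path)
    content = {
        'paragraphs': [],
        'tables': [],
        'headings': [],
    }

    # doc.paragraphs covers body paragraphs only; table text is collected below
    for para in doc.paragraphs:
        style_name = para.style.name
        if style_name.startswith('Heading'):
            # Built-in styles are named 'Heading 1' to 'Heading 9'; fall back to level 1
            match = re.search(r'(\d+)$', style_name)
            level = int(match.group(1)) if match else 1
            content['headings'].append({'level': level, 'text': para.text})
        content['paragraphs'].append({
            'text': para.text,
            'style': style_name,
            # True only when a run is explicitly bold; bold inherited from a style reads as None
            'bold': any(run.bold for run in para.runs),
        })

    for table in doc.tables:
        # Merged cells appear once per grid position, so their text can repeat within a row
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        content['tables'].append(rows)

    return content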
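
As a usage illustration, the `headings` list returned above can be turned into an indented outline. This is a small sketch that only assumes the `{'level': ..., 'text': ...}` shape produced by `extract_docx_content`; `build_outline` is a hypothetical helper and the sample headings are made up.

```python
def build_outline(headings: list) -> str:
    """Render heading dicts as an indented, dash-prefixed outline."""
    lines = []
    for heading in headings:
        # Level 1 is flush left; each deeper level adds two spaces
        indent = '  ' * max(heading['level'] - 1, 0)
        lines.append(f"{indent}- {heading['text']}")
    return '\n'.join(lines)


sample_headings = [
    {'level': 1, 'text': 'Summary'},
    {'level': 2, 'text': 'Revenue'},
    {'level': 1, 'text': 'Outlook'},
]
print(build_outline(sample_headings))
# - Summary
#   - Revenue
# - Outlook
```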

Configuration

Parameter	Type	Default	Description
`library`	string	`'python-docx'`	Library: python-docx, docx (Node), or pandoc
`defaultFont`	string	`'Calibri'`	Default document font
`defaultFontSize`	number	`11`	Default font size (points)
`pageMargins`	object	`{top: 1, right: 1}`	Page margins (inches)
`tableStyle`	string	`'Light Grid'`	Default table style
`templateDir`	string	`'./templates'`	Directory for document templates

Best Practices

Use templates instead of building documents from scratch: Create a properly formatted template in Word with placeholder text, then fill it programmatically. This preserves complex formatting that's difficult to reproduce in code.
Work with runs, not paragraphs, for inline formatting: A paragraph can contain multiple runs with different formatting (bold, italic, different fonts). Replace text at the run level to preserve those differences, and remember that Word may split a single placeholder across several runs.
Use pandoc for format conversion rather than building converters: Pandoc converts Markdown → DOCX, DOCX → HTML, DOCX → Markdown, and many other format pairs. DOCX → PDF also works, but it needs a separate PDF engine (such as a LaTeX distribution) and does not reproduce Word's page layout. Don't reinvent conversion logic.
Validate document structure before processing: Check that required sections, tables, and headings exist before attempting to extract or modify content. Missing structure should produce clear error messages.
Test with documents from different Word versions: DOCX files from Word 2016, 2019, 365, and LibreOffice have subtle format differences. Test your code with documents from every source your users are likely to provide.
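
The structure check recommended above could look like the following sketch. It assumes the dictionary shape returned by `extract_docx_content` in the Content Extraction section; `validate_structure` and its parameters are hypothetical names chosen for illustration.

```python
def validate_structure(content: dict, required_headings: list,
                       min_tables: int = 0) -> None:
    """Raise ValueError with a readable message if expected structure is missing."""
    found = {h['text'].strip().lower() for h in content['headings']}
    missing = [name for name in required_headings
               if name.strip().lower() not in found]

    problems = []
    if missing:
        problems.append('missing headings: ' + ', '.join(missing))
    table_count = len(content['tables'])
    if table_count < min_tables:
        problems.append(
            f'expected at least {min_tables} table(s), found {table_count}')

    if problems:
        raise ValueError('Document structure check failed: ' + '; '.join(problems))


# Example with made-up content
content = {'headings': [{'level': 1, 'text': 'Summary'}], 'tables': [], 'paragraphs': []}
validate_structure(content, ['Summary'])            # passes silently
# validate_structure(content, ['Summary', 'Risks'], min_tables=1) would raise ValueError
```

Collecting all problems before raising lets a single run report everything that is wrong with a document instead of stopping at the first gap.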

Common Issues

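Formatting lost when replacing text: Assigning to paragraph.text replaces all existing runs with a single run, which discards run-level formatting (bold, italic, font, color). Instead, iterate over paragraph.runs and replace text within individual runs. If Word has split a placeholder across runs, no single run contains it, so merge the affected runs before replacing.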

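Images not appearing in generated documents: python-docx reads the image file when add_picture() is called, so the path must be valid at generation time. Use absolute paths or make sure relative paths resolve against the working directory. Pass an explicit width so the image does not exceed the page width minus the left and right margins.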
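
The width check can be reduced to a small calculation. This is a sketch under stated assumptions: all values are in inches, and `fit_image_width` is a hypothetical helper rather than a python-docx function.

```python
def fit_image_width(image_width_in: float, page_width_in: float,
                    left_margin_in: float, right_margin_in: float) -> float:
    """Return the image width in inches, capped at the usable page width."""
    usable = page_width_in - left_margin_in - right_margin_in
    if usable <= 0:
        raise ValueError('margins leave no usable width')
    return min(image_width_in, usable)


# US Letter page (8.5 in wide) with 1 in margins leaves 6.5 in of usable width
print(fit_image_width(10.0, 8.5, 1.0, 1.0))  # 6.5
print(fit_image_width(4.0, 8.5, 1.0, 1.0))   # 4.0
```

With python-docx, the inputs would typically come from the first section (`doc.sections[0].page_width.inches`, `left_margin.inches`, `right_margin.inches`), and the result can be passed as `doc.add_picture(path, width=Inches(width))`.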

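Merged cells break table extraction: python-docx reports a merged cell once for every grid position it spans, so row.cells returns the same cell, with the same text, several times. Note that cell.merge() is a method for creating merges, not a property for detecting them. To skip the repeats, compare the underlying XML element of each cell with those already seen, or use pandoc for table extraction.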
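
One way to drop the repeated entries is to compare the XML element behind each cell. The sketch below assumes cell objects expose that element as `_tc`, which python-docx cells do (it is a private attribute, so treat this as a workaround rather than a supported API); `unique_row_texts` is a hypothetical helper. It handles horizontal merges within one row; vertically merged cells still repeat across rows.

```python
from types import SimpleNamespace


def unique_row_texts(cells) -> list:
    """Return cell texts for one row, skipping repeats of a merged cell."""
    seen = []   # keep references so identity comparisons stay valid
    texts = []
    for cell in cells:
        if any(cell._tc is element for element in seen):
            continue  # same underlying element: a repeat of a merged cell
        seen.append(cell._tc)
        texts.append(cell.text)
    return texts


# Stand-in objects that mimic a row whose first cell spans two columns
merged_element, plain_element = object(), object()
row_cells = [
    SimpleNamespace(_tc=merged_element, text='Region'),
    SimpleNamespace(_tc=merged_element, text='Region'),
    SimpleNamespace(_tc=plain_element, text='Total'),
]
print(unique_row_texts(row_cells))  # ['Region', 'Total']
```

With a real document, the same function would be called as `unique_row_texts(row.cells)` for each row in `table.rows`.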
