Doc Kit

All-in-one skill for reading, creating, and editing documents. Includes structured workflows, validation checks, and reusable patterns for document processing.

Skill · Cliptics · document processing · v1.0.0 · MIT

A practical skill for creating, reading, and editing DOCX documents programmatically. Covers document creation with professional formatting, content extraction, template-based generation, and converting between formats.

When to Use This Skill

Choose this skill when:

  • Creating DOCX files with tables, headers, images, and formatting
  • Extracting text content from DOCX files for analysis
  • Generating documents from templates with dynamic data
  • Converting DOCX to/from other formats (PDF, HTML, Markdown)
  • Building automated report generation pipelines

Consider alternatives when:

  • Working with PDFs → use a PDF processing skill
  • Creating presentations → use a PPTX skill
  • Working with spreadsheets → use an XLSX/spreadsheet skill
  • Need rich text editing in a web app → use a WYSIWYG editor

Quick Start

    # Python DOCX tools
    pip install python-docx

    # The pandoc CLI used below is a standalone tool; install it separately
    # (for example with your system package manager)

    # Node.js DOCX tools
    npm install docx

    # Create a professional DOCX document
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from docx.shared import Pt

    doc = Document()

    # Title (level=0 uses the built-in "Title" style)
    title = doc.add_heading('Quarterly Report', level=0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER

    # Subtitle: size the run, not subtitle.style, which would resize every
    # paragraph that shares the style
    subtitle = doc.add_paragraph('Q1 2024 Performance Summary')
    subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
    subtitle.runs[0].font.size = Pt(14)

    # Table: one header row plus three data rows
    table = doc.add_table(rows=4, cols=3, style='Light Grid Accent 1')
    headers = ['Metric', 'Target', 'Actual']
    for i, header in enumerate(headers):
        table.rows[0].cells[i].text = header

    data = [
        ['Revenue', '$1.2M', '$1.35M'],
        ['Users', '10,000', '12,500'],
        ['Churn', '< 5%', '3.2%'],
    ]
    for row_idx, row_data in enumerate(data, 1):
        for col_idx, value in enumerate(row_data):
            table.rows[row_idx].cells[col_idx].text = value

    doc.save('quarterly_report.docx')

Core Concepts

DOCX Operations Matrix

| Operation | Python (python-docx) | CLI (pandoc) |
| --- | --- | --- |
| Create document | Document() | pandoc input.md -o file.docx |
| Add text | doc.add_paragraph() | Markdown input |
| Add table | doc.add_table() | Pipe table in Markdown |
| Add image | doc.add_picture() | ![](image.png) |
| Add heading | doc.add_heading() | # Heading in Markdown |
| Convert to PDF | Not supported directly; call libreoffice --convert-to pdf | pandoc file.docx -o file.pdf (requires a PDF engine) |
| Extract text | Read paragraphs/tables | pandoc file.docx -t plain |
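
The CLI entries in the matrix can be driven from Python with subprocess. The following is a minimal sketch, assuming pandoc and LibreOffice's soffice binary are installed and on PATH. The helper names pandoc_command, libreoffice_pdf_command, and run_if_available are hypothetical and not part of any library.

```python
import shutil
import subprocess
from typing import List, Optional


def pandoc_command(src: str, dest: str, to_format: Optional[str] = None) -> List[str]:
    """Build a pandoc invocation; pandoc infers formats from file extensions."""
    cmd = ["pandoc", src, "-o", dest]
    if to_format:  # e.g. "plain" for text extraction
        cmd += ["-t", to_format]
    return cmd


def libreoffice_pdf_command(src: str, out_dir: str = ".") -> List[str]:
    """Build a headless LibreOffice invocation that writes the PDF into out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, src]


def run_if_available(cmd: List[str]) -> bool:
    """Run cmd only when its executable is on PATH; return True on exit code 0."""
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, check=False).returncode == 0
```

Building the argument list separately from running it keeps the commands easy to log and to test without either tool installed.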

Template-Based Generation

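    # Fill a template that contains {{key}} placeholders
    from docx import Document


    def _replace_in_paragraph(paragraph, data: dict) -> None:
        """Replace every {{key}} in one paragraph, keeping run formatting where possible."""
        runs = paragraph.runs
        for key, value in data.items():
            placeholder = f'{{{{{key}}}}}'  # renders as {{key}}
            # Preferred path: replace inside each run so inline formatting survives
            for run in runs:
                if placeholder in run.text:
                    run.text = run.text.replace(placeholder, str(value))
            # Fallback: Word often splits a placeholder across several runs.
            # Collapse the text into the first run, which keeps that run's formatting.
            joined = ''.join(run.text for run in runs)
            if placeholder in joined:
                runs[0].text = joined.replace(placeholder, str(value))
                for run in runs[1:]:
                    run.text = ''


    def fill_template(template_path: str, data: dict, output_path: str) -> None:
        doc = Document(template_path)

        for paragraph in doc.paragraphs:
            _replace_in_paragraph(paragraph, data)

        # Table cells hold their own paragraphs; assigning cell.text would drop formatting
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    for paragraph in cell.paragraphs:
                        _replace_in_paragraph(paragraph, data)

        doc.save(output_path)


    # Usage
    fill_template('template.docx', {
        'company_name': 'Acme Corp',
        'date': '2024-03-15',
        'total': '$1,350,000',
    }, 'output.docx')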
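
Before filling a template it can help to confirm that the data covers every placeholder the template uses. The sketch below works on plain strings, assuming the {{key}} syntax shown above with keys made of letters, digits, and underscores. The names PLACEHOLDER_RE, find_placeholders, and missing_keys are hypothetical.

```python
import re

# Matches {{key}} where key is letters, digits, or underscores (an assumption)
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def find_placeholders(texts):
    """Return the set of placeholder keys found in an iterable of strings."""
    found = set()
    for text in texts:
        found.update(PLACEHOLDER_RE.findall(text))
    return found


def missing_keys(texts, data):
    """Return, sorted, the keys the template uses but the data dict lacks."""
    return sorted(find_placeholders(texts) - set(data))
```

With python-docx, the strings could come from (p.text for p in doc.paragraphs); raising an error when missing_keys returns a non-empty list avoids shipping a document that still contains raw placeholders.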

Content Extraction

    from docx import Document


    def extract_docx_content(path: str) -> dict:
        """Collect paragraphs, headings, and table rows from a DOCX file."""
        doc = Document(path)
        content = {'paragraphs': [], 'tables': [], 'headings': []}

        for para in doc.paragraphs:
            style_name = para.style.name
            if style_name.startswith('Heading'):
                # "Heading 2" -> 2; styles without a trailing number default to 1
                last = style_name.split()[-1]
                level = int(last) if last.isdigit() else 1
                content['headings'].append({'level': level, 'text': para.text})
            content['paragraphs'].append({
                'text': para.text,
                'style': style_name,
                # True only when a run sets bold explicitly; inherited bold is None
                'bold': any(run.bold for run in para.runs),
            })

        for table in doc.tables:
            # Each table becomes a list of rows, each row a list of cell texts
            content['tables'].append(
                [[cell.text for cell in row.cells] for row in table.rows]
            )

        return content
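
The extracted tables are lists of rows, which are awkward to analyze directly. As a small follow-up sketch, assuming the first row of each table is a header row (as in the Quick Start table), a hypothetical helper table_to_records can turn one table into a list of dictionaries:

```python
def table_to_records(rows):
    """Turn [[header, ...], [value, ...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    keys = [cell.strip() for cell in header]
    # zip stops at the shorter sequence, so ragged rows lose trailing cells
    return [dict(zip(keys, (cell.strip() for cell in row))) for row in body]
```

For example, table_to_records(content['tables'][0]) would yield one dictionary per data row of the first table.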

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| library | string | 'python-docx' | Library: python-docx, docx (Node), or pandoc |
| defaultFont | string | 'Calibri' | Default document font |
| defaultFontSize | number | 11 | Default font size (points) |
| pageMargins | object | {top: 1, right: 1} | Page margins (inches) |
| tableStyle | string | 'Light Grid' | Default table style |
| templateDir | string | './templates' | Directory for document templates |
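
The source does not say how these parameters are loaded, so the following is only one possible representation: a hypothetical DocKitConfig dataclass that mirrors the table's defaults and rejects obviously invalid values. The class name and the snake_case field names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

SUPPORTED_LIBRARIES = ("python-docx", "docx", "pandoc")


@dataclass
class DocKitConfig:
    library: str = "python-docx"
    default_font: str = "Calibri"
    default_font_size: int = 11  # points
    # Inches; only the keys listed in the table above are given defaults
    page_margins: dict = field(default_factory=lambda: {"top": 1, "right": 1})
    table_style: str = "Light Grid"
    template_dir: str = "./templates"

    def __post_init__(self):
        # Fail early on values the table does not allow
        if self.library not in SUPPORTED_LIBRARIES:
            raise ValueError(f"unsupported library: {self.library!r}")
        if self.default_font_size <= 0:
            raise ValueError("default_font_size must be positive")
```

Using default_factory for the margins gives each instance its own dictionary, so changing one configuration does not affect another.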

Best Practices

  1. Use templates instead of building documents from scratch — Create a properly formatted template in Word with placeholder text, then fill it programmatically. This preserves complex formatting that's difficult to reproduce in code.

  2. Work with runs, not paragraphs, for inline formatting: a paragraph can contain multiple runs with different formatting (bold, italic, different fonts). Replace text at the run level to preserve those differences, and remember that Word may split a single word or placeholder across several runs.

  3. Use pandoc for format conversion rather than building converters: pandoc converts Markdown → DOCX, DOCX → HTML, and many other format pairs. PDF output additionally requires a PDF engine such as a LaTeX installation. Don't reinvent conversion logic.

  4. Validate document structure before processing — Check that required sections, tables, and headings exist before attempting to extract or modify content. Missing structure should produce clear error messages.

  5. Test with documents from different Word versions — DOCX files from Word 2016, 2019, 365, and LibreOffice have subtle format differences. Test your code with documents from all common sources your users will provide.
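
Practice 4 can be sketched as a small check over the dictionary returned by extract_docx_content. This is a minimal sketch, assuming that dictionary's 'headings' and 'tables' keys; the function name validate_structure and its parameters are hypothetical.

```python
def validate_structure(content, required_headings=(), min_tables=0):
    """Return a list of readable problems; an empty list means the checks passed."""
    problems = []
    # Compare headings case-insensitively and ignore surrounding whitespace
    present = {h["text"].strip().lower() for h in content.get("headings", [])}
    for heading in required_headings:
        if heading.strip().lower() not in present:
            problems.append(f"missing heading: {heading!r}")
    table_count = len(content.get("tables", []))
    if table_count < min_tables:
        problems.append(f"expected at least {min_tables} table(s), found {table_count}")
    return problems
```

Returning every problem at once, instead of raising on the first, lets a pipeline report all structural gaps in a single message.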

Common Issues

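Formatting lost when replacing text: assigning to paragraph.text replaces all of the paragraph's runs with a single run, which strips inline formatting. Instead, iterate over paragraph.runs and replace text within individual runs to preserve bold, italic, font, and color formatting. If the target text is split across runs, no single run contains it, so handle that case explicitly.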

Images not appearing in generated documents — Image paths must be valid at generation time. Use absolute paths or ensure relative paths resolve correctly. Check that image dimensions don't exceed page width minus margins.
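
Both checks can be done before the image is added. The following sketch works in inches and uses only the standard library; the helper names resolve_image and fit_image_width are hypothetical.

```python
from pathlib import Path


def resolve_image(path, base_dir="."):
    """Resolve a possibly relative image path and fail early if the file is missing."""
    resolved = (Path(base_dir) / path).resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"image not found: {resolved}")
    return resolved


def fit_image_width(image_width_in, page_width_in, left_margin_in, right_margin_in):
    """Return the width in inches to use so the image fits between the margins."""
    usable = page_width_in - left_margin_in - right_margin_in
    if usable <= 0:
        raise ValueError("margins leave no usable page width")
    return min(image_width_in, usable)
```

With python-docx, the page and margin widths are available on doc.sections[0] (page_width, left_margin, right_margin), and the result can be passed to doc.add_picture through width=Inches(...).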

Merged cells break table extraction: python-docx has only partial support for merged cells. A merged cell is reported once for every grid position it spans, so the same text appears several times when iterating row.cells. Deduplicate repeated cells before processing, or convert the document with pandoc for table extraction.
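
One way to deduplicate is to track the underlying table-cell element. The sketch below relies on _tc, a private python-docx attribute that points at the cell's XML element; treating it as stable is an assumption, since private attributes can change between releases. The helper name unique_cell_texts is hypothetical.

```python
def unique_cell_texts(cells):
    """Return cell texts in order, skipping repeats of the same merged cell."""
    seen = set()
    texts = []
    for cell in cells:
        # _tc is private to python-docx (assumption); fall back to the cell itself
        key = getattr(cell, "_tc", cell)
        if key in seen:
            continue
        seen.add(key)
        texts.append(cell.text)
    return texts
```

Passing row.cells for each row gives one text per distinct cell in that row. To skip repeats of vertically merged cells as well, pass every cell of the table in one call.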
