
MarkItDown Complete

Convert documents, spreadsheets, presentations, and other file formats to clean Markdown text using Microsoft's MarkItDown library. This skill covers file conversion, batch processing, LLM-friendly text extraction, and integration into document processing pipelines.

When to Use This Skill

Choose MarkItDown Complete when you need to:

  • Convert PDFs, Word documents, or PowerPoint files to Markdown for LLM processing
  • Extract structured text from Excel spreadsheets or HTML pages
  • Build document ingestion pipelines that normalize diverse file formats
  • Prepare documents for RAG (Retrieval Augmented Generation) systems

Consider alternatives when:

  • You need OCR for scanned documents (use Tesseract or cloud OCR services)
  • You need to preserve exact document formatting (use format-native parsers)
  • You need to convert Markdown back to other formats (use Pandoc)

Quick Start

```shell
# Install MarkItDown
pip install markitdown
```

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert a PDF to Markdown
result = md.convert("research_paper.pdf")
print(result.text_content)

# Convert a Word document
result = md.convert("report.docx")
with open("report.md", "w") as f:
    f.write(result.text_content)

# Convert a PowerPoint presentation
result = md.convert("slides.pptx")
print(result.text_content[:500])
```

Core Concepts

Supported File Formats

| Format | Extension | Features Extracted |
| --- | --- | --- |
| PDF | `.pdf` | Text, headings, tables (text-based PDFs) |
| Word | `.docx` | Headings, paragraphs, tables, lists |
| PowerPoint | `.pptx` | Slide text, speaker notes, titles |
| Excel | `.xlsx` | Sheets as Markdown tables |
| HTML | `.html` | Structured content with links |
| CSV | `.csv` | Data as Markdown table |
| Images | `.jpg`, `.png` | EXIF data, optional OCR |
| Audio | `.mp3`, `.wav` | Transcription (with speech API) |
| ZIP | `.zip` | Recursively converts contained files |
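The format table above can double as a pipeline gate. The sketch below encodes it as an extension set; `SUPPORTED_EXTENSIONS` and `is_convertible` are illustrative names, not part of MarkItDown's API, and the set should be trimmed to match the optional dependencies you actually have installed.

```python
from pathlib import Path

# Extensions from the format table above; adjust to match your
# MarkItDown version and installed optional dependencies.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv",
    ".jpg", ".png", ".mp3", ".wav", ".zip",
}


def is_convertible(path):
    """Return True if the file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Checking membership on the lowercased suffix keeps the gate case-insensitive, so `Report.DOCX` passes just like `report.docx`.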

Batch Processing Pipeline

```python
from markitdown import MarkItDown
from pathlib import Path
import json


def batch_convert(input_dir, output_dir, extensions=None):
    """Convert all supported files in a directory to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    supported = extensions or [
        ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv"
    ]

    results = []
    for file in input_path.rglob("*"):
        if file.suffix.lower() not in supported:
            continue
        try:
            result = md.convert(str(file))

            # Write Markdown output
            out_file = output_path / f"{file.stem}.md"
            out_file.write_text(result.text_content)

            results.append({
                "source": str(file),
                "output": str(out_file),
                "chars": len(result.text_content),
                "status": "success"
            })
            print(f"Converted: {file.name} ({len(result.text_content)} chars)")
        except Exception as e:
            results.append({
                "source": str(file),
                "error": str(e),
                "status": "failed"
            })
            print(f"Failed: {file.name}: {e}")

    # Save conversion report
    with open(output_path / "conversion_report.json", "w") as f:
        json.dump(results, f, indent=2)

    return results


# Convert all documents in a directory
batch_convert("./raw_documents", "./markdown_output")
```

LLM-Ready Document Chunking

```python
from markitdown import MarkItDown
import re


def convert_and_chunk(filepath, chunk_size=2000, overlap=200):
    """Convert a document and split into LLM-friendly chunks."""
    md = MarkItDown()
    result = md.convert(filepath)
    text = result.text_content

    # Split by headers first
    sections = re.split(r'\n(#{1,3} .+)\n', text)

    chunks = []
    current_chunk = ""
    current_header = ""

    for section in sections:
        if section.startswith("#"):
            current_header = section
            continue

        content = f"{current_header}\n{section}" if current_header else section

        if len(current_chunk) + len(content) <= chunk_size:
            current_chunk += "\n\n" + content
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
                # Carry the last `overlap` characters forward so adjacent
                # chunks share context across the boundary
                current_chunk = current_chunk[-overlap:] + "\n\n" + content
            else:
                current_chunk = content

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


# Convert and chunk for RAG ingestion
chunks = convert_and_chunk("technical_manual.pdf")
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(chunk[:150] + "...")
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `enable_llm` | Use LLM for image descriptions | `false` |
| `llm_client` | OpenAI-compatible client for image analysis | `None` |
| `llm_model` | Model name for image descriptions | `"gpt-4o"` |
| `style_map` | Custom docx style-to-Markdown mappings | `None` |
| `page_separator` | String between PDF pages | `"\n\n---\n\n"` |
| `include_images` | Extract and reference images | `false` |
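One way to keep these settings in a single place is a plain options dict, shown below. The keys mirror the parameter table above, but the exact keyword names accepted by the `MarkItDown` constructor vary between library versions, so treat this as a sketch and verify the names against your installed version before passing them through.

```python
# Hypothetical settings mirroring the parameter table above; verify the
# keyword names against your installed MarkItDown version before use.
converter_options = {
    "enable_llm": False,
    "llm_client": None,
    "llm_model": "gpt-4o",
    "page_separator": "\n\n---\n\n",
    "include_images": False,
}

# md = MarkItDown(**converter_options)  # uncomment once names are verified
```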

Best Practices

  1. Pre-filter files by type before batch processing — Not all files in a directory are convertible. Check extensions against the supported list before attempting conversion. This avoids noisy error logs and speeds up batch processing.

  2. Handle encoding issues explicitly — Some older documents use non-UTF-8 encodings. Wrap conversions in try/except and fall back to encoding detection with chardet when the default conversion fails.
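The encoding fallback in point 2 can be as small as the sketch below. `decode_robust` is an illustrative helper, not a MarkItDown function: it tries UTF-8 first and falls back to Latin-1, which never raises; if chardet is installed, `chardet.detect(raw)["encoding"]` gives a better guess than Latin-1.

```python
def decode_robust(raw: bytes) -> str:
    """Decode document bytes as UTF-8, falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # With chardet installed, chardet.detect(raw)["encoding"] is a
        # better guess; Latin-1 is a last resort that always succeeds.
        return raw.decode("latin-1")
```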

  3. Post-process Markdown for consistency — MarkItDown's output varies by input format. Normalize heading levels, remove excessive blank lines, and standardize table formatting before feeding into downstream systems.
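The normalization in point 3 might look like this; `normalize_markdown` is a hypothetical helper illustrating the kinds of cleanup meant, not an exhaustive normalizer.

```python
import re


def normalize_markdown(text: str) -> str:
    """Apply light cleanup to converted Markdown."""
    # Collapse runs of 3+ newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Strip trailing whitespace from each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Ensure a space after heading markers ("##Title" -> "## Title")
    text = re.sub(r"^(#{1,6})([^#\s])", r"\1 \2", text, flags=re.MULTILINE)
    return text.strip() + "\n"
```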

  4. Chunk by semantic boundaries — When splitting converted text for LLM consumption, split on section headers or paragraph boundaries rather than fixed character counts. This preserves context within each chunk and improves retrieval quality.

  5. Keep the original file alongside Markdown — Store a reference to the source file path or hash in each Markdown output. This enables traceability when you find an issue in the converted text and need to check the original document.
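One way to record provenance, per point 5, is a small front-matter block carrying the source path and a content hash. `with_provenance` is an illustrative helper that takes the source file's raw bytes (read separately) so the hash is computed over the original document, not the converted text.

```python
import hashlib


def with_provenance(markdown_text: str, source_path: str,
                    source_bytes: bytes) -> str:
    """Prepend front matter recording the source file and its SHA-256."""
    digest = hashlib.sha256(source_bytes).hexdigest()
    header = f"---\nsource: {source_path}\nsha256: {digest}\n---\n\n"
    return header + markdown_text
```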

Common Issues

PDF conversion returns empty or garbled text — MarkItDown extracts text from text-based PDFs only. Scanned PDFs (images of pages) produce empty output. Check if the PDF is scanned by looking for embedded fonts; if none exist, use an OCR tool like Tesseract or Azure Document Intelligence first.
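A cheap complementary check is to inspect the conversion output itself: a text-based PDF should yield a meaningful amount of Markdown. The heuristic below (`looks_scanned` is an illustrative name, and the 50-character threshold is an assumption to tune) flags candidates for OCR routing.

```python
def looks_scanned(markdown_text: str, min_chars: int = 50) -> bool:
    """Heuristic: near-empty conversion output suggests a scanned PDF
    that should be routed to an OCR tool instead."""
    return len(markdown_text.strip()) < min_chars
```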

Table formatting is broken in Markdown output — Complex tables with merged cells, nested headers, or very wide columns don't convert cleanly to Markdown tables. Post-process by detecting malformed table rows (inconsistent pipe counts) and either reformatting or converting to plain text lists.
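Detecting malformed rows by inconsistent pipe counts, as suggested above, can be sketched like this (`malformed_table_rows` is a hypothetical helper; it compares each table row's pipe count against the first row of that table):

```python
def malformed_table_rows(markdown_text: str) -> list:
    """Return 1-based line numbers of table rows whose pipe count
    differs from the first row of their table."""
    bad = []
    expected = None
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("|"):
            count = stripped.count("|")
            if expected is None:
                expected = count  # first row sets the baseline
            elif count != expected:
                bad.append(lineno)
        else:
            expected = None  # non-table line ends the current table
    return bad
```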

Large files cause memory errors — Converting very large PowerPoint files (100+ slides) or Excel files (100k+ rows) can exhaust memory. Process large files in chunks: for Excel, read sheet by sheet; for PowerPoint, extract slides individually. Set a file size threshold (e.g., 50 MB) and switch to streaming mode above it.
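The size-threshold routing above reduces to a one-line check; `needs_streaming` and the 50 MB constant are illustrative, and the threshold should be tuned to your memory budget.

```python
# 50 MB threshold, per the guideline above; tune to your memory budget
STREAMING_THRESHOLD = 50 * 1024 * 1024


def needs_streaming(size_bytes: int) -> bool:
    """True when a file is large enough to warrant chunked processing
    (sheet-by-sheet for Excel, slide-by-slide for PowerPoint)."""
    return size_bytes > STREAMING_THRESHOLD
```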
