
MarkItDown Complete

Convert documents, spreadsheets, presentations, and other file formats to clean Markdown text using Microsoft's MarkItDown library. This skill covers file conversion, batch processing, LLM-friendly text extraction, and integration into document processing pipelines.

When to Use This Skill

Choose MarkItDown Complete when you need to:

  • Convert PDFs, Word documents, or PowerPoint files to Markdown for LLM processing
  • Extract structured text from Excel spreadsheets or HTML pages
  • Build document ingestion pipelines that normalize diverse file formats
  • Prepare documents for RAG (Retrieval Augmented Generation) systems

Consider alternatives when:

  • You need OCR for scanned documents (use Tesseract or cloud OCR services)
  • You need to preserve exact document formatting (use format-native parsers)
  • You need to convert Markdown back to other formats (use Pandoc)

Quick Start

```shell
# Install MarkItDown
pip install markitdown
```

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert a PDF to Markdown
result = md.convert("research_paper.pdf")
print(result.text_content)

# Convert a Word document
result = md.convert("report.docx")
with open("report.md", "w") as f:
    f.write(result.text_content)

# Convert a PowerPoint presentation
result = md.convert("slides.pptx")
print(result.text_content[:500])
```

Core Concepts

Supported File Formats

| Format | Extension | Features Extracted |
| --- | --- | --- |
| PDF | `.pdf` | Text, headings, tables (text-based PDFs) |
| Word | `.docx` | Headings, paragraphs, tables, lists |
| PowerPoint | `.pptx` | Slide text, speaker notes, titles |
| Excel | `.xlsx` | Sheets as Markdown tables |
| HTML | `.html` | Structured content with links |
| CSV | `.csv` | Data as Markdown table |
| Images | `.jpg`, `.png` | EXIF data, optional OCR |
| Audio | `.mp3`, `.wav` | Transcription (with speech API) |
| ZIP | `.zip` | Recursively converts contained files |
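The format table above can double as a pipeline gate. The sketch below encodes it as an extension set; `SUPPORTED_EXTENSIONS` and `is_convertible` are illustrative names, not part of MarkItDown's API, and the set should be trimmed to match the optional dependencies you actually have installed.

```python
from pathlib import Path

# Extensions from the format table above; adjust to match your
# MarkItDown version and installed optional dependencies.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv",
    ".jpg", ".png", ".mp3", ".wav", ".zip",
}


def is_convertible(path):
    """Return True if the file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Checking membership on the lowercased suffix keeps the gate case-insensitive, so `Report.DOCX` passes just like `report.docx`.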

Batch Processing Pipeline

```python
from markitdown import MarkItDown
from pathlib import Path
import json


def batch_convert(input_dir, output_dir, extensions=None):
    """Convert all supported files in a directory to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    supported = extensions or [
        ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv"
    ]

    results = []
    for file in input_path.rglob("*"):
        if file.suffix.lower() not in supported:
            continue
        try:
            result = md.convert(str(file))

            # Write Markdown output
            out_file = output_path / f"{file.stem}.md"
            out_file.write_text(result.text_content)

            results.append({
                "source": str(file),
                "output": str(out_file),
                "chars": len(result.text_content),
                "status": "success"
            })
            print(f"Converted: {file.name} ({len(result.text_content)} chars)")
        except Exception as e:
            results.append({
                "source": str(file),
                "error": str(e),
                "status": "failed"
            })
            print(f"Failed: {file.name}: {e}")

    # Save conversion report
    with open(output_path / "conversion_report.json", "w") as f:
        json.dump(results, f, indent=2)

    return results


# Convert all documents in a directory
batch_convert("./raw_documents", "./markdown_output")
```

LLM-Ready Document Chunking

```python
from markitdown import MarkItDown
import re


def convert_and_chunk(filepath, chunk_size=2000, overlap=200):
    """Convert a document and split into LLM-friendly chunks."""
    md = MarkItDown()
    result = md.convert(filepath)
    text = result.text_content

    # Split by headers first
    sections = re.split(r'\n(#{1,3} .+)\n', text)

    chunks = []
    current_chunk = ""
    current_header = ""

    for section in sections:
        if section.startswith("#"):
            current_header = section
            continue

        content = f"{current_header}\n{section}" if current_header else section

        if len(current_chunk) + len(content) <= chunk_size:
            current_chunk += "\n\n" + content
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
                # Carry the last `overlap` characters forward so adjacent
                # chunks share context across the boundary
                current_chunk = current_chunk[-overlap:] + "\n\n" + content
            else:
                current_chunk = content

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


# Convert and chunk for RAG ingestion
chunks = convert_and_chunk("technical_manual.pdf")
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(chunk[:150] + "...")
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `enable_llm` | Use LLM for image descriptions | `false` |
| `llm_client` | OpenAI-compatible client for image analysis | `None` |
| `llm_model` | Model name for image descriptions | `"gpt-4o"` |
| `style_map` | Custom docx style-to-Markdown mappings | `None` |
| `page_separator` | String between PDF pages | `"\n\n---\n\n"` |
| `include_images` | Extract and reference images | `false` |
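One way to keep these settings in a single place is a plain options dict, shown below. The keys mirror the parameter table above, but the exact keyword names accepted by the `MarkItDown` constructor vary between library versions, so treat this as a sketch and verify the names against your installed version before passing them through.

```python
# Hypothetical settings mirroring the parameter table above; verify the
# keyword names against your installed MarkItDown version before use.
converter_options = {
    "enable_llm": False,
    "llm_client": None,
    "llm_model": "gpt-4o",
    "page_separator": "\n\n---\n\n",
    "include_images": False,
}

# md = MarkItDown(**converter_options)  # uncomment once names are verified
```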

Best Practices

  1. Pre-filter files by type before batch processing — Not all files in a directory are convertible. Check extensions against the supported list before attempting conversion. This avoids noisy error logs and speeds up batch processing.

  2. Handle encoding issues explicitly — Some older documents use non-UTF-8 encodings. Wrap conversions in try/except and fall back to encoding detection with chardet when the default conversion fails.
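The encoding fallback in point 2 can be as small as the sketch below. `decode_robust` is an illustrative helper, not a MarkItDown function: it tries UTF-8 first and falls back to Latin-1, which never raises; if chardet is installed, `chardet.detect(raw)["encoding"]` gives a better guess than Latin-1.

```python
def decode_robust(raw: bytes) -> str:
    """Decode document bytes as UTF-8, falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # With chardet installed, chardet.detect(raw)["encoding"] is a
        # better guess; Latin-1 is a last resort that always succeeds.
        return raw.decode("latin-1")
```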

  3. Post-process Markdown for consistency — MarkItDown's output varies by input format. Normalize heading levels, remove excessive blank lines, and standardize table formatting before feeding into downstream systems.
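The normalization in point 3 might look like this; `normalize_markdown` is a hypothetical helper illustrating the kinds of cleanup meant, not an exhaustive normalizer.

```python
import re


def normalize_markdown(text: str) -> str:
    """Apply light cleanup to converted Markdown."""
    # Collapse runs of 3+ newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Strip trailing whitespace from each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Ensure a space after heading markers ("##Title" -> "## Title")
    text = re.sub(r"^(#{1,6})([^#\s])", r"\1 \2", text, flags=re.MULTILINE)
    return text.strip() + "\n"
```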

  4. Chunk by semantic boundaries — When splitting converted text for LLM consumption, split on section headers or paragraph boundaries rather than fixed character counts. This preserves context within each chunk and improves retrieval quality.

  5. Keep the original file alongside Markdown — Store a reference to the source file path or hash in each Markdown output. This enables traceability when you find an issue in the converted text and need to check the original document.
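One way to record provenance, per point 5, is a small front-matter block carrying the source path and a content hash. `with_provenance` is an illustrative helper that takes the source file's raw bytes (read separately) so the hash is computed over the original document, not the converted text.

```python
import hashlib


def with_provenance(markdown_text: str, source_path: str,
                    source_bytes: bytes) -> str:
    """Prepend front matter recording the source file and its SHA-256."""
    digest = hashlib.sha256(source_bytes).hexdigest()
    header = f"---\nsource: {source_path}\nsha256: {digest}\n---\n\n"
    return header + markdown_text
```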

Common Issues

PDF conversion returns empty or garbled text — MarkItDown extracts text from text-based PDFs only. Scanned PDFs (images of pages) produce empty output. Check if the PDF is scanned by looking for embedded fonts; if none exist, use an OCR tool like Tesseract or Azure Document Intelligence first.
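A cheap complementary check is to inspect the conversion output itself: a text-based PDF should yield a meaningful amount of Markdown. The heuristic below (`looks_scanned` is an illustrative name, and the 50-character threshold is an assumption to tune) flags candidates for OCR routing.

```python
def looks_scanned(markdown_text: str, min_chars: int = 50) -> bool:
    """Heuristic: near-empty conversion output suggests a scanned PDF
    that should be routed to an OCR tool instead."""
    return len(markdown_text.strip()) < min_chars
```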

Table formatting is broken in Markdown output — Complex tables with merged cells, nested headers, or very wide columns don't convert cleanly to Markdown tables. Post-process by detecting malformed table rows (inconsistent pipe counts) and either reformatting or converting to plain text lists.
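Detecting malformed rows by inconsistent pipe counts, as suggested above, can be sketched like this (`malformed_table_rows` is a hypothetical helper; it compares each table row's pipe count against the first row of that table):

```python
def malformed_table_rows(markdown_text: str) -> list:
    """Return 1-based line numbers of table rows whose pipe count
    differs from the first row of their table."""
    bad = []
    expected = None
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("|"):
            count = stripped.count("|")
            if expected is None:
                expected = count  # first row sets the baseline
            elif count != expected:
                bad.append(lineno)
        else:
            expected = None  # non-table line ends the current table
    return bad
```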

Large files cause memory errors — Converting very large PowerPoint files (100+ slides) or Excel files (100k+ rows) can exhaust memory. Process large files in chunks: for Excel, read sheet by sheet; for PowerPoint, extract slides individually. Set a file size threshold (e.g., 50 MB) and switch to streaming mode above it.
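The size-threshold routing above reduces to a one-line check; `needs_streaming` and the 50 MB constant are illustrative, and the threshold should be tuned to your memory budget.

```python
# 50 MB threshold, per the guideline above; tune to your memory budget
STREAMING_THRESHOLD = 50 * 1024 * 1024


def needs_streaming(size_bytes: int) -> bool:
    """True when a file is large enough to warrant chunked processing
    (sheet-by-sheet for Excel, slide-by-slide for PowerPoint)."""
    return size_bytes > STREAMING_THRESHOLD
```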
