Web To Markdown System
A utility skill for converting web pages into clean, well-formatted Markdown documents. Covers HTML-to-Markdown conversion, content extraction, table handling, image reference management, and batch processing of multiple pages.
When to Use This Skill
Choose this skill when:
- Converting web documentation into Markdown for offline reference
- Extracting article content from web pages without clutter
- Creating Markdown copies of competitor landing pages for analysis
- Building documentation from existing web content
- Converting HTML emails or reports to Markdown format
Consider alternatives when:
- Scraping structured data from websites → use a web scraping skill
- Taking screenshots of web pages → use a screenshot skill
- Building a web crawler → use a crawling/spider skill
- Converting Markdown to HTML → use a static site generator
Quick Start
```bash
# Using web2md CLI tool
web2md https://example.com/article > article.md

# Convert with options
web2md https://example.com/docs --include-images --include-links > docs.md

# Batch convert multiple pages
cat urls.txt | xargs -I {} web2md {} --output ./output/
```
```javascript
// Programmatic HTML to Markdown conversion
import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

const turndown = new TurndownService({
  headingStyle: 'atx',       // # style headings
  codeBlockStyle: 'fenced',  // ``` style code blocks
  bulletListMarker: '-',
  emDelimiter: '*',
});

turndown.use(gfm); // tables, strikethrough, task lists

// Custom rule: carry the language class of <pre><code> onto the fence
turndown.addRule('codeWithLanguage', {
  filter: (node) =>
    node.nodeName === 'PRE' && node.querySelector('code') !== null,
  replacement: (content, node) => {
    const code = node.querySelector('code');
    const lang = code?.className?.match(/language-(\w+)/)?.[1] || '';
    // Use textContent so the code is not Markdown-escaped
    const text = code?.textContent || content;
    return `\n\`\`\`${lang}\n${text}\n\`\`\`\n`;
  },
});

// htmlContent is the HTML string to convert
const markdown = turndown.turndown(htmlContent);
```
Core Concepts
Conversion Pipeline
| Stage | Action | Tool |
|---|---|---|
| Fetch | Download HTML from URL | fetch, puppeteer (for SPAs) |
| Extract | Remove nav, footer, ads | Readability, cheerio |
| Clean | Normalize whitespace, fix encoding | Custom sanitizer |
| Convert | HTML → Markdown | Turndown, unified/rehype |
| Format | Fix heading levels, add frontmatter | Custom post-processor |
| Save | Write to file with proper name | fs, path |
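
The Format stage is listed only as a custom post-processor, so here is a minimal sketch of one thing it might do: shift ATX headings so that the shallowest heading in the converted document becomes level 1. The function name `normalizeHeadingLevels` is hypothetical, and the sketch assumes fenced code blocks open and close with triple backticks.

```typescript
// Hypothetical post-processor for the "Format" stage: shift ATX headings so
// the shallowest heading in the document becomes level 1.
function normalizeHeadingLevels(markdown: string): string {
  const lines = markdown.split("\n");
  const headingPattern = /^(#{1,6})\s+\S/;
  let inFence = false;
  let minLevel = 7; // sentinel: deeper than any valid heading

  // First pass: find the shallowest heading, skipping fenced code blocks.
  for (const line of lines) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    if (inFence) continue;
    const match = headingPattern.exec(line);
    if (match) minLevel = Math.min(minLevel, match[1].length);
  }
  if (minLevel === 7 || minLevel === 1) return markdown; // nothing to shift

  // Second pass: drop the surplus leading '#' characters from each heading.
  const shift = minLevel - 1;
  inFence = false;
  return lines
    .map((line) => {
      if (line.trimStart().startsWith("```")) {
        inFence = !inFence;
        return line;
      }
      if (inFence) return line;
      return headingPattern.test(line) ? line.slice(shift) : line;
    })
    .join("\n");
}
```

Lines inside fenced blocks are left alone, so shell comments or Markdown samples that start with `#` are not mistaken for headings.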
Content Extraction
```typescript
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractContent(url: string): Promise<{
  title: string;
  content: string;
  excerpt: string;
}> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  const html = await response.text();

  // Passing the URL lets jsdom resolve relative links against the page
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();

  if (!article) throw new Error('Could not extract content');

  return {
    title: article.title ?? '',
    content: article.content ?? '', // Clean HTML, no nav/footer/ads
    excerpt: article.excerpt ?? '',
  };
}
```
Table Conversion
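```javascript
// Convert HTML tables to Markdown tables
turndown.addRule('tableCell', {
  filter: ['th', 'td'],
  // Keep each cell on one line; the row rule adds the leading pipe
  replacement: (content) => ` ${content.trim().replace(/\n/g, ' ')} |`,
});

turndown.addRule('tableRow', {
  filter: 'tr',
  replacement: (content, node) => {
    const row = `| ${content.trim()}`;
    const headerCells = node.querySelectorAll('th').length;
    const dataCells = node.querySelectorAll('td').length;

    // A header row sits inside <thead> or contains only <th> cells.
    // Parsers wrap a bare <tr> in <tbody>, so checking <thead> alone
    // would miss the example below.
    const isHeaderRow =
      node.parentNode?.nodeName === 'THEAD' ||
      (headerCells > 0 && dataCells === 0);

    if (isHeaderRow) {
      // Add the header separator after the header row
      const separator = '|' + ' --- |'.repeat(headerCells + dataCells);
      return `${row}\n${separator}\n`;
    }
    return `${row}\n`;
  },
});

// Input: <table><tr><th>Name</th><th>Age</th></tr><tr><td>Alice</td><td>30</td></tr></table>
// Output:
// | Name | Age |
// | --- | --- |
// | Alice | 30 |
```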
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| headingStyle | string | 'atx' | Heading style: atx (#) or setext (underline) |
| codeBlockStyle | string | 'fenced' | Code blocks: fenced (```) or indented |
| bulletListMarker | string | '-' | List marker: -, *, or + |
| includeImages | boolean | true | Include image references in output |
| includeLinks | boolean | true | Preserve hyperlinks in output |
| extractionMode | string | 'readability' | Content extraction: readability, full, or selector |
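
To make the defaults concrete, the parameters above could be modeled as a typed options object. This is only a sketch: the option names and default values come from the table, while the `ConversionOptions` type and the `resolveOptions` helper are hypothetical names introduced for illustration.

```typescript
// Option names and defaults mirror the Configuration table; the type and the
// resolveOptions helper are illustrative, not part of any published API.
interface ConversionOptions {
  headingStyle: "atx" | "setext";
  codeBlockStyle: "fenced" | "indented";
  bulletListMarker: "-" | "*" | "+";
  includeImages: boolean;
  includeLinks: boolean;
  extractionMode: "readability" | "full" | "selector";
}

const DEFAULT_OPTIONS: ConversionOptions = {
  headingStyle: "atx",
  codeBlockStyle: "fenced",
  bulletListMarker: "-",
  includeImages: true,
  includeLinks: true,
  extractionMode: "readability",
};

// Merge caller overrides onto the defaults; unspecified keys keep their default.
function resolveOptions(overrides: Partial<ConversionOptions> = {}): ConversionOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Example: skip images and convert the full page instead of the extracted article.
const options = resolveOptions({ includeImages: false, extractionMode: "full" });
```

Using string-literal unions lets the compiler reject values outside the ones the table lists.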
Best Practices
- Use Readability for article extraction before conversion: Raw HTML-to-Markdown conversion produces noisy output with navigation, footers, and ads. Readability extracts the main content, producing clean Markdown that focuses on the actual article.
- Preserve code block language annotations: When converting `<pre><code class="language-python">`, extract the language identifier and include it in the opening Markdown fence (```python). This enables syntax highlighting in Markdown renderers.
- Handle relative URLs by resolving against the base URL: Images and links in HTML often use relative paths. Convert them to absolute URLs during conversion so the Markdown document works independently of the source site.
- Sanitize converted output for Markdown-specific characters: Pipes in table cells, brackets in text, and angle brackets can break Markdown rendering. Escape these characters during conversion while preserving intentional Markdown formatting.
- Add frontmatter with source URL and conversion date: Include metadata at the top of the converted Markdown: source URL, page title, conversion date, and author (if available). This provides attribution and enables reconversion.
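
Three of the practices above (resolving relative URLs, escaping pipes in table cells, and adding frontmatter) can be sketched with small helpers. The function names and the frontmatter field names (`title`, `source`, `converted`, `author`) are assumptions made for this example, not something the skill prescribes.

```typescript
// Resolve a possibly relative href/src against the URL of the source page.
function resolveUrl(reference: string, baseUrl: string): string {
  try {
    return new URL(reference, baseUrl).href;
  } catch {
    return reference; // leave unparseable references untouched
  }
}

// Escape pipes and collapse newlines so cell text cannot break a table row.
function escapeTableCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
}

// Build a YAML frontmatter block. Field names are illustrative assumptions.
function buildFrontmatter(
  meta: { title: string; source: string; author?: string },
  convertedAt: Date,
): string {
  const fields: Array<[string, string]> = [
    ["title", meta.title],
    ["source", meta.source],
    ["converted", convertedAt.toISOString().slice(0, 10)], // YYYY-MM-DD (UTC)
  ];
  if (meta.author) fields.push(["author", meta.author]);

  // JSON.stringify produces a double-quoted string, which YAML also accepts,
  // so colons and quotes inside titles stay safe.
  const body = fields
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`)
    .join("\n");
  return `---\n${body}\n---\n\n`;
}
```

For instance, `resolveUrl("../img/a.png", "https://example.com/docs/guide/intro.html")` returns `https://example.com/docs/img/a.png`, so the image reference keeps working outside the source site.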
Common Issues
JavaScript-rendered content missing from conversion — Fetch retrieves the initial HTML, which for SPAs contains only a loading skeleton. Use Puppeteer or Playwright to render the page fully before extracting HTML, or look for API endpoints that serve the content as JSON.
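
One way to decide when a plain fetch is not enough is to check how much visible text the fetched HTML actually contains. The sketch below is a hypothetical heuristic, not part of Puppeteer, Playwright, or Readability: the function name and the 200-character threshold are assumptions, and a real page may need a different cutoff.

```typescript
// Hypothetical heuristic: treat the response as a client-rendered shell when
// almost no visible text remains once scripts, styles, and tags are removed.
function looksClientRendered(html: string, minTextLength = 200): boolean {
  const visibleText = html
    // Drop script/style/noscript elements together with their contents.
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, " ")
    // Drop all remaining tags, keeping only text nodes.
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return visibleText.length < minTextLength;
}
```

When this returns `true`, the page is a candidate for rendering in a headless browser before extraction; when it returns `false`, the fetched HTML probably already contains the article.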
Tables with merged cells produce broken Markdown — Markdown tables don't support colspan or rowspan. Flatten merged cells by duplicating content or convert complex tables to formatted lists instead of tables.
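
Flattening by duplication can be sketched as a grid expansion: every merged cell is copied into each position it covers, which yields a rectangular grid that a Markdown table can represent. The `SourceCell` shape and the `flattenMergedCells` name are assumptions for this example; the cells are presumed to have been read from the DOM beforehand.

```typescript
// Assumed input shape: one entry per <td>/<th>, in document order per row.
interface SourceCell {
  text: string;
  colspan?: number;
  rowspan?: number;
}

// Expand colspan/rowspan by copying the cell text into every position it covers.
function flattenMergedCells(rows: SourceCell[][]): string[][] {
  const grid: string[][] = rows.map(() => []);

  rows.forEach((row, r) => {
    let c = 0;
    for (const cell of row) {
      // Skip positions already filled by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;
      const colspan = cell.colspan ?? 1;
      const rowspan = cell.rowspan ?? 1;
      for (let dr = 0; dr < rowspan && r + dr < grid.length; dr++) {
        for (let dc = 0; dc < colspan; dc++) {
          grid[r + dr][c + dc] = cell.text;
        }
      }
      c += colspan;
    }
  });

  // Pad short rows with empty cells so every row has the same width.
  const width = Math.max(0, ...grid.map((row) => row.length));
  return grid.map((row) =>
    Array.from({ length: width }, (_, i) => row[i] ?? ""),
  );
}
```

A header cell spanning two columns therefore appears twice in the first row, and a cell spanning two rows is repeated at the start of the second of them.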
Encoding issues with special characters — Non-UTF-8 pages or pages with HTML entities produce garbled text. Detect encoding from the Content-Type header or meta charset tag. Decode HTML entities before Markdown conversion.
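
The detection order described above (Content-Type header first, then the meta charset tag) might look like the following sketch. The helper names are hypothetical, UTF-8 is assumed as the fallback, and which non-UTF-8 labels `TextDecoder` accepts depends on the runtime.

```typescript
// Pick a charset label from the Content-Type header, then from a <meta> tag,
// falling back to UTF-8 (an assumed default).
function detectCharset(contentType: string | null, htmlPrefix: string): string {
  const fromHeader = contentType?.match(/charset\s*=\s*["']?([\w-]+)/i)?.[1];
  if (fromHeader) return fromHeader.toLowerCase();

  // Matches both <meta charset="..."> and the http-equiv/content form.
  const fromMeta = htmlPrefix.match(/<meta[^>]+charset\s*=\s*["']?([\w-]+)/i)?.[1];
  if (fromMeta) return fromMeta.toLowerCase();

  return "utf-8";
}

// Decode raw response bytes using the detected charset.
function decodeHtml(bytes: Uint8Array, contentType: string | null): string {
  // Charset declarations are ASCII, so a lossy UTF-8 read of the first bytes
  // is enough to find them.
  const prefix = new TextDecoder("utf-8").decode(bytes.slice(0, 1024));
  const charset = detectCharset(contentType, prefix);
  try {
    return new TextDecoder(charset).decode(bytes);
  } catch {
    // Label not supported by this runtime: fall back to UTF-8.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

To use it with `fetch`, read the body with `response.arrayBuffer()` instead of `response.text()` and pass `response.headers.get("content-type")` as the second argument, since `response.text()` always decodes as UTF-8.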