Web to Markdown System

A utility skill for converting web pages into clean, well-formatted Markdown documents. Covers HTML-to-Markdown conversion, content extraction, table handling, image reference management, and batch processing of multiple pages.

When to Use This Skill

Choose this skill when:

Converting web documentation into Markdown for offline reference
Extracting article content from web pages without clutter
Creating Markdown copies of competitor landing pages for analysis
Building documentation from existing web content
Converting HTML emails or reports to Markdown format

Consider alternatives when:

Scraping structured data from websites → use a web scraping skill
Taking screenshots of web pages → use a screenshot skill
Building a web crawler → use a crawling/spider skill
Converting Markdown to HTML → use a static site generator

Quick Start


# Using web2md CLI tool
web2md https://example.com/article > article.md

# Convert with options
web2md https://example.com/docs --include-images --include-links > docs.md

# Batch convert multiple pages
cat urls.txt | xargs -I {} web2md {} --output ./output/


// Programmatic HTML to Markdown conversion
import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

const turndown = new TurndownService({
  headingStyle: 'atx',       // # style headings
  codeBlockStyle: 'fenced',  // ``` style code blocks
  bulletListMarker: '-',
  emDelimiter: '*',
});

turndown.use(gfm); // tables, strikethrough, task lists

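// Custom rule: keep the language annotation on fenced code blocks
turndown.addRule('codeWithLanguage', {
  // Return a strict boolean; querySelector alone yields an element or null
  filter: (node) => node.nodeName === 'PRE' && node.querySelector('code') !== null,
  replacement: (content, node) => {
    const code = node.querySelector('code');
    // \S+ also captures identifiers such as c++ or objective-c
    const lang = code?.className.match(/language-(\S+)/)?.[1] ?? '';
    // Drop the trailing newline so no blank line precedes the closing fence
    const text = (code?.textContent ?? content).replace(/\n$/, '');
    return `\n\n\`\`\`${lang}\n${text}\n\`\`\`\n\n`;
  },
});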

const markdown = turndown.turndown(htmlContent);

Core Concepts

Conversion Pipeline

| Stage | Action | Tool |
| --- | --- | --- |
| Fetch | Download HTML from URL | fetch, puppeteer (for SPAs) |
| Extract | Remove nav, footer, ads | Readability, cheerio |
| Clean | Normalize whitespace, fix encoding | Custom sanitizer |
| Convert | HTML → Markdown | Turndown, unified/rehype |
| Format | Fix heading levels, add frontmatter | Custom post-processor |
| Save | Write to file with proper name | fs, path |
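The Format stage is described only as a "custom post-processor", so here is a minimal sketch of what its heading fix might look like. The function name `normalizeHeadings` and the policy of promoting the shallowest heading to level 1 are assumptions for illustration, not part of the skill itself.

```typescript
// Hypothetical Format-stage helper: shift ATX headings so the shallowest
// heading in the document becomes level 1. Fenced code is left untouched.
function normalizeHeadings(markdown: string): string {
  const lines = markdown.split('\n');
  const headingPattern = /^(#{1,6})(\s+\S.*)$/;
  const fencePattern = /^\s*```/;
  let inFence = false;
  let minLevel = 7;

  // First pass: find the shallowest heading level outside code fences.
  for (const line of lines) {
    if (fencePattern.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const match = headingPattern.exec(line);
    if (match) minLevel = Math.min(minLevel, match[1].length);
  }
  if (minLevel === 7 || minLevel === 1) return markdown; // nothing to shift

  // Second pass: remove the surplus '#' characters from every heading.
  const shift = minLevel - 1;
  inFence = false;
  return lines
    .map((line) => {
      if (fencePattern.test(line)) {
        inFence = !inFence;
        return line;
      }
      if (inFence) return line;
      const match = headingPattern.exec(line);
      return match ? '#'.repeat(match[1].length - shift) + match[2] : line;
    })
    .join('\n');
}
```

Readability often returns article HTML whose top heading is an h2, so running a step like this after conversion keeps the output outline starting at a single `#`.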

Content Extraction


import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractContent(url: string): Promise<{
  title: string;
  content: string;
  excerpt: string;
}> {
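  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Fetch failed for ${url}: HTTP ${response.status}`);
  }
  const html = await response.text();
  // Passing the URL lets JSDOM resolve relative links against the page address
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();

  if (!article) throw new Error('Could not extract content');

  return {
    title: article.title ?? '',
    content: article.content ?? '',  // Clean HTML, no nav/footer/ads
    excerpt: article.excerpt ?? '',
  };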
}

Table Conversion


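// Convert HTML tables to Markdown tables.
// The gfm plugin already handles simple tables; add these rules after
// turndown.use(gfm) so they take precedence (Turndown checks the most
// recently added rules first).
turndown.addRule('tableCell', {
  filter: ['th', 'td'],
  replacement: (content) => {
    // Collapse newlines and escape pipes so cell text cannot break the row
    return ` ${content.trim().replace(/\n+/g, ' ').replace(/\|/g, '\\|')} |`;
  },
});

turndown.addRule('tableRow', {
  filter: 'tr',
  replacement: (content, node) => {
    const row = `|${content.trimEnd()}\n`;
    // A header row sits in <thead> or consists only of <th> cells
    const cells = Array.from(node.children);
    const isHeader =
      node.parentNode?.nodeName === 'THEAD' ||
      (cells.length > 0 && cells.every((cell) => cell.nodeName === 'TH'));
    if (isHeader) {
      // Add the header separator after the header row
      const separator = '|' + ' --- |'.repeat(cells.length);
      return `${row}${separator}\n`;
    }
    return row;
  },
});

// Input: <table><tr><th>Name</th><th>Age</th></tr><tr><td>Alice</td><td>30</td></tr></table>
// Output:
// | Name | Age |
// | --- | --- |
// | Alice | 30 |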

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `headingStyle` | string | `'atx'` | Heading style: atx (#) or setext (underline) |
| `codeBlockStyle` | string | `'fenced'` | Code blocks: fenced (```) or indented |
| `bulletListMarker` | string | `'-'` | List marker: -, *, or + |
| `includeImages` | boolean | `true` | Include image references in output |
| `includeLinks` | boolean | `true` | Preserve hyperlinks in output |
| `extractionMode` | string | `'readability'` | Content extraction: readability, full, or selector |
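One way to carry these parameters in code is a typed options object with the defaults from the table. This is a sketch only: the names `ConvertOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical, and the skill does not publish a schema.

```typescript
// Hypothetical option shape mirroring the configuration table above.
interface ConvertOptions {
  headingStyle: 'atx' | 'setext';
  codeBlockStyle: 'fenced' | 'indented';
  bulletListMarker: '-' | '*' | '+';
  includeImages: boolean;
  includeLinks: boolean;
  extractionMode: 'readability' | 'full' | 'selector';
}

// Defaults copied from the "Default" column of the table.
const DEFAULT_OPTIONS: ConvertOptions = {
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-',
  includeImages: true,
  includeLinks: true,
  extractionMode: 'readability',
};

// Overlay caller-supplied overrides on the defaults without mutating either object.
function resolveOptions(overrides: Partial<ConvertOptions> = {}): ConvertOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Example: keep everything at its default except images and extraction mode.
const exampleOptions = resolveOptions({ includeImages: false, extractionMode: 'full' });
```

The first three fields match Turndown constructor options and could be passed straight through; the remaining three would have to be interpreted by the surrounding pipeline.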

Best Practices

Use Readability for article extraction before conversion — Raw HTML-to-Markdown produces noisy output with navigation, footers, and ads. Readability extracts the main content, producing clean Markdown that focuses on the actual article.
Preserve code block language annotations — When converting <pre><code class="language-python">, extract the language identifier and include it in the Markdown fence: ```python. This enables syntax highlighting in Markdown renderers.
Handle relative URLs by resolving against the base URL — Images and links in HTML often use relative paths. Convert them to absolute URLs during conversion so the Markdown document works independently of the source site.
Sanitize converted output for Markdown-specific characters — Pipes in table cells, brackets in text, and angle brackets can break Markdown rendering. Escape these characters during conversion while preserving intentional Markdown formatting.
Add frontmatter with source URL and conversion date — Include metadata at the top of the converted Markdown: source URL, page title, conversion date, and author (if available). This provides attribution and enables reconversion.
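Two of the practices above, resolving relative URLs and adding frontmatter, can be sketched as small post-processing helpers. The names `absolutizeLinks`, `withFrontmatter`, and `SourceMeta`, and the frontmatter field names, are assumptions chosen for illustration. The link helper covers only inline links and images, and it does not skip code blocks.

```typescript
// Resolve relative targets in inline Markdown links and images against the page URL.
// Reference-style links and targets inside code blocks are out of scope for this sketch.
function absolutizeLinks(markdown: string, baseUrl: string): string {
  // Matches the "[text](" or "![alt](" prefix followed by the link target.
  return markdown.replace(
    /(!?\[[^\]]*\]\()([^)\s]+)/g,
    (whole: string, prefix: string, target: string) => {
      try {
        return prefix + new URL(target, baseUrl).href;
      } catch {
        return whole; // leave unparseable targets untouched
      }
    },
  );
}

// Hypothetical metadata shape for the frontmatter block.
interface SourceMeta {
  title: string;
  sourceUrl: string;
  convertedAt: Date;
  author?: string;
}

// Prepend a YAML frontmatter block with attribution details.
function withFrontmatter(markdown: string, meta: SourceMeta): string {
  const fields: Array<[string, string]> = [
    ['title', meta.title],
    ['source', meta.sourceUrl],
    ['converted', meta.convertedAt.toISOString().slice(0, 10)], // UTC date, YYYY-MM-DD
  ];
  if (meta.author) fields.push(['author', meta.author]);
  // JSON.stringify produces a double-quoted string, which YAML also accepts.
  const yaml = fields.map(([key, value]) => `${key}: ${JSON.stringify(value)}`).join('\n');
  return `---\n${yaml}\n---\n\n${markdown}`;
}
```

Quoting every value through `JSON.stringify` avoids YAML surprises when a title contains a colon or quotation marks.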

Common Issues

JavaScript-rendered content missing from conversion — Fetch retrieves the initial HTML, which for SPAs contains only a loading skeleton. Use Puppeteer or Playwright to render the page fully before extracting HTML, or look for API endpoints that serve the content as JSON.

Tables with merged cells produce broken Markdown — Markdown tables don't support colspan or rowspan. Flatten merged cells by duplicating content or convert complex tables to formatted lists instead of tables.
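The "duplicate the content" option can be sketched as a grid expansion that runs before any Markdown is emitted. The `SpanCell` shape and both function names are hypothetical. The sketch assumes the table has already been parsed into rows of cells with their span attributes.

```typescript
// A parsed cell: its text plus the span attributes read from the HTML.
interface SpanCell {
  text: string;
  colspan?: number;
  rowspan?: number;
}

// Expand colspan/rowspan into a rectangular grid by repeating the cell text.
function flattenSpans(rows: SpanCell[][]): string[][] {
  const grid: string[][] = rows.map(() => []);
  rows.forEach((row, r) => {
    let col = 0;
    for (const cell of row) {
      // Skip columns already filled by a rowspan from an earlier row.
      while (grid[r][col] !== undefined) col++;
      const width = cell.colspan ?? 1;
      const height = cell.rowspan ?? 1;
      for (let dr = 0; dr < height && r + dr < rows.length; dr++) {
        for (let dc = 0; dc < width; dc++) {
          grid[r + dr][col + dc] = cell.text;
        }
      }
      col += width;
    }
  });
  return grid;
}

// Render the grid as a Markdown table, treating the first row as the header.
function gridToMarkdown(grid: string[][]): string {
  if (grid.length === 0) return '';
  const line = (cells: string[]) => `| ${cells.join(' | ')} |`;
  const [header, ...body] = grid;
  const separator = line(header.map(() => '---'));
  return [line(header), separator, ...body.map(line)].join('\n');
}
```

Duplicating text keeps every row the same width, which Markdown tables require. For deeply nested headers, the formatted-list fallback mentioned above may still read better.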

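Encoding issues with special characters: Pages served in a non-UTF-8 encoding come out garbled when their bytes are decoded as UTF-8, which is what response.text() always does. Detect the encoding from the Content-Type header or the meta charset tag and decode the raw bytes accordingly. HTML entities are normally decoded by the HTML parser; if a step works on raw strings instead of a parsed DOM, decode entities before Markdown conversion.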
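A minimal sketch of charset detection, using only the standard `TextDecoder`. The helper names are hypothetical, and the 1024-byte sniffing window is an assumption. It relies on the runtime supporting legacy encodings, which Node does when built with full ICU.

```typescript
// Pull the charset label out of a Content-Type header value, if present.
function charsetFromContentType(contentType: string | null): string | null {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType ?? '');
  return match ? match[1].toLowerCase() : null;
}

// Look for <meta charset="..."> or the http-equiv form near the top of the page.
function charsetFromMeta(bytes: Uint8Array): string | null {
  // Latin-1 decoding never fails and keeps ASCII markup readable for sniffing.
  const head = new TextDecoder('latin1').decode(bytes.subarray(0, 1024));
  const match = /<meta[^>]+charset\s*=\s*["']?([^"'\s/>]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Decode raw page bytes: header first, then meta tag, then UTF-8 as the default.
function decodeHtml(bytes: Uint8Array, contentType: string | null): string {
  const label = charsetFromContentType(contentType) ?? charsetFromMeta(bytes) ?? 'utf-8';
  try {
    return new TextDecoder(label).decode(bytes);
  } catch {
    // TextDecoder throws a RangeError for unknown labels; fall back to UTF-8.
    return new TextDecoder('utf-8').decode(bytes);
  }
}
```

With `fetch`, the raw bytes would come from `new Uint8Array(await response.arrayBuffer())` and the header from `response.headers.get('content-type')`, in place of `response.text()`.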
