Web To Markdown System

Enterprise-grade skill for converting web pages to Markdown. Includes structured workflows, validation checks, and reusable patterns for development.

Skill · Cliptics · development · v1.0.0 · MIT

A utility skill for converting web pages into clean, well-formatted Markdown documents. Covers HTML-to-Markdown conversion, content extraction, table handling, image reference management, and batch processing of multiple pages.

When to Use This Skill

Choose this skill when:

  • Converting web documentation into Markdown for offline reference
  • Extracting article content from web pages without clutter
  • Creating Markdown copies of competitor landing pages for analysis
  • Building documentation from existing web content
  • Converting HTML emails or reports to Markdown format

Consider alternatives when:

  • Scraping structured data from websites → use a web scraping skill
  • Taking screenshots of web pages → use a screenshot skill
  • Building a web crawler → use a crawling/spider skill
  • Converting Markdown to HTML → use a static site generator

Quick Start

    # Using web2md CLI tool
    web2md https://example.com/article > article.md

    # Convert with options
    web2md https://example.com/docs --include-images --include-links > docs.md

    # Batch convert multiple pages
    cat urls.txt | xargs -I {} web2md {} --output ./output/

    // Programmatic HTML to Markdown conversion
    import TurndownService from 'turndown';
    import { gfm } from 'turndown-plugin-gfm';

    const turndown = new TurndownService({
      headingStyle: 'atx',        // # style headings
      codeBlockStyle: 'fenced',   // ``` style code blocks
      bulletListMarker: '-',
      emDelimiter: '*',
    });

    turndown.use(gfm); // tables, strikethrough, task lists

    // Custom rules for better conversion
    turndown.addRule('codeWithLanguage', {
      filter: (node) => node.nodeName === 'PRE' && node.querySelector('code'),
      replacement: (content, node) => {
        const code = node.querySelector('code');
        const lang = code?.className?.match(/language-(\w+)/)?.[1] || '';
        const text = code?.textContent || content;
        return `\n\`\`\`${lang}\n${text}\n\`\`\`\n`;
      },
    });

    const markdown = turndown.turndown(htmlContent);

Core Concepts

Conversion Pipeline

| Stage | Action | Tool |
| --- | --- | --- |
| Fetch | Download HTML from URL | fetch, puppeteer (for SPAs) |
| Extract | Remove nav, footer, ads | Readability, cheerio |
| Clean | Normalize whitespace, fix encoding | Custom sanitizer |
| Convert | HTML → Markdown | Turndown, unified/rehype |
| Format | Fix heading levels, add frontmatter | Custom post-processor |
| Save | Write to file with proper name | fs, path |
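
The Format and Save stages in the table can be sketched roughly as follows. This is only an illustration, not the skill's actual post-processor: `headingLevel`, `normalizeHeadings`, and `fileNameFromUrl` are hypothetical helper names, and reading "fix heading levels" as "shift headings so the shallowest one becomes level 1" is an assumption.

```typescript
// Hypothetical helper: return the ATX heading level (1-6) of a line, or 0 if it is not a heading.
function headingLevel(line: string): number {
  const hashes = line.length - line.replace(/^#+/, '').length;
  return hashes >= 1 && hashes <= 6 && /^\s/.test(line.slice(hashes)) ? hashes : 0;
}

// Format stage (assumed behavior): shift headings so the shallowest becomes "#".
// Lines inside fenced code blocks are never treated as headings.
function normalizeHeadings(markdown: string): string {
  const lines = markdown.split('\n');
  const levels: number[] = [];
  let inFence = false;
  for (const line of lines) {
    if (line.startsWith('```')) inFence = !inFence;
    levels.push(inFence ? 0 : headingLevel(line));
  }
  const used = levels.filter((level) => level > 0);
  if (used.length === 0) return markdown;
  const shift = Math.min(...used) - 1;
  return lines
    .map((line, i) => ((levels[i] ?? 0) > 0 ? line.slice(shift) : line))
    .join('\n');
}

// Save stage (assumed naming scheme): derive a filesystem-safe .md name from the URL.
function fileNameFromUrl(url: string): string {
  const { hostname, pathname } = new URL(url);
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/\.html?$/, '')
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return `${slug || 'index'}.md`;
}
```

Both helpers are pure string functions, so they can run after whichever converter produces the Markdown.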

Content Extraction

    import { Readability } from '@mozilla/readability';
    import { JSDOM } from 'jsdom';

    async function extractContent(url: string): Promise<{
      title: string;
      content: string;
      excerpt: string;
    }> {
      const response = await fetch(url);
      const html = await response.text();

      const dom = new JSDOM(html, { url });
      const reader = new Readability(dom.window.document);
      const article = reader.parse();

      if (!article) throw new Error('Could not extract content');

      return {
        title: article.title,
        content: article.content, // Clean HTML, no nav/footer/ads
        excerpt: article.excerpt,
      };
    }

Table Conversion

    // Convert HTML tables to Markdown tables
    turndown.addRule('tableCell', {
      filter: ['th', 'td'],
      replacement: (content) => {
        // Collapse newlines and escape pipes so cell text cannot split the row
        const text = content.trim().replace(/\n/g, ' ').replace(/\|/g, '\\|');
        return ` ${text} |`;
      },
    });

    turndown.addRule('tableRow', {
      filter: 'tr',
      replacement: (content, node) => {
        const row = `| ${content.trim()}\n`;
        const cells = node.querySelectorAll('th, td').length;
        const headerCells = node.querySelectorAll('th').length;
        // HTML parsers wrap bare <tr> elements in <tbody>, so also treat a row
        // made up entirely of <th> cells as the header row
        const isHeader =
          node.parentNode?.nodeName === 'THEAD' || (cells > 0 && headerCells === cells);
        // Add header separator after the header row
        return isHeader ? `${row}|${' --- |'.repeat(cells)}\n` : row;
      },
    });

    // Input: <table><tr><th>Name</th><th>Age</th></tr><tr><td>Alice</td><td>30</td></tr></table>
    // Output:
    // | Name | Age |
    // | --- | --- |
    // | Alice | 30 |

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| headingStyle | string | 'atx' | Heading style: atx (#) or setext (underline) |
| codeBlockStyle | string | 'fenced' | Code blocks: fenced (```) or indented |
| bulletListMarker | string | '-' | List marker: -, *, or + |
| includeImages | boolean | true | Include image references in output |
| includeLinks | boolean | true | Preserve hyperlinks in output |
| extractionMode | string | 'readability' | Content extraction: readability, full, or selector |
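
One way to carry these parameters through code is a typed options object merged over the defaults from the table. The names `ConvertOptions`, `DEFAULT_OPTIONS`, `resolveOptions`, and `toTurndownOptions` are hypothetical. The first three parameters match Turndown option names; the sketch assumes the remaining three are handled by the skill itself rather than passed to Turndown.

```typescript
// Hypothetical options type mirroring the configuration table.
interface ConvertOptions {
  headingStyle: 'atx' | 'setext';
  codeBlockStyle: 'fenced' | 'indented';
  bulletListMarker: '-' | '*' | '+';
  includeImages: boolean;
  includeLinks: boolean;
  extractionMode: 'readability' | 'full' | 'selector';
}

// Defaults copied from the table.
const DEFAULT_OPTIONS: ConvertOptions = {
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-',
  includeImages: true,
  includeLinks: true,
  extractionMode: 'readability',
};

// Merge caller overrides over the defaults.
function resolveOptions(overrides: Partial<ConvertOptions> = {}): ConvertOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Pick only the fields that Turndown itself understands.
function toTurndownOptions(options: ConvertOptions) {
  const { headingStyle, codeBlockStyle, bulletListMarker } = options;
  return { headingStyle, codeBlockStyle, bulletListMarker };
}
```

The result of `toTurndownOptions(resolveOptions({ headingStyle: 'setext' }))` could then be passed to `new TurndownService(...)`.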

Best Practices

  1. Use Readability for article extraction before conversion — Raw HTML-to-Markdown produces noisy output with navigation, footers, and ads. Readability extracts the main content, producing clean Markdown that focuses on the actual article.

  2. Preserve code block language annotations — When converting <pre><code class="language-python">, extract the language identifier and include it in the Markdown fence: ```python. This enables syntax highlighting in Markdown renderers.

  3. Handle relative URLs by resolving against the base URL — Images and links in HTML often use relative paths. Convert them to absolute URLs during conversion so the Markdown document works independently of the source site.

  4. Sanitize converted output for Markdown-specific characters — Pipes in table cells, brackets in text, and angle brackets can break Markdown rendering. Escape these characters during conversion while preserving intentional Markdown formatting.

  5. Add frontmatter with source URL and conversion date — Include metadata at the top of the converted Markdown: source URL, page title, conversion date, and author (if available). This provides attribution and enables reconversion.
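
Practices 3 to 5 can be sketched as small post-processing helpers. This is a minimal sketch under stated assumptions: `resolveUrl`, `escapeTableCell`, `SourceMeta`, and `buildFrontmatter` are hypothetical names, and the frontmatter keys (`title`, `source`, `converted`, `author`) are one possible choice rather than a fixed schema.

```typescript
// Practice 3: resolve a possibly relative href/src against the page URL.
function resolveUrl(ref: string, baseUrl: string): string {
  try {
    return new URL(ref, baseUrl).href;
  } catch {
    return ref; // leave references that cannot be parsed untouched
  }
}

// Practice 4: keep cell text from breaking a Markdown table row.
function escapeTableCell(text: string): string {
  return text.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
}

// Practice 5: metadata recorded at the top of the converted file (assumed field names).
interface SourceMeta {
  url: string;
  title: string;
  convertedAt: Date;
  author?: string;
}

function buildFrontmatter(meta: SourceMeta): string {
  // JSON string literals are also valid YAML double-quoted scalars.
  const quote = (value: string) => JSON.stringify(value);
  const lines = [
    '---',
    `title: ${quote(meta.title)}`,
    `source: ${quote(meta.url)}`,
    `converted: ${meta.convertedAt.toISOString().slice(0, 10)}`,
  ];
  if (meta.author) lines.push(`author: ${quote(meta.author)}`);
  lines.push('---', '');
  return lines.join('\n');
}
```

`resolveUrl` relies on the standard `URL` constructor, so `../img/a.png` resolved against `https://example.com/docs/guide/` becomes `https://example.com/docs/img/a.png`.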

Common Issues

JavaScript-rendered content missing from conversion — Fetch retrieves the initial HTML, which for SPAs contains only a loading skeleton. Use Puppeteer or Playwright to render the page fully before extracting HTML, or look for API endpoints that serve the content as JSON.
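
A rough way to decide when to fall back to a headless browser is to check how much visible text the fetched HTML actually contains. The function name `looksLikeSpaShell` and the 200-character threshold are assumptions for illustration, not part of the skill; the tag stripping is deliberately crude and only meant as a heuristic.

```typescript
// Hypothetical heuristic: true when the fetched HTML has almost no visible text,
// which suggests an unrendered SPA shell rather than real content.
function looksLikeSpaShell(html: string, minTextLength = 200): boolean {
  const visibleText = html
    // Drop elements whose text is never rendered as page content.
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, ' ')
    // Strip the remaining tags, keeping only text nodes.
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return visibleText.length < minTextLength;
}
```

When the check returns true, the page is a candidate for rendering with Puppeteer or Playwright before extraction.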

Tables with merged cells produce broken Markdown — Markdown tables don't support colspan or rowspan. Flatten merged cells by duplicating content or convert complex tables to formatted lists instead of tables.
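
The "duplicate content" approach can be sketched on a simplified table model. The `Cell` shape and the function names `flattenSpans` and `toMarkdownTable` are hypothetical, and the sketch assumes the cells have already been read out of the HTML with their `colspan` and `rowspan` values.

```typescript
// Hypothetical, simplified cell model extracted from an HTML table.
interface Cell {
  text: string;
  colspan?: number;
  rowspan?: number;
}

// Expand merged cells into a rectangular grid by repeating their text.
function flattenSpans(rows: Cell[][]): string[][] {
  const grid: string[][] = [];
  rows.forEach((row, r) => {
    if (!grid[r]) grid[r] = [];
    let c = 0;
    for (const cell of row) {
      // Skip slots already filled by a rowspan from an earlier row.
      while (grid[r]![c] !== undefined) c++;
      const colspan = cell.colspan ?? 1;
      const rowspan = cell.rowspan ?? 1;
      for (let dr = 0; dr < rowspan; dr++) {
        if (!grid[r + dr]) grid[r + dr] = [];
        for (let dc = 0; dc < colspan; dc++) grid[r + dr]![c + dc] = cell.text;
      }
      c += colspan;
    }
  });
  return grid;
}

// Render the grid as a Markdown table, treating the first row as the header.
function toMarkdownTable(grid: string[][]): string {
  const width = Math.max(0, ...grid.map((row) => row.length));
  const line = (row: string[]) =>
    '| ' +
    Array.from({ length: width }, (_, i) => (row[i] ?? '').replace(/\|/g, '\\|')).join(' | ') +
    ' |';
  const [header = [], ...body] = grid;
  const separator = '|' + ' --- |'.repeat(width);
  return [line(header), separator, ...body.map(line)].join('\n');
}
```

Duplicating text keeps every row the same width, at the cost of repeated values; for heavily merged tables a formatted list may still read better.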

Encoding issues with special characters — Non-UTF-8 pages or pages with HTML entities produce garbled text. Detect encoding from the Content-Type header or meta charset tag. Decode HTML entities before Markdown conversion.
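
Charset detection in that order (header first, then meta tag) might look like the sketch below. `detectCharset` and `decodeHtml` are hypothetical names, and falling back to UTF-8 when nothing is declared or the label is unsupported is an assumption. Entity decoding is not shown; a DOM parser such as JSDOM resolves entities while parsing.

```typescript
// Pick a charset label from the Content-Type header, then from a <meta> tag, else UTF-8.
function detectCharset(contentType: string | null, htmlHead: string): string {
  const fromHeader = /charset\s*=\s*"?([\w-]+)/i.exec(contentType ?? '');
  const headerLabel = fromHeader?.[1];
  if (headerLabel) return headerLabel.toLowerCase();

  // Matches both <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(htmlHead);
  const metaLabel = fromMeta?.[1];
  if (metaLabel) return metaLabel.toLowerCase();

  return 'utf-8';
}

// Decode raw response bytes using the detected charset.
function decodeHtml(bytes: Uint8Array, contentType: string | null): string {
  // Peek at the first bytes as UTF-8: the ASCII <meta> tag survives even if the
  // rest of the page is in another ASCII-compatible encoding.
  const head = new TextDecoder('utf-8').decode(bytes.subarray(0, 1024));
  const charset = detectCharset(contentType, head);
  try {
    return new TextDecoder(charset).decode(bytes);
  } catch {
    // Unknown or unsupported label: fall back to UTF-8 (assumed policy).
    return new TextDecoder('utf-8').decode(bytes);
  }
}
```

With `fetch`, the bytes would come from `new Uint8Array(await response.arrayBuffer())` and the header from `response.headers.get('content-type')`, instead of calling `response.text()`, which always decodes as UTF-8.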
