Advanced Article Text Extractor Toolkit

All-in-one skill for extracting full article text and metadata from web pages. Built for Claude Code with best practices and real-world patterns.


Article Text Extractor Toolkit

Intelligent web article extraction engine that pulls clean, structured content from web pages by removing ads, navigation, sidebars, and boilerplate to isolate the main article text, images, and metadata.

When to Use This Skill

Choose Article Text Extractor when:

  • Extracting clean article content from cluttered web pages
  • Building content aggregation or curation pipelines
  • Creating read-later or offline reading archives
  • Feeding article content into summarization or analysis tools
  • Scraping research papers and blog posts for data collection

Consider alternatives when:

  • Need full page rendering with JavaScript — use headless browsers
  • Need structured data extraction — use schema.org parsers
  • Scraping APIs — use API clients directly

Quick Start

```bash
# Activate article extractor
claude skill activate advanced-article-text-extractor-toolkit

# Extract article from URL
claude "Extract the main article content from https://example.com/blog/post"

# Batch extract
claude "Extract articles from all URLs in urls.txt and save as markdown"
```

Example Extraction

```typescript
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractArticle(url: string) {
  const response = await fetch(url);
  const html = await response.text();
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();
  if (!article) throw new Error('Could not extract article');

  return {
    title: article.title,
    byline: article.byline,
    content: article.textContent,   // Plain text
    htmlContent: article.content,   // HTML with formatting
    excerpt: article.excerpt,
    siteName: article.siteName,
    length: article.length,         // Character count
    publishedTime: extractPublishDate(dom.window.document),
  };
}

function extractPublishDate(doc: Document): string | null {
  const selectors = [
    'meta[property="article:published_time"]',
    'meta[name="date"]',
    'meta[name="publish-date"]',
    'time[datetime]',
    '.post-date',
    '.published-date',
  ];
  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el) {
      return (
        el.getAttribute('content') ||
        el.getAttribute('datetime') ||
        el.textContent?.trim() ||
        null
      );
    }
  }
  return null;
}
```

Core Concepts

Extraction Methods

| Method | Description | Best For |
|---|---|---|
| Readability (Mozilla) | DOM-based content scoring and extraction | News articles, blog posts |
| Trafilatura | Python-based with fallback strategies | Academic content, diverse sites |
| Newspaper3k | Article-specific extraction with NLP | News sites specifically |
| Custom Selectors | CSS/XPath targeting specific elements | Known site structures |
| Headless Browser | Full JavaScript rendering then extraction | SPA content, dynamic pages |
| RSS/Atom Feeds | Structured content from feed endpoints | Blogs with active feeds |
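The engines above can be chained generically: try each one in order and keep the first result that passes a minimal quality gate. This is a sketch, not the skill's actual API; the stand-in engine callables and the `min_length` threshold are illustrative assumptions.

```python
from typing import Callable, Optional

def extract_with_fallback(
    html: str,
    engines: list[Callable[[str], Optional[str]]],
    min_length: int = 200,
) -> Optional[str]:
    """Try each extraction engine in order; accept the first result
    that passes a minimal length check."""
    for engine in engines:
        try:
            text = engine(html)
        except Exception:
            continue  # this engine crashed on the page; try the next one
        if text and len(text) >= min_length:
            return text
    return None

# Stand-in engines for illustration (real ones would wrap Readability,
# trafilatura, Newspaper3k, etc.):
flaky = lambda html: None          # simulates a failed extraction
naive = lambda html: html.strip()  # simulates a successful one
result = extract_with_fallback("x" * 500, [flaky, naive])
```

In a real pipeline the callables would wrap the libraries from the table, ordered from fastest/most general to most site-specific.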

Extracted Content Fields

| Field | Source | Reliability |
|---|---|---|
| Title | `<title>`, og:title, `<h1>` | High |
| Author | author meta, byline elements | Medium |
| Published Date | article:published_time, `<time>` | Medium |
| Content | Main article body (cleaned HTML) | High |
| Excerpt | description meta, first paragraph | High |
| Featured Image | og:image, first content image | Medium |
| Site Name | og:site_name, domain | High |
| Language | lang attribute, Content-Language | High |
| Word Count | Calculated from extracted text | High |
```python
# Python extraction with trafilatura
import json

import trafilatura
from slugify import slugify  # python-slugify package

def extract_article(url: str) -> dict | None:
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        raise ValueError(f"Could not fetch {url}")

    # Extract with metadata
    result = trafilatura.extract(
        downloaded,
        include_comments=False,
        include_tables=True,
        include_images=True,
        include_links=True,
        output_format='json',
        with_metadata=True,
    )
    return json.loads(result) if result else None

# Batch extraction
def extract_batch(urls: list[str], output_dir: str):
    for url in urls:
        article = extract_article(url)
        if article:
            slug = slugify(article['title'])
            with open(f"{output_dir}/{slug}.json", 'w') as f:
                json.dump(article, f, indent=2)
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| engine | Extraction engine: readability, trafilatura, newspaper | readability |
| output_format | Output: markdown, html, json, text | markdown |
| include_images | Include image URLs in output | true |
| include_links | Preserve hyperlinks in content | true |
| include_metadata | Extract author, date, site info | true |
| min_content_length | Minimum characters for valid extraction | 200 |
| timeout | Request timeout in seconds | 30 |
| user_agent | Custom User-Agent header | Standard browser UA |
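These defaults can be captured in a small config object. The class and field names below simply mirror the table; they are an illustrative sketch, not a published schema of the skill.

```python
from dataclasses import dataclass

@dataclass
class ExtractorConfig:
    """Defaults mirror the configuration table above."""
    engine: str = "readability"      # readability | trafilatura | newspaper
    output_format: str = "markdown"  # markdown | html | json | text
    include_images: bool = True
    include_links: bool = True
    include_metadata: bool = True
    min_content_length: int = 200    # characters
    timeout: int = 30                # seconds
    user_agent: str = "Mozilla/5.0 (compatible; ArticleExtractor/1.0)"

# Override only what differs from the defaults
config = ExtractorConfig(engine="trafilatura", timeout=10)
```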

Best Practices

  1. Use multiple extraction engines with fallback — Start with Readability for general articles, fall back to Trafilatura for complex layouts, and use custom selectors for known problematic sites. No single engine handles all web page structures correctly.

  2. Validate extraction quality programmatically — Check that extracted content exceeds a minimum length, contains expected structural elements (paragraphs, headings), and doesn't include navigation or footer text. Automated quality checks catch extraction failures early.

  3. Respect robots.txt and rate limit requests — Check robots.txt before scraping, add delays between requests (1-2 seconds minimum), and use proper User-Agent identification. Aggressive scraping gets IP addresses blocked and may violate terms of service.

  4. Cache extracted content to avoid redundant fetching — Web pages change infrequently. Cache extracted articles with the URL as key and a 24-hour TTL for most content, longer for archived articles. This reduces load on target servers and speeds up your pipeline.

  5. Handle JavaScript-rendered content with headless browsers — Modern SPAs render content via JavaScript that basic HTTP fetching misses. Use Playwright or Puppeteer for sites that return empty content via simple fetch, but only when necessary as headless browsers are 10x slower.
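Practice 2 above can be made concrete with a small heuristic gate on extracted text. The length threshold, paragraph count, and boilerplate markers here are assumptions to tune per site, not fixed rules.

```python
BOILERPLATE_MARKERS = (
    "cookie policy",
    "subscribe to our newsletter",
    "all rights reserved",
)

def looks_like_article(text: str, min_length: int = 200) -> bool:
    """Heuristic quality check for extracted article text."""
    if len(text) < min_length:
        return False
    # Real articles have multiple non-empty paragraphs
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if len(paragraphs) < 2:
        return False
    # Reject output dominated by footer/navigation boilerplate
    lowered = text.lower()
    hits = sum(marker in lowered for marker in BOILERPLATE_MARKERS)
    return hits < 2
```

Failing extractions can then be routed back into the engine-fallback chain instead of silently producing junk.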

Common Issues

Extraction returns boilerplate (navigation, sidebar, footer) mixed with article content. The extraction engine misjudged content boundaries. Try a different engine, or use site-specific CSS selectors to target the article container. For recurring sites, create extraction profiles with the exact selector for the main content area.
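One way to implement the per-site extraction profiles mentioned above is a domain-to-selector lookup; the domains and selectors in this sketch are made-up examples.

```python
from urllib.parse import urlparse

# Hypothetical per-site profiles: domain -> CSS selector for the article body
EXTRACTION_PROFILES = {
    "example.com": "article.post-content",
    "blog.example.org": "div#main-article",
}

def selector_for(url: str, default: str = "article") -> str:
    """Look up a site-specific content selector, falling back to a
    generic <article> selector."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return EXTRACTION_PROFILES.get(domain, default)
```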

Images are missing or have relative URLs that don't resolve. Convert relative image URLs to absolute using the article's base URL. Some sites use lazy loading with data-src instead of src attributes — check alternate attributes and decode any base64 placeholder images.

Paywall or login-required content returns truncated articles. Respect paywalls — extracting paywalled content may violate terms of service. For legitimate access, use authenticated sessions with cookies or API keys. Some sites offer full content via RSS feeds or AMP pages even when the main page is paywalled.
