Advanced Article Text Extractor Toolkit
All-in-one skill for extracting full article text and metadata from web pages. Built for Claude Code with best practices and real-world patterns.
Intelligent web article extraction engine that pulls clean, structured content from web pages by removing ads, navigation, sidebars, and boilerplate to isolate the main article text, images, and metadata.
When to Use This Skill
Choose Article Text Extractor when:
- Extracting clean article content from cluttered web pages
- Building content aggregation or curation pipelines
- Creating read-later or offline reading archives
- Feeding article content into summarization or analysis tools
- Scraping research papers and blog posts for data collection
Consider alternatives when:
- Need full page rendering with JavaScript — use headless browsers
- Need structured data extraction — use schema.org parsers
- Scraping APIs — use API clients directly
Quick Start
```shell
# Activate article extractor
claude skill activate advanced-article-text-extractor-toolkit

# Extract article from URL
claude "Extract the main article content from https://example.com/blog/post"

# Batch extract
claude "Extract articles from all URLs in urls.txt and save as markdown"
```
Example Extraction
```typescript
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractArticle(url: string) {
  const response = await fetch(url);
  const html = await response.text();
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();

  if (!article) throw new Error('Could not extract article');

  return {
    title: article.title,
    byline: article.byline,
    content: article.textContent,   // Plain text
    htmlContent: article.content,   // HTML with formatting
    excerpt: article.excerpt,
    siteName: article.siteName,
    length: article.length,         // Character count
    publishedTime: extractPublishDate(dom.window.document),
  };
}

function extractPublishDate(doc: Document): string | null {
  const selectors = [
    'meta[property="article:published_time"]',
    'meta[name="date"]',
    'meta[name="publish-date"]',
    'time[datetime]',
    '.post-date',
    '.published-date',
  ];

  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el) {
      return (
        el.getAttribute('content') ||
        el.getAttribute('datetime') ||
        el.textContent?.trim() ||
        null
      );
    }
  }
  return null;
}
```
Core Concepts
Extraction Methods
| Method | Description | Best For |
|---|---|---|
| Readability (Mozilla) | DOM-based content scoring and extraction | News articles, blog posts |
| Trafilatura | Python-based with fallback strategies | Academic content, diverse sites |
| Newspaper3k | Article-specific extraction with NLP | News sites specifically |
| Custom Selectors | CSS/XPath targeting specific elements | Known site structures |
| Headless Browser | Full JavaScript rendering then extraction | SPA content, dynamic pages |
| RSS/Atom Feeds | Structured content from feed endpoints | Blogs with active feeds |
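For blogs with active feeds, the last method above can often skip HTML extraction entirely. A minimal sketch using only the Python standard library; `parse_rss_items` and the inline sample feed are illustrative, not part of the toolkit:

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>First Post</title>
    <link>https://example.com/first</link>
    <description>Intro paragraph.</description>
  </item>
</channel></rss>"""

def parse_rss_items(xml_text: str) -> list[dict]:
    """Pull title/link/description from each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

items = parse_rss_items(SAMPLE_FEED)
```

In practice the feed XML would be fetched over HTTP first; Atom feeds use different element names (`entry`, `summary`) and would need a second branch.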
Extracted Content Fields
| Field | Source | Reliability |
|---|---|---|
| Title | `<title>`, `og:title`, `<h1>` | High |
| Author | author meta, byline elements | Medium |
| Published Date | `article:published_time`, `<time>` | Medium |
| Content | Main article body (cleaned HTML) | High |
| Excerpt | description meta, first paragraph | High |
| Featured Image | og:image, first content image | Medium |
| Site Name | og:site_name, domain | High |
| Language | lang attribute, Content-Language | High |
| Word Count | Calculated from extracted text | High |
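Most of the fields above come from `<meta>` tags in the page head, with fallbacks when the preferred source is missing. A stdlib-only sketch of that collection step; `MetaTagParser` and the sample `html_doc` are illustrative:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta> property/name -> content pairs and the <title> text."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "content" in a:
            key = a.get("property") or a.get("name")
            if key:
                self.meta[key] = a["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html_doc = """<html><head><title>Fallback Title</title>
<meta property="og:title" content="OG Title">
<meta property="og:site_name" content="Example">
<meta name="author" content="Jane Doe">
</head><body></body></html>"""

p = MetaTagParser()
p.feed(html_doc)
title = p.meta.get("og:title") or p.title  # og:title outranks <title>
```

The same priority pattern (preferred source, then fallback) applies to author, date, and site name.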
```python
# Python extraction with trafilatura
import json

import trafilatura
from slugify import slugify  # python-slugify package

def extract_article(url: str) -> dict | None:
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        raise ValueError(f"Could not fetch {url}")

    # Extract with metadata
    result = trafilatura.extract(
        downloaded,
        include_comments=False,
        include_tables=True,
        include_images=True,
        include_links=True,
        output_format='json',
        with_metadata=True,
    )
    return json.loads(result) if result else None

# Batch extraction
def extract_batch(urls: list[str], output_dir: str):
    for url in urls:
        article = extract_article(url)
        if article:
            slug = slugify(article['title'])
            with open(f"{output_dir}/{slug}.json", 'w') as f:
                json.dump(article, f, indent=2)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| engine | Extraction engine: readability, trafilatura, newspaper | readability |
| output_format | Output: markdown, html, json, text | markdown |
| include_images | Include image URLs in output | true |
| include_links | Preserve hyperlinks in content | true |
| include_metadata | Extract author, date, site info | true |
| min_content_length | Minimum characters for valid extraction | 200 |
| timeout | Request timeout in seconds | 30 |
| user_agent | Custom User-Agent header | Standard browser UA |
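These parameters map naturally onto a small config object. `ExtractorConfig` below is a hypothetical sketch mirroring the table's defaults, not an API the skill exposes:

```python
from dataclasses import dataclass

@dataclass
class ExtractorConfig:
    """Defaults match the configuration table above."""
    engine: str = "readability"
    output_format: str = "markdown"
    include_images: bool = True
    include_links: bool = True
    include_metadata: bool = True
    min_content_length: int = 200
    timeout: int = 30
    user_agent: str = "Mozilla/5.0 (compatible; ArticleExtractor/1.0)"

# Override only what differs from the defaults
cfg = ExtractorConfig(engine="trafilatura", timeout=10)
```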
Best Practices
- **Use multiple extraction engines with fallback.** Start with Readability for general articles, fall back to Trafilatura for complex layouts, and use custom selectors for known problematic sites. No single engine handles all web page structures correctly.
- **Validate extraction quality programmatically.** Check that extracted content exceeds a minimum length, contains expected structural elements (paragraphs, headings), and doesn't include navigation or footer text. Automated quality checks catch extraction failures early.
- **Respect robots.txt and rate limit requests.** Check robots.txt before scraping, add delays between requests (1-2 seconds minimum), and use proper User-Agent identification. Aggressive scraping gets IP addresses blocked and may violate terms of service.
- **Cache extracted content to avoid redundant fetching.** Web pages change infrequently. Cache extracted articles with the URL as key and a 24-hour TTL for most content, longer for archived articles. This reduces load on target servers and speeds up your pipeline.
- **Handle JavaScript-rendered content with headless browsers.** Modern SPAs render content via JavaScript that basic HTTP fetching misses. Use Playwright or Puppeteer for sites that return empty content via simple fetch, but only when necessary as headless browsers are roughly 10x slower.
Common Issues
- **Extraction returns boilerplate (navigation, sidebar, footer) mixed with article content.** The extraction engine misjudged content boundaries. Try a different engine, or use site-specific CSS selectors to target the article container. For recurring sites, create extraction profiles with the exact selector for the main content area.
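An extraction profile can be as simple as a domain-to-selector map consulted before falling back to a generic selector. A sketch; the `PROFILES` entries and selectors are hypothetical examples:

```python
from urllib.parse import urlparse

# Hypothetical per-site profiles: domain -> CSS selector for the article body
PROFILES = {
    "example.com": "article.post-content",
    "blog.example.org": "div#main-article",
}
DEFAULT_SELECTOR = "article"

def selector_for(url: str) -> str:
    """Look up the site-specific selector, normalizing away a www. prefix."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return PROFILES.get(host, DEFAULT_SELECTOR)
```

The returned selector would then be handed to whatever DOM library does the actual extraction (e.g. BeautifulSoup's `select_one`).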
- **Images are missing or have relative URLs that don't resolve.** Convert relative image URLs to absolute using the article's base URL. Some sites use lazy loading with data-src instead of src attributes — check alternate attributes and decode any base64 placeholder images.
- **Paywall or login-required content returns truncated articles.** Respect paywalls — extracting paywalled content may violate terms of service. For legitimate access, use authenticated sessions with cookies or API keys. Some sites offer full content via RSS feeds or AMP pages even when the main page is paywalled.