Advanced Article Text Extractor Toolkit

All-in-one skill for extracting full article text and metadata from web pages. Built for Claude Code with best practices and real-world patterns.


Article Text Extractor Toolkit

Intelligent web article extraction engine that pulls clean, structured content from web pages by removing ads, navigation, sidebars, and boilerplate to isolate the main article text, images, and metadata.

When to Use This Skill

Choose Article Text Extractor when:

  • Extracting clean article content from cluttered web pages
  • Building content aggregation or curation pipelines
  • Creating read-later or offline reading archives
  • Feeding article content into summarization or analysis tools
  • Scraping research papers and blog posts for data collection

Consider alternatives when:

  • Need full page rendering with JavaScript — use headless browsers
  • Need structured data extraction — use schema.org parsers
  • Scraping APIs — use API clients directly

Quick Start

```bash
# Activate article extractor
claude skill activate advanced-article-text-extractor-toolkit

# Extract article from URL
claude "Extract the main article content from https://example.com/blog/post"

# Batch extract
claude "Extract articles from all URLs in urls.txt and save as markdown"
```

Example Extraction

```typescript
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractArticle(url: string) {
  const response = await fetch(url);
  const html = await response.text();
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();
  if (!article) throw new Error('Could not extract article');

  return {
    title: article.title,
    byline: article.byline,
    content: article.textContent,   // Plain text
    htmlContent: article.content,   // HTML with formatting
    excerpt: article.excerpt,
    siteName: article.siteName,
    length: article.length,         // Character count
    publishedTime: extractPublishDate(dom.window.document),
  };
}

function extractPublishDate(doc: Document): string | null {
  const selectors = [
    'meta[property="article:published_time"]',
    'meta[name="date"]',
    'meta[name="publish-date"]',
    'time[datetime]',
    '.post-date',
    '.published-date',
  ];
  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el) {
      return (
        el.getAttribute('content') ||
        el.getAttribute('datetime') ||
        el.textContent?.trim() ||
        null
      );
    }
  }
  return null;
}
```

Core Concepts

Extraction Methods

| Method | Description | Best For |
|---|---|---|
| Readability (Mozilla) | DOM-based content scoring and extraction | News articles, blog posts |
| Trafilatura | Python-based with fallback strategies | Academic content, diverse sites |
| Newspaper3k | Article-specific extraction with NLP | News sites specifically |
| Custom Selectors | CSS/XPath targeting specific elements | Known site structures |
| Headless Browser | Full JavaScript rendering then extraction | SPA content, dynamic pages |
| RSS/Atom Feeds | Structured content from feed endpoints | Blogs with active feeds |
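The engines above can be chained generically: try each one in order and keep the first result that passes a minimal quality gate. This is a sketch, not the skill's actual API; the stand-in engine callables and the `min_length` threshold are illustrative assumptions.

```python
from typing import Callable, Optional

def extract_with_fallback(
    html: str,
    engines: list[Callable[[str], Optional[str]]],
    min_length: int = 200,
) -> Optional[str]:
    """Try each extraction engine in order; accept the first result
    that passes a minimal length check."""
    for engine in engines:
        try:
            text = engine(html)
        except Exception:
            continue  # this engine crashed on the page; try the next one
        if text and len(text) >= min_length:
            return text
    return None

# Stand-in engines for illustration (real ones would wrap Readability,
# trafilatura, Newspaper3k, etc.):
flaky = lambda html: None          # simulates a failed extraction
naive = lambda html: html.strip()  # simulates a successful one
result = extract_with_fallback("x" * 500, [flaky, naive])
```

In a real pipeline the callables would wrap the libraries from the table, ordered from fastest/most general to most site-specific.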

Extracted Content Fields

| Field | Source | Reliability |
|---|---|---|
| Title | `<title>`, og:title, `<h1>` | High |
| Author | author meta, byline elements | Medium |
| Published Date | article:published_time, `<time>` | Medium |
| Content | Main article body (cleaned HTML) | High |
| Excerpt | description meta, first paragraph | High |
| Featured Image | og:image, first content image | Medium |
| Site Name | og:site_name, domain | High |
| Language | lang attribute, Content-Language | High |
| Word Count | Calculated from extracted text | High |
```python
# Python extraction with trafilatura
import json

import trafilatura
from slugify import slugify  # python-slugify package

def extract_article(url: str) -> dict | None:
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        raise ValueError(f"Could not fetch {url}")

    # Extract with metadata
    result = trafilatura.extract(
        downloaded,
        include_comments=False,
        include_tables=True,
        include_images=True,
        include_links=True,
        output_format='json',
        with_metadata=True,
    )
    return json.loads(result) if result else None

# Batch extraction
def extract_batch(urls: list[str], output_dir: str):
    for url in urls:
        article = extract_article(url)
        if article:
            slug = slugify(article['title'])
            with open(f"{output_dir}/{slug}.json", 'w') as f:
                json.dump(article, f, indent=2)
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| engine | Extraction engine: readability, trafilatura, newspaper | readability |
| output_format | Output: markdown, html, json, text | markdown |
| include_images | Include image URLs in output | true |
| include_links | Preserve hyperlinks in content | true |
| include_metadata | Extract author, date, site info | true |
| min_content_length | Minimum characters for valid extraction | 200 |
| timeout | Request timeout in seconds | 30 |
| user_agent | Custom User-Agent header | Standard browser UA |
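These defaults can be captured in a small config object. The class and field names below simply mirror the table; they are an illustrative sketch, not a published schema of the skill.

```python
from dataclasses import dataclass

@dataclass
class ExtractorConfig:
    """Defaults mirror the configuration table above."""
    engine: str = "readability"      # readability | trafilatura | newspaper
    output_format: str = "markdown"  # markdown | html | json | text
    include_images: bool = True
    include_links: bool = True
    include_metadata: bool = True
    min_content_length: int = 200    # characters
    timeout: int = 30                # seconds
    user_agent: str = "Mozilla/5.0 (compatible; ArticleExtractor/1.0)"

# Override only what differs from the defaults
config = ExtractorConfig(engine="trafilatura", timeout=10)
```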

Best Practices

  1. Use multiple extraction engines with fallback — Start with Readability for general articles, fall back to Trafilatura for complex layouts, and use custom selectors for known problematic sites. No single engine handles all web page structures correctly.

  2. Validate extraction quality programmatically — Check that extracted content exceeds a minimum length, contains expected structural elements (paragraphs, headings), and doesn't include navigation or footer text. Automated quality checks catch extraction failures early.

  3. Respect robots.txt and rate limit requests — Check robots.txt before scraping, add delays between requests (1-2 seconds minimum), and use proper User-Agent identification. Aggressive scraping gets IP addresses blocked and may violate terms of service.

  4. Cache extracted content to avoid redundant fetching — Web pages change infrequently. Cache extracted articles with the URL as key and a 24-hour TTL for most content, longer for archived articles. This reduces load on target servers and speeds up your pipeline.

  5. Handle JavaScript-rendered content with headless browsers — Modern SPAs render content via JavaScript that basic HTTP fetching misses. Use Playwright or Puppeteer for sites that return empty content via simple fetch, but only when necessary as headless browsers are 10x slower.
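Practice 2 above can be made concrete with a small heuristic gate on extracted text. The length threshold, paragraph count, and boilerplate markers here are assumptions to tune per site, not fixed rules.

```python
BOILERPLATE_MARKERS = (
    "cookie policy",
    "subscribe to our newsletter",
    "all rights reserved",
)

def looks_like_article(text: str, min_length: int = 200) -> bool:
    """Heuristic quality check for extracted article text."""
    if len(text) < min_length:
        return False
    # Real articles have multiple non-empty paragraphs
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if len(paragraphs) < 2:
        return False
    # Reject output dominated by footer/navigation boilerplate
    lowered = text.lower()
    hits = sum(marker in lowered for marker in BOILERPLATE_MARKERS)
    return hits < 2
```

Failing extractions can then be routed back into the engine-fallback chain instead of silently producing junk.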

Common Issues

Extraction returns boilerplate (navigation, sidebar, footer) mixed with article content. The extraction engine misjudged content boundaries. Try a different engine, or use site-specific CSS selectors to target the article container. For recurring sites, create extraction profiles with the exact selector for the main content area.
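One way to implement the per-site extraction profiles mentioned above is a domain-to-selector lookup; the domains and selectors in this sketch are made-up examples.

```python
from urllib.parse import urlparse

# Hypothetical per-site profiles: domain -> CSS selector for the article body
EXTRACTION_PROFILES = {
    "example.com": "article.post-content",
    "blog.example.org": "div#main-article",
}

def selector_for(url: str, default: str = "article") -> str:
    """Look up a site-specific content selector, falling back to a
    generic <article> selector."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return EXTRACTION_PROFILES.get(domain, default)
```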

Images are missing or have relative URLs that don't resolve. Convert relative image URLs to absolute using the article's base URL. Some sites use lazy loading with data-src instead of src attributes — check alternate attributes and decode any base64 placeholder images.

Paywall or login-required content returns truncated articles. Respect paywalls — extracting paywalled content may violate terms of service. For legitimate access, use authenticated sessions with cookies or API keys. Some sites offer full content via RSS feeds or AMP pages even when the main page is paywalled.
