
Advanced Cloudflare Crawl

Streamline your workflow with this skill for crawling entire websites. Includes structured workflows, validation checks, and reusable utility patterns.


Cloudflare Crawl

A web crawling skill for extracting structured data from websites protected by Cloudflare's security features, including handling JavaScript challenges, rate limiting, and anti-bot protections.

When to Use

Choose Cloudflare Crawl when:

  • Extracting data from websites behind Cloudflare's anti-bot protection
  • Building crawlers that handle JavaScript challenges and turnstile verification
  • Setting up rate-limited, respectful crawling pipelines with retry logic
  • Converting dynamic web content to clean structured data

Consider alternatives when:

  • The target site offers an API — always prefer official APIs over scraping
  • Crawling static sites without protection — use simpler HTTP-based scrapers
  • Needing real-time data — use WebSocket feeds or streaming APIs if available

Quick Start

```bash
# Install crawling dependencies
pip install cloudscraper beautifulsoup4 requests-html
npm install crawlee playwright
```
```python
import time

import cloudscraper
from bs4 import BeautifulSoup


class CloudflareCrawler:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.scraper = cloudscraper.create_scraper(
            browser={
                'browser': 'chrome',
                'platform': 'darwin',
                'desktop': True
            }
        )
        self.session_data = {}

    def fetch_page(self, url, retries=3):
        """Fetch a page with Cloudflare bypass and retry logic"""
        for attempt in range(retries):
            try:
                response = self.scraper.get(url, timeout=30)
                if response.status_code == 200:
                    return response.text
                elif response.status_code == 403:
                    print(f"Blocked on attempt {attempt + 1}, waiting...")
                    time.sleep(self.delay * (attempt + 1))
                elif response.status_code == 429:
                    wait = int(response.headers.get('Retry-After', 60))
                    print(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
            except Exception as e:
                print(f"Error: {e}")
                time.sleep(self.delay)
        return None

    def parse_listing(self, html):
        """Parse a listing page into structured data"""
        soup = BeautifulSoup(html, 'html.parser')
        items = []
        for card in soup.select('.item-card'):
            items.append({
                'title': card.select_one('.title').get_text(strip=True),
                'description': card.select_one('.desc').get_text(strip=True),
                'url': card.select_one('a')['href'],
                'metadata': {
                    # Split on the first colon only, so values containing
                    # colons (e.g. URLs, times) are preserved intact
                    attr.get_text(strip=True).split(':', 1)[0]:
                        attr.get_text(strip=True).split(':', 1)[1]
                    for attr in card.select('.meta-item')
                    if ':' in attr.get_text()
                }
            })
        return items

    def crawl_paginated(self, path, max_pages=10):
        """Crawl paginated listings with rate limiting"""
        all_items = []
        for page in range(1, max_pages + 1):
            url = f"{self.base_url}{path}?page={page}"
            html = self.fetch_page(url)
            if not html:
                break
            items = self.parse_listing(html)
            if not items:
                break
            all_items.extend(items)
            print(f"Page {page}: {len(items)} items")
            time.sleep(self.delay)
        return all_items
```

Core Concepts

Cloudflare Protection Levels

| Level | Challenge Type | Bypass Method | Difficulty |
| --- | --- | --- | --- |
| Under Attack | JS Challenge (5s page) | cloudscraper, browser automation | Medium |
| High Security | Managed Challenge | Browser with cookies | High |
| Turnstile | CAPTCHA-like challenge | Manual or solving service | Very High |
| Bot Fight Mode | ML-based detection | Real browser fingerprint | Very High |
| Rate Limiting | Request throttling | Respectful delays | Low |

Browser-Based Crawling with Playwright

```typescript
import { chromium } from 'playwright';

async function crawlWithBrowser(urls: string[]) {
  const browser = await chromium.launch({
    headless: false, // Some protections detect headless
    args: ['--disable-blink-features=AutomationControlled']
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  });

  const results = [];
  for (const url of urls) {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for the Cloudflare challenge to resolve
    await page.waitForFunction(() => {
      return !document.querySelector('#challenge-running');
    }, { timeout: 30000 }).catch(() => {});

    const data = await page.evaluate(() => {
      return {
        title: document.title,
        content: document.querySelector('main')?.innerHTML,
        links: Array.from(
          document.querySelectorAll<HTMLAnchorElement>('a[href]')
        ).map(a => a.href)
      };
    });

    results.push({ url, ...data });
    await page.close();
    await new Promise(r => setTimeout(r, 2000));
  }

  await browser.close();
  return results;
}
```

Configuration

| Option | Description | Default |
| --- | --- | --- |
| `base_url` | Target website base URL | Required |
| `request_delay` | Seconds between requests | `2.0` |
| `max_retries` | Maximum retry attempts per page | `3` |
| `timeout` | Request timeout in seconds | `30` |
| `user_agent` | Browser user agent string | Chrome latest |
| `use_browser` | Use real browser vs HTTP client | `false` |
| `respect_robots` | Honor robots.txt directives | `true` |
| `max_concurrent` | Maximum concurrent requests | `1` |
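The options above could be collected into a single configuration object. This is a hypothetical sketch — the key names mirror the table, not a published API:

```python
# Hypothetical configuration dict mapping the options in the table above;
# example.com is a placeholder target.
config = {
    "base_url": "https://example.com",  # required
    "request_delay": 2.0,               # seconds between requests
    "max_retries": 3,                   # retry attempts per page
    "timeout": 30,                      # request timeout in seconds
    "user_agent": None,                 # None -> library default (latest Chrome)
    "use_browser": False,               # real browser vs plain HTTP client
    "respect_robots": True,             # honor robots.txt directives
    "max_concurrent": 1,                # concurrent request limit
}
```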

Best Practices

  1. Always check for and respect robots.txt and terms of service before crawling any website — unauthorized scraping can result in IP bans, legal action, and service disruption for other users
  2. Implement exponential backoff for retries rather than fixed delays, doubling the wait time after each failed attempt to reduce load on the target server during issues
  3. Rotate user agents and maintain session cookies to mimic natural browsing patterns; using a single fixed user agent for thousands of requests is an obvious bot signal
  4. Cache responses locally to avoid re-fetching pages you have already processed during development and debugging — this reduces load on the target and speeds up your iteration cycle
  5. Use structured extraction with CSS selectors rather than regex on HTML, and validate extracted data against expected schemas to catch parsing issues early when the site's structure changes
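The exponential backoff from practice 2 can be sketched as a small helper. This is a minimal illustration, not part of the skill's API; `fetch` is any callable that returns `None` on failure:

```python
import random
import time


def backoff_delays(base=2.0, retries=5, cap=60.0):
    """Yield exponentially growing wait times: base, 2*base, 4*base, ..., capped."""
    for attempt in range(retries):
        yield min(base * (2 ** attempt), cap)


def fetch_with_backoff(fetch, url, base=2.0, retries=5):
    """Call fetch(url) until it returns a non-None result, backing off between tries."""
    for delay in backoff_delays(base, retries):
        result = fetch(url)
        if result is not None:
            return result
        # Jitter avoids many crawlers retrying in lockstep after an outage
        time.sleep(delay + random.uniform(0, 1))
    return None
```

The cap keeps a long retry chain from waiting minutes between attempts, and the jitter spreads retries out so a fleet of workers does not hammer the server at the same instant.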

Common Issues

Cloudflare JS challenge blocking requests: The 5-second challenge page requires JavaScript execution that plain HTTP clients cannot handle. Use cloudscraper for automatic challenge solving, or switch to browser automation with Playwright when cloudscraper fails on newer challenge versions.

Rate limiting causing incomplete crawls: Aggressive crawling triggers Cloudflare's rate limiter, returning 429 errors. Read the Retry-After header to determine the exact wait time, implement per-domain rate limiters, and consider running crawls during off-peak hours when rate limits may be more lenient.
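A per-domain rate limiter as described above might look like this minimal sketch (class name and interface are illustrative, not from the skill itself):

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between consecutive requests to each domain."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_request = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        """Block until it is safe to hit this URL's domain again."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()
```

Calling `limiter.wait(url)` before each fetch throttles per domain rather than globally, so a multi-site crawl stays fast while each individual server sees a polite request rate.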

Session cookies expiring mid-crawl: Cloudflare cookies have short lifetimes and need refreshing. Monitor for 403 responses during a crawl session, and when detected, revisit the site's homepage to obtain fresh challenge-clearance cookies before resuming the crawl.
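The refresh-on-403 pattern can be sketched as a wrapper. This assumes a `scraper` object with a requests-like `.get()` method (such as a cloudscraper session); the function name and retry count are illustrative:

```python
def fetch_with_refresh(scraper, base_url, url, max_refreshes=2):
    """On a 403, revisit the homepage to pick up fresh clearance cookies, then retry."""
    for _ in range(max_refreshes + 1):
        response = scraper.get(url, timeout=30)
        if response.status_code != 403:
            return response
        # Clearance cookies likely expired: re-solve the challenge on the homepage
        scraper.get(base_url, timeout=30)
    return response
```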
