
Advanced Cloudflare Crawl

Streamline your workflow with this skill for crawling entire websites. Includes structured workflows, validation checks, and reusable utility patterns.


Cloudflare Crawl

A web crawling skill for extracting structured data from websites protected by Cloudflare's security features, including handling JavaScript challenges, rate limiting, and anti-bot protections.

When to Use

Choose Cloudflare Crawl when:

  • Extracting data from websites behind Cloudflare's anti-bot protection
  • Building crawlers that handle JavaScript challenges and turnstile verification
  • Setting up rate-limited, respectful crawling pipelines with retry logic
  • Converting dynamic web content to clean structured data

Consider alternatives when:

  • The target site offers an API — always prefer official APIs over scraping
  • Crawling static sites without protection — use simpler HTTP-based scrapers
  • Needing real-time data — use WebSocket feeds or streaming APIs if available

Quick Start

```bash
# Install crawling dependencies
pip install cloudscraper beautifulsoup4 requests-html
npm install crawlee playwright
```
```python
import time

import cloudscraper
from bs4 import BeautifulSoup


class CloudflareCrawler:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.scraper = cloudscraper.create_scraper(
            browser={
                'browser': 'chrome',
                'platform': 'darwin',
                'desktop': True
            }
        )
        self.session_data = {}

    def fetch_page(self, url, retries=3):
        """Fetch a page with Cloudflare bypass and retry logic"""
        for attempt in range(retries):
            try:
                response = self.scraper.get(url, timeout=30)
                if response.status_code == 200:
                    return response.text
                elif response.status_code == 403:
                    print(f"Blocked on attempt {attempt + 1}, waiting...")
                    time.sleep(self.delay * (attempt + 1))
                elif response.status_code == 429:
                    wait = int(response.headers.get('Retry-After', 60))
                    print(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
            except Exception as e:
                print(f"Error: {e}")
                time.sleep(self.delay)
        return None

    def parse_listing(self, html):
        """Parse a listing page into structured data"""
        soup = BeautifulSoup(html, 'html.parser')
        items = []
        for card in soup.select('.item-card'):
            items.append({
                'title': card.select_one('.title').get_text(strip=True),
                'description': card.select_one('.desc').get_text(strip=True),
                'url': card.select_one('a')['href'],
                'metadata': {
                    # Split on the first colon only, so values containing
                    # colons (e.g. URLs, times) are preserved intact
                    attr.get_text(strip=True).split(':', 1)[0]:
                        attr.get_text(strip=True).split(':', 1)[1]
                    for attr in card.select('.meta-item')
                    if ':' in attr.get_text()
                }
            })
        return items

    def crawl_paginated(self, path, max_pages=10):
        """Crawl paginated listings with rate limiting"""
        all_items = []
        for page in range(1, max_pages + 1):
            url = f"{self.base_url}{path}?page={page}"
            html = self.fetch_page(url)
            if not html:
                break
            items = self.parse_listing(html)
            if not items:
                break
            all_items.extend(items)
            print(f"Page {page}: {len(items)} items")
            time.sleep(self.delay)
        return all_items
```

Core Concepts

Cloudflare Protection Levels

| Level | Challenge Type | Bypass Method | Difficulty |
| --- | --- | --- | --- |
| Under Attack | JS Challenge (5s page) | cloudscraper, browser automation | Medium |
| High Security | Managed Challenge | Browser with cookies | High |
| Turnstile | CAPTCHA-like challenge | Manual or solving service | Very High |
| Bot Fight Mode | ML-based detection | Real browser fingerprint | Very High |
| Rate Limiting | Request throttling | Respectful delays | Low |

Browser-Based Crawling with Playwright

```typescript
import { chromium } from 'playwright';

async function crawlWithBrowser(urls: string[]) {
  const browser = await chromium.launch({
    headless: false, // Some protections detect headless
    args: ['--disable-blink-features=AutomationControlled']
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  });

  const results = [];
  for (const url of urls) {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for the Cloudflare challenge to resolve
    await page.waitForFunction(() => {
      return !document.querySelector('#challenge-running');
    }, { timeout: 30000 }).catch(() => {});

    const data = await page.evaluate(() => {
      return {
        title: document.title,
        content: document.querySelector('main')?.innerHTML,
        links: Array.from(
          document.querySelectorAll<HTMLAnchorElement>('a[href]')
        ).map(a => a.href)
      };
    });

    results.push({ url, ...data });
    await page.close();
    await new Promise(r => setTimeout(r, 2000));
  }

  await browser.close();
  return results;
}
```

Configuration

| Option | Description | Default |
| --- | --- | --- |
| `base_url` | Target website base URL | Required |
| `request_delay` | Seconds between requests | `2.0` |
| `max_retries` | Maximum retry attempts per page | `3` |
| `timeout` | Request timeout in seconds | `30` |
| `user_agent` | Browser user agent string | Chrome latest |
| `use_browser` | Use real browser vs HTTP client | `false` |
| `respect_robots` | Honor robots.txt directives | `true` |
| `max_concurrent` | Maximum concurrent requests | `1` |
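The options above could be collected into a single configuration object. This is a hypothetical sketch — the key names mirror the table, not a published API:

```python
# Hypothetical configuration dict mapping the options in the table above;
# example.com is a placeholder target.
config = {
    "base_url": "https://example.com",  # required
    "request_delay": 2.0,               # seconds between requests
    "max_retries": 3,                   # retry attempts per page
    "timeout": 30,                      # request timeout in seconds
    "user_agent": None,                 # None -> library default (latest Chrome)
    "use_browser": False,               # real browser vs plain HTTP client
    "respect_robots": True,             # honor robots.txt directives
    "max_concurrent": 1,                # concurrent request limit
}
```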

Best Practices

  1. Always check for and respect robots.txt and terms of service before crawling any website — unauthorized scraping can result in IP bans, legal action, and service disruption for other users
  2. Implement exponential backoff for retries rather than fixed delays, doubling the wait time after each failed attempt to reduce load on the target server during issues
  3. Rotate user agents and maintain session cookies to mimic natural browsing patterns; using a single fixed user agent for thousands of requests is an obvious bot signal
  4. Cache responses locally to avoid re-fetching pages you have already processed during development and debugging — this reduces load on the target and speeds up your iteration cycle
  5. Use structured extraction with CSS selectors rather than regex on HTML, and validate extracted data against expected schemas to catch parsing issues early when the site's structure changes
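The exponential backoff from practice 2 can be sketched as a small helper. This is a minimal illustration, not part of the skill's API; `fetch` is any callable that returns `None` on failure:

```python
import random
import time


def backoff_delays(base=2.0, retries=5, cap=60.0):
    """Yield exponentially growing wait times: base, 2*base, 4*base, ..., capped."""
    for attempt in range(retries):
        yield min(base * (2 ** attempt), cap)


def fetch_with_backoff(fetch, url, base=2.0, retries=5):
    """Call fetch(url) until it returns a non-None result, backing off between tries."""
    for delay in backoff_delays(base, retries):
        result = fetch(url)
        if result is not None:
            return result
        # Jitter avoids many crawlers retrying in lockstep after an outage
        time.sleep(delay + random.uniform(0, 1))
    return None
```

The cap keeps a long retry chain from waiting minutes between attempts, and the jitter spreads retries out so a fleet of workers does not hammer the server at the same instant.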

Common Issues

Cloudflare JS challenge blocking requests: The 5-second challenge page requires JavaScript execution that plain HTTP clients cannot handle. Use cloudscraper for automatic challenge solving, or switch to browser automation with Playwright when cloudscraper fails on newer challenge versions.

Rate limiting causing incomplete crawls: Aggressive crawling triggers Cloudflare's rate limiter, returning 429 errors. Read the Retry-After header to determine the exact wait time, implement per-domain rate limiters, and consider running crawls during off-peak hours when rate limits may be more lenient.
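A per-domain rate limiter as described above might look like this minimal sketch (class name and interface are illustrative, not from the skill itself):

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between consecutive requests to each domain."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_request = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        """Block until it is safe to hit this URL's domain again."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()
```

Calling `limiter.wait(url)` before each fetch throttles per domain rather than globally, so a multi-site crawl stays fast while each individual server sees a polite request rate.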

Session cookies expiring mid-crawl: Cloudflare cookies have short lifetimes and need refreshing. Monitor for 403 responses during a crawl session, and when detected, revisit the site's homepage to obtain fresh challenge-clearance cookies before resuming the crawl.
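The refresh-on-403 pattern can be sketched as a wrapper. This assumes a `scraper` object with a requests-like `.get()` method (such as a cloudscraper session); the function name and retry count are illustrative:

```python
def fetch_with_refresh(scraper, base_url, url, max_refreshes=2):
    """On a 403, revisit the homepage to pick up fresh clearance cookies, then retry."""
    for _ in range(max_refreshes + 1):
        response = scraper.get(url, timeout=30)
        if response.status_code != 403:
            return response
        # Clearance cookies likely expired: re-solve the challenge on the homepage
        scraper.get(base_url, timeout=30)
    return response
```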
