Cloudflare Crawl
A web crawling skill for extracting structured data from websites protected by Cloudflare's security features, including handling JavaScript challenges, rate limiting, and anti-bot protections.
When to Use
Choose Cloudflare Crawl when:
- Extracting data from websites behind Cloudflare's anti-bot protection
- Building crawlers that handle JavaScript challenges and turnstile verification
- Setting up rate-limited, respectful crawling pipelines with retry logic
- Converting dynamic web content to clean structured data
Consider alternatives when:
- The target site offers an API — always prefer official APIs over scraping
- Crawling static sites without protection — use simpler HTTP-based scrapers
- Needing real-time data — use WebSocket feeds or streaming APIs if available
Quick Start
```bash
# Install crawling dependencies
pip install cloudscraper beautifulsoup4 requests-html
npm install crawlee playwright
```
```python
import cloudscraper
from bs4 import BeautifulSoup
import time


class CloudflareCrawler:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.scraper = cloudscraper.create_scraper(
            browser={
                'browser': 'chrome',
                'platform': 'darwin',
                'desktop': True
            }
        )
        self.session_data = {}

    def fetch_page(self, url, retries=3):
        """Fetch a page with Cloudflare bypass and retry logic"""
        for attempt in range(retries):
            try:
                response = self.scraper.get(url, timeout=30)
                if response.status_code == 200:
                    return response.text
                elif response.status_code == 403:
                    print(f"Blocked on attempt {attempt + 1}, waiting...")
                    time.sleep(self.delay * (attempt + 1))
                elif response.status_code == 429:
                    wait = int(response.headers.get('Retry-After', 60))
                    print(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
            except Exception as e:
                print(f"Error: {e}")
                time.sleep(self.delay)
        return None

    def parse_listing(self, html):
        """Parse a listing page into structured data"""
        soup = BeautifulSoup(html, 'html.parser')
        items = []
        for card in soup.select('.item-card'):
            items.append({
                'title': card.select_one('.title').get_text(strip=True),
                'description': card.select_one('.desc').get_text(strip=True),
                'url': card.select_one('a')['href'],
                'metadata': {
                    attr.get_text(strip=True).split(':')[0]:
                        attr.get_text(strip=True).split(':')[1]
                    for attr in card.select('.meta-item')
                    if ':' in attr.get_text()
                }
            })
        return items

    def crawl_paginated(self, path, max_pages=10):
        """Crawl paginated listings with rate limiting"""
        all_items = []
        for page in range(1, max_pages + 1):
            url = f"{self.base_url}{path}?page={page}"
            html = self.fetch_page(url)
            if not html:
                break
            items = self.parse_listing(html)
            if not items:
                break
            all_items.extend(items)
            print(f"Page {page}: {len(items)} items")
            time.sleep(self.delay)
        return all_items
```
Core Concepts
Cloudflare Protection Levels
| Level | Challenge Type | Bypass Method | Difficulty |
|---|---|---|---|
| Under Attack | JS Challenge (5s page) | cloudscraper, browser automation | Medium |
| High Security | Managed Challenge | Browser with cookies | High |
| Turnstile | CAPTCHA-like challenge | Manual or solving service | Very High |
| Bot Fight Mode | ML-based detection | Real browser fingerprint | Very High |
| Rate Limiting | Request throttling | Respectful delays | Low |
Browser-Based Crawling with Playwright
```typescript
import { chromium } from 'playwright';

async function crawlWithBrowser(urls: string[]) {
  const browser = await chromium.launch({
    headless: false, // Some protections detect headless mode
    args: ['--disable-blink-features=AutomationControlled']
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  });

  const results = [];
  for (const url of urls) {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for the Cloudflare challenge to resolve
    await page.waitForFunction(() => {
      return !document.querySelector('#challenge-running');
    }, { timeout: 30000 }).catch(() => {});

    const data = await page.evaluate(() => {
      return {
        title: document.title,
        content: document.querySelector('main')?.innerHTML,
        links: Array.from(
          document.querySelectorAll<HTMLAnchorElement>('a[href]')
        ).map(a => a.href)
      };
    });
    results.push({ url, ...data });
    await page.close();
    await new Promise(r => setTimeout(r, 2000));
  }

  await browser.close();
  return results;
}
```
Configuration
| Option | Description | Default |
|---|---|---|
| `base_url` | Target website base URL | Required |
| `request_delay` | Seconds between requests | `2.0` |
| `max_retries` | Maximum retry attempts per page | `3` |
| `timeout` | Request timeout in seconds | `30` |
| `user_agent` | Browser user agent string | Latest Chrome |
| `use_browser` | Use a real browser instead of an HTTP client | `false` |
| `respect_robots` | Honor robots.txt directives | `true` |
| `max_concurrent` | Maximum concurrent requests | `1` |
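One way to keep these options together in the Python crawler is a small config object. This is a sketch, not part of the skill's API; the class name `CrawlConfig` is hypothetical, and the defaults simply mirror the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlConfig:
    """Crawler settings; defaults match the configuration table."""
    base_url: str                          # required: target website base URL
    request_delay: float = 2.0             # seconds between requests
    max_retries: int = 3                   # retry attempts per page
    timeout: int = 30                      # request timeout in seconds
    user_agent: Optional[str] = None       # None -> let the client pick a default
    use_browser: bool = False              # real browser vs plain HTTP client
    respect_robots: bool = True            # honor robots.txt directives
    max_concurrent: int = 1                # maximum concurrent requests
```

A dataclass keeps the required `base_url` explicit while every tunable stays overridable at construction time, e.g. `CrawlConfig(base_url="https://example.com", use_browser=True)`.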
Best Practices
- Always check for and respect robots.txt and terms of service before crawling any website — unauthorized scraping can result in IP bans, legal action, and service disruption for other users
- Implement exponential backoff for retries rather than fixed delays, doubling the wait time after each failed attempt (ideally with random jitter) so a struggling server is not hammered while it recovers
- Rotate user agents and maintain session cookies to mimic natural browsing patterns; using a single fixed user agent for thousands of requests is an obvious bot signal
- Cache responses locally to avoid re-fetching pages you have already processed during development and debugging — this reduces load on the target and speeds up your iteration cycle
- Use structured extraction with CSS selectors rather than regex on HTML, and validate extracted data against expected schemas to catch parsing issues early when the site's structure changes
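The backoff advice above can be sketched as a small helper. This is one common formulation ("full jitter": a random wait up to the capped exponential value), not the only valid one; the function name and defaults are illustrative.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Return a wait time for the given retry attempt (0-indexed).

    The ceiling doubles each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly below that ceiling so
    simultaneous clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

In a retry loop you would call `time.sleep(backoff_delay(attempt))` in place of the fixed `time.sleep(self.delay)` used in the Quick Start crawler.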
Common Issues
Cloudflare JS challenge blocking requests: The 5-second challenge page requires JavaScript execution that plain HTTP clients cannot handle. Use cloudscraper for automatic challenge solving, or switch to browser automation with Playwright when cloudscraper fails on newer challenge versions.
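The "try cloudscraper first, escalate to a browser" pattern can be expressed as a chain of fetch strategies, cheapest first. A minimal sketch, with a hypothetical `fetch_with_fallback` helper; each strategy is any callable that returns HTML or `None`:

```python
from typing import Callable, Iterable, Optional

FetchStrategy = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, strategies: Iterable[FetchStrategy]) -> Optional[str]:
    """Try each fetch strategy in order (e.g. plain requests, cloudscraper,
    then Playwright) and return the first HTML result obtained.

    A strategy signals failure by returning None or raising; either way
    the next, more capable (and more expensive) strategy is tried.
    """
    for fetch in strategies:
        try:
            html = fetch(url)
        except Exception:
            continue  # strategy failed outright; escalate to the next one
        if html:
            return html
    return None
```

Ordering matters: the HTTP-only strategies cost milliseconds, while launching a browser costs seconds, so reserving the browser for pages that actually need it keeps crawl throughput reasonable.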
Rate limiting causing incomplete crawls: Aggressive crawling triggers Cloudflare's rate limiter, returning 429 errors. Read the Retry-After header to determine the exact wait time, implement per-domain rate limiters, and consider running crawls during off-peak hours when rate limits may be more lenient.
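A per-domain limiter like the one described might look as follows. This is a sketch (the class name `DomainRateLimiter` is illustrative): it enforces a minimum gap between requests to the same host and can push the next allowed time further out when a 429 response supplies a Retry-After value in seconds.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self.next_allowed = {}  # domain -> earliest monotonic time for next request

    def wait(self, url: str) -> None:
        """Block until a request to url's domain is permitted, then reserve a slot."""
        domain = urlparse(url).netloc
        delay = self.next_allowed.get(domain, 0.0) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self.next_allowed[domain] = time.monotonic() + self.min_interval

    def note_retry_after(self, url: str, retry_after: str) -> None:
        """Honor a 429's Retry-After header (delta-seconds form) for this domain."""
        domain = urlparse(url).netloc
        self.next_allowed[domain] = time.monotonic() + float(retry_after)
```

Note that Retry-After may also arrive as an HTTP date rather than a number of seconds; this sketch handles only the delta-seconds form.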
Session cookies expiring mid-crawl: Cloudflare cookies have short lifetimes and need refreshing. Monitor for 403 responses during a crawl session, and when detected, revisit the site's homepage to obtain fresh challenge-clearance cookies before resuming the crawl.
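The refresh-on-403 flow above can be sketched as a wrapper around a cloudscraper-style session (any object with a `get(url, timeout=...)` method). The function name `fetch_with_refresh` is hypothetical:

```python
def fetch_with_refresh(scraper, base_url: str, url: str, max_refreshes: int = 2):
    """Fetch url; on a 403 (likely expired clearance cookies), revisit the
    homepage so the session re-solves the challenge, then retry the page."""
    response = scraper.get(url, timeout=30)
    for _ in range(max_refreshes):
        if response.status_code != 403:
            return response
        scraper.get(base_url, timeout=30)  # refresh challenge-clearance cookies
        response = scraper.get(url, timeout=30)
    return response
```

Because cloudscraper sessions store cookies internally, simply re-fetching the homepage is usually enough to pick up a fresh clearance cookie before resuming; capping `max_refreshes` prevents an infinite loop if the site has blocked the client outright.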