
Firecrawl Scraper System

Boost productivity with deep crawling, scraping, screenshots, and parsing. Includes structured workflows, validation checks, and reusable patterns for web development.



A web scraping skill leveraging the Firecrawl API to convert any website into clean, LLM-ready markdown or structured data, handling JavaScript rendering, anti-bot protections, and complex site structures automatically.

When to Use

Choose Firecrawl Scraper when:

  • Converting web pages to clean markdown for LLM consumption or RAG pipelines
  • Crawling entire websites with automatic sitemap following and page discovery
  • Scraping JavaScript-heavy single-page applications that fail with simple HTTP requests
  • Extracting structured data from web pages using LLM-powered extraction schemas

Consider alternatives when:

  • Scraping simple static HTML pages — use requests + BeautifulSoup for lower cost
  • Building a search engine — use dedicated web indexing infrastructure
  • Monitoring specific page changes — use a change detection service

Quick Start

```shell
# Install Firecrawl SDK
pip install firecrawl-py
# or
npm install @mendable/firecrawl-js
```

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page to markdown
result = app.scrape_url("https://example.com/blog/post", {
    "formats": ["markdown", "html"]
})
print(result["markdown"])

# Crawl an entire website
crawl_result = app.crawl_url("https://docs.example.com", {
    "limit": 100,
    "scrapeOptions": {"formats": ["markdown"]}
})
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['markdown'][:200]}...")

# Extract structured data with schema
extract_result = app.scrape_url("https://example.com/product", {
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "features": {"type": "array", "items": {"type": "string"}},
                "rating": {"type": "number"}
            }
        }
    }
})
print(extract_result["extract"])
```

Core Concepts

Firecrawl Capabilities

| Feature | Description | Use Case |
|---|---|---|
| Scrape | Single page to markdown/HTML | Content extraction |
| Crawl | Multi-page with link following | Site-wide indexing |
| Map | Discover all URLs on a site | Sitemap generation |
| Extract | LLM-powered structured extraction | Data collection |
| Screenshot | Capture page screenshots | Visual monitoring |
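Map and Scrape compose naturally: discover URLs first, then scrape only the ones you need. A minimal sketch of the filtering step in between (the `discovered` list is stand-in data; in practice it would come from a Map call):

```python
from urllib.parse import urlparse

def filter_urls(urls, include_prefixes):
    """Keep only URLs whose path starts with one of the given prefixes."""
    kept = []
    for url in urls:
        path = urlparse(url).path
        if any(path.startswith(p) for p in include_prefixes):
            kept.append(url)
    return kept

# Stand-in for the output of a Map call on docs.example.com
discovered = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/guide/setup",
    "https://docs.example.com/legal/terms",
    "https://docs.example.com/login",
]

to_scrape = filter_urls(discovered, ["/guide/"])
print(to_scrape)  # only the two /guide/ pages remain
```

Filtering before scraping keeps credit usage proportional to the pages you actually care about.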

Building a RAG Pipeline

```python
from firecrawl import FirecrawlApp


class FirecrawlRAGPipeline:
    def __init__(self, api_key):
        self.app = FirecrawlApp(api_key=api_key)
        self.documents = []

    def ingest_site(self, url, max_pages=50):
        """Crawl a site and prepare documents for RAG"""
        result = self.app.crawl_url(url, {
            "limit": max_pages,
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True
            }
        })
        for page in result.get("data", []):
            self.documents.append({
                "url": page["metadata"]["url"],
                "title": page["metadata"].get("title", ""),
                "content": page.get("markdown", ""),
                "word_count": len(page.get("markdown", "").split())
            })
        return len(self.documents)

    def chunk_documents(self, chunk_size=1000, overlap=200):
        """Split documents into chunks for embedding"""
        chunks = []
        for doc in self.documents:
            words = doc["content"].split()
            for i in range(0, len(words), chunk_size - overlap):
                chunk = " ".join(words[i:i + chunk_size])
                chunks.append({
                    "text": chunk,
                    "source_url": doc["url"],
                    "source_title": doc["title"]
                })
        return chunks
```
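The sliding-window arithmetic in `chunk_documents` is easy to get wrong, so it helps to check the overlap behavior in isolation. A standalone version of the same loop (the `chunk_words` helper is illustrative, with no API calls):

```python
def chunk_words(words, chunk_size=1000, overlap=200):
    """Split a word list into overlapping chunks, mirroring chunk_documents."""
    chunks = []
    # Step by (chunk_size - overlap) so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(words[i:i + chunk_size])
    return chunks

words = [f"w{i}" for i in range(2000)]
chunks = chunk_words(words, chunk_size=1000, overlap=200)
# Chunks start at 0, 800, 1600, so lengths are 1000, 1000, 400
print([len(c) for c in chunks])  # [1000, 1000, 400]
```

The last 200 words of each chunk reappear at the start of the next, which preserves context for sentences that would otherwise be cut at a chunk boundary.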

Configuration

| Option | Description | Default |
|---|---|---|
| api_key | Firecrawl API key | Required |
| formats | Output formats: markdown, html, extract | ["markdown"] |
| onlyMainContent | Strip navigation, footer, ads | true |
| limit | Maximum pages to crawl | 100 |
| includePaths | URL patterns to include in crawl | [] |
| excludePaths | URL patterns to exclude from crawl | [] |
| maxDepth | Maximum crawl depth from start URL | 5 |
| waitFor | Wait time for JS rendering (ms) | 0 |
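Combining the options above, a targeted crawl configuration might look like the following sketch. Option names follow the table; the exact nesting can differ between SDK versions, so verify against the Firecrawl docs for the version you install:

```python
crawl_options = {
    "limit": 50,                           # hard cap on pages to protect credits
    "maxDepth": 3,                         # stay close to the start URL
    "includePaths": ["/docs/.*"],          # only crawl documentation pages
    "excludePaths": ["/docs/archive/.*"],  # skip stale archived pages
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": True,           # strip nav/footer boilerplate
        "waitFor": 2000,                   # give JS-heavy pages 2s to render
    },
}

# Passed as the second argument: app.crawl_url(start_url, crawl_options)
print(sorted(crawl_options))
```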

Best Practices

  1. Use onlyMainContent: true to strip navigation, headers, footers, and sidebars — this produces cleaner markdown that is more useful for LLM processing and reduces token usage significantly
  2. Set includePaths for targeted crawling to avoid crawling irrelevant pages like login forms, terms of service, and admin pages that waste API credits and pollute your dataset
  3. Use the extract feature with JSON schemas for structured data collection instead of parsing markdown manually — Firecrawl's LLM-powered extraction handles varied page layouts much better than regex
  4. Implement rate limiting and pagination for large crawls by setting reasonable limits and processing results in batches to stay within API rate limits and memory constraints
  5. Cache crawl results locally with timestamps so you can re-process data without re-crawling, and only re-crawl pages that have changed since the last crawl
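Practice 5 (local caching) can be as simple as a JSON file keyed by URL with a fetched-at timestamp. A minimal sketch (the cache layout and the `max_age_hours` threshold are illustrative choices, not part of Firecrawl):

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("firecrawl_cache.json")

def load_cache():
    """Load the cache from disk, or start fresh if none exists."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    return {}

def save_page(cache, url, markdown):
    """Store scraped markdown along with the time it was fetched."""
    cache[url] = {"markdown": markdown, "fetched_at": time.time()}
    CACHE_FILE.write_text(json.dumps(cache))

def is_fresh(cache, url, max_age_hours=24):
    """True if the URL was scraped recently enough to skip re-crawling."""
    entry = cache.get(url)
    if entry is None:
        return False
    return (time.time() - entry["fetched_at"]) < max_age_hours * 3600

cache = load_cache()
save_page(cache, "https://example.com/page", "# Example\nSome content")
print(is_fresh(cache, "https://example.com/page"))        # True
print(is_fresh(cache, "https://example.com/other-page"))  # False
```

Before each crawl, check `is_fresh` per URL and only pass the stale ones to the API.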

Common Issues

JavaScript-rendered content not captured: Some SPAs load content after initial page load via API calls. Use the waitFor parameter to give the page time to render, or specify actions like scrolling to trigger lazy-loaded content before extraction.
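For a lazy-loading SPA, a scrape request might combine `waitFor` with pre-extraction actions. The action shapes below are an assumption based on Firecrawl's documented action types; check the current API reference before relying on them:

```python
scrape_options = {
    "formats": ["markdown"],
    "waitFor": 3000,  # ms to wait after page load before extracting
    # Actions run in the browser before extraction. The exact schema
    # (action type names and fields) is hedged here -- verify against
    # the Firecrawl API reference for your SDK version.
    "actions": [
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down"},
    ],
}

# Used as: app.scrape_url("https://spa.example.com", scrape_options)
print(scrape_options["waitFor"])
```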

Crawl exceeding API credits unexpectedly: Websites with many internal links can have thousands of pages. Set strict limit values, use includePaths to constrain crawling to specific sections, and monitor credit usage through the Firecrawl dashboard.

Markdown quality varies across pages: Different page layouts produce markdown of varying quality. Filter out pages with very short content (under 100 words), use onlyMainContent to remove boilerplate, and post-process markdown to standardize heading levels and remove broken image references.
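The post-processing pass described above can be a few lines of regex plus a word-count filter. A sketch with illustrative thresholds (the 100-word cutoff and the broken-image pattern are assumptions, not Firecrawl behavior):

```python
import re

def clean_markdown(pages, min_words=100):
    """Drop near-empty pages and strip image references with empty targets."""
    cleaned = []
    for page in pages:
        text = page["markdown"]
        # Remove markdown image links whose URL target is empty, e.g. ![logo]()
        text = re.sub(r"!\[[^\]]*\]\(\s*\)", "", text)
        if len(text.split()) >= min_words:
            cleaned.append({**page, "markdown": text})
    return cleaned

pages = [
    {"url": "https://example.com/a", "markdown": "word " * 150 + "![logo]()"},
    {"url": "https://example.com/b", "markdown": "too short"},
]
result = clean_markdown(pages)
print(len(result))  # 1 -- the near-empty page was dropped
```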
