# Firecrawl Scraper System

A web scraping skill leveraging the Firecrawl API to convert any website into clean, LLM-ready markdown or structured data, handling JavaScript rendering, anti-bot protections, and complex site structures automatically.
## When to Use
Choose Firecrawl Scraper when:
- Converting web pages to clean markdown for LLM consumption or RAG pipelines
- Crawling entire websites with automatic sitemap following and page discovery
- Scraping JavaScript-heavy single-page applications that fail with simple HTTP requests
- Extracting structured data from web pages using LLM-powered extraction schemas
Consider alternatives when:
- Scraping simple static HTML pages — use requests + BeautifulSoup for lower cost
- Building a search engine — use dedicated web indexing infrastructure
- Monitoring specific page changes — use a change detection service
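For the first alternative above, a minimal stdlib-only sketch of what "simple static HTML" extraction looks like without Firecrawl (no `requests` or `BeautifulSoup` dependency; the `extract_text` helper is hypothetical, shown only to illustrate the lower-cost path):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from static HTML, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

This only works when the content is present in the initial HTML response; anything rendered by JavaScript after load is exactly the case where Firecrawl earns its cost.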
## Quick Start

```bash
# Install the Firecrawl SDK
pip install firecrawl-py
# or: npm install @mendable/firecrawl-js
```

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page to markdown
result = app.scrape_url("https://example.com/blog/post", {
    "formats": ["markdown", "html"]
})
print(result["markdown"])

# Crawl an entire website
crawl_result = app.crawl_url("https://docs.example.com", {
    "limit": 100,
    "scrapeOptions": {"formats": ["markdown"]}
})
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['markdown'][:200]}...")

# Extract structured data with a schema
extract_result = app.scrape_url("https://example.com/product", {
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "features": {"type": "array", "items": {"type": "string"}},
                "rating": {"type": "number"}
            }
        }
    }
})
print(extract_result["extract"])
```
## Core Concepts

### Firecrawl Capabilities
| Feature | Description | Use Case |
|---|---|---|
| Scrape | Single page to markdown/HTML | Content extraction |
| Crawl | Multi-page with link following | Site-wide indexing |
| Map | Discover all URLs on a site | Sitemap generation |
| Extract | LLM-powered structured extraction | Data collection |
| Screenshot | Capture page screenshots | Visual monitoring |
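The Map feature pairs naturally with targeted scraping: discover every URL first, filter locally, then scrape only what you need. A hedged sketch (the `filter_section` helper is hypothetical, and the sample list stands in for what a real `map_url` call would return):

```python
from urllib.parse import urlparse


def filter_section(urls, prefix):
    """Keep only URLs whose path starts with the given section prefix."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]


# In practice this list would come from Firecrawl's Map feature,
# e.g. the URLs returned for https://docs.example.com; sample data here.
mapped = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/guide/setup",
    "https://docs.example.com/blog/changelog",
    "https://docs.example.com/login",
]
guide_pages = filter_section(mapped, "/guide/")
```

Scraping the filtered list page-by-page gives finer control over credit spend than crawling the whole site.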
### Building a RAG Pipeline

```python
from firecrawl import FirecrawlApp


class FirecrawlRAGPipeline:
    def __init__(self, api_key):
        self.app = FirecrawlApp(api_key=api_key)
        self.documents = []

    def ingest_site(self, url, max_pages=50):
        """Crawl a site and prepare documents for RAG."""
        result = self.app.crawl_url(url, {
            "limit": max_pages,
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True
            }
        })
        for page in result.get("data", []):
            self.documents.append({
                "url": page["metadata"]["url"],
                "title": page["metadata"].get("title", ""),
                "content": page.get("markdown", ""),
                "word_count": len(page.get("markdown", "").split())
            })
        return len(self.documents)

    def chunk_documents(self, chunk_size=1000, overlap=200):
        """Split documents into overlapping word chunks for embedding."""
        chunks = []
        for doc in self.documents:
            words = doc["content"].split()
            for i in range(0, len(words), chunk_size - overlap):
                chunks.append({
                    "text": " ".join(words[i:i + chunk_size]),
                    "source_url": doc["url"],
                    "source_title": doc["title"]
                })
        return chunks
```
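The overlap arithmetic in `chunk_documents` is easy to check in isolation. A standalone mirror of that method (the `chunk_words` name is hypothetical), run on a 25-word document with `chunk_size=10, overlap=5`:

```python
def chunk_words(words, chunk_size=1000, overlap=200):
    """Mirror of chunk_documents: fixed-size word windows, stepped by
    chunk_size - overlap so adjacent chunks share `overlap` words."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = [f"w{i}" for i in range(25)]
chunks = chunk_words(words, chunk_size=10, overlap=5)
# Windows start every 5 words: at indices 0, 5, 10, 15, 20
```

The last chunk may be shorter than `chunk_size`; embedding pipelines generally tolerate that, but you can drop trailing chunks below a minimum length if needed.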
## Configuration

| Option | Description | Default |
|---|---|---|
| `api_key` | Firecrawl API key | Required |
| `formats` | Output formats: `markdown`, `html`, `extract` | `["markdown"]` |
| `onlyMainContent` | Strip navigation, footer, ads | `true` |
| `limit` | Maximum pages to crawl | `100` |
| `includePaths` | URL patterns to include in crawl | `[]` |
| `excludePaths` | URL patterns to exclude from crawl | `[]` |
| `maxDepth` | Maximum crawl depth from start URL | `5` |
| `waitFor` | Wait time for JS rendering (ms) | `0` |
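These options compose into a single crawl parameters dict. A hedged example that confines a crawl to one site section (paths, limits, and pattern syntax are illustrative; verify the exact `includePaths`/`excludePaths` pattern format against the current Firecrawl docs):

```python
crawl_params = {
    "limit": 100,                    # hard cap on pages (and API credits)
    "maxDepth": 3,                   # don't follow links deeper than 3 hops
    "includePaths": ["docs/.*"],     # only crawl the docs section
    "excludePaths": ["docs/archive/.*"],
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": True,
        "waitFor": 2000,             # ms to wait for JS rendering per page
    },
}
# Passed as: app.crawl_url("https://example.com", crawl_params)
```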
## Best Practices

- Use `onlyMainContent: true` to strip navigation, headers, footers, and sidebars — this produces cleaner markdown that is more useful for LLM processing and reduces token usage significantly
- Set `includePaths` for targeted crawling to avoid crawling irrelevant pages like login forms, terms of service, and admin pages that waste API credits and pollute your dataset
- Use the extract feature with JSON schemas for structured data collection instead of parsing markdown manually — Firecrawl's LLM-powered extraction handles varied page layouts much better than regex
- Implement rate limiting and pagination for large crawls by setting reasonable limits and processing results in batches to stay within API rate limits and memory constraints
- Cache crawl results locally with timestamps so you can re-process data without re-crawling, and only re-crawl pages that have changed since the last crawl
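The caching practice above can be sketched with the stdlib alone. This hypothetical `cached_crawl` wrapper stores results as JSON with a fetch timestamp and only re-fetches after a TTL (the real crawl call is injected as a callable, so the helper itself needs no network access):

```python
import json
import time
from pathlib import Path


def cached_crawl(url, fetch, cache_dir="crawl_cache", max_age_s=86400):
    """Return cached crawl results for `url` if fresh, else call `fetch`.

    `fetch` is any callable performing the real crawl (e.g. a wrapper
    around app.crawl_url) that returns JSON-serializable data.
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    key = "".join(c if c.isalnum() else "_" for c in url)
    entry_file = cache_path / f"{key}.json"

    if entry_file.exists():
        entry = json.loads(entry_file.read_text())
        if time.time() - entry["fetched_at"] < max_age_s:
            return entry["data"]  # fresh enough: serve from cache

    data = fetch(url)
    entry_file.write_text(json.dumps({"fetched_at": time.time(), "data": data}))
    return data
```

A per-page variant of the same idea (keyed by page URL, compared against the page's `Last-Modified` or a content hash) lets you re-crawl only changed pages.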
## Common Issues
**JavaScript-rendered content not captured:** Some SPAs load content after initial page load via API calls. Use the `waitFor` parameter to give the page time to render, or specify actions like scrolling to trigger lazy-loaded content before extraction.
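A sketch of scrape parameters for such a page, assuming Firecrawl's page-actions option (the exact action names and shapes here are assumptions; check the current API reference before relying on them):

```python
spa_params = {
    "formats": ["markdown"],
    "waitFor": 3000,  # give the SPA 3s to finish its initial API calls
    # Optional page actions run before extraction (shape assumed from
    # Firecrawl's actions feature; verify against the current API docs):
    "actions": [
        {"type": "wait", "milliseconds": 1000},
        {"type": "scroll", "direction": "down"},  # trigger lazy loading
    ],
}
# app.scrape_url("https://spa.example.com/feed", spa_params)
```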
**Crawl exceeding API credits unexpectedly:** Websites with many internal links can have thousands of pages. Set strict `limit` values, use `includePaths` to constrain crawling to specific sections, and monitor credit usage through the Firecrawl dashboard.
**Markdown quality varies across pages:** Different page layouts produce markdown of varying quality. Filter out pages with very short content (under 100 words), use `onlyMainContent` to remove boilerplate, and post-process markdown to standardize heading levels and remove broken image references.
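The post-processing steps above fit in one small pass. A minimal sketch (the `clean_markdown` helper and its thresholds are illustrative, not part of Firecrawl):

```python
import re


def clean_markdown(md, min_words=100):
    """Post-process scraped markdown: reject very short pages, drop
    broken image references, and demote overly deep headings."""
    if len(md.split()) < min_words:
        return None  # too little content to be worth indexing
    # Remove image references with empty URLs, e.g. ![alt]()
    md = re.sub(r"!\[[^\]]*\]\(\s*\)", "", md)
    # Collapse heading levels deeper than H3 down to H3
    md = re.sub(r"^#{4,}\s", "### ", md, flags=re.MULTILINE)
    return md
```

Running this between crawling and indexing keeps low-value pages out of the RAG pipeline and normalizes heading structure across differently laid-out pages.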