
Firecrawl Scraper System

Boost productivity with deep crawling, scraping, screenshots, and parsing. Includes structured workflows, validation checks, and reusable patterns for web development.



A web scraping skill leveraging the Firecrawl API to convert any website into clean, LLM-ready markdown or structured data, handling JavaScript rendering, anti-bot protections, and complex site structures automatically.

When to Use

Choose Firecrawl Scraper when:

  • Converting web pages to clean markdown for LLM consumption or RAG pipelines
  • Crawling entire websites with automatic sitemap following and page discovery
  • Scraping JavaScript-heavy single-page applications that fail with simple HTTP requests
  • Extracting structured data from web pages using LLM-powered extraction schemas

Consider alternatives when:

  • Scraping simple static HTML pages — use requests + BeautifulSoup for lower cost
  • Building a search engine — use dedicated web indexing infrastructure
  • Monitoring specific page changes — use a change detection service

Quick Start

```shell
# Install Firecrawl SDK
pip install firecrawl-py
# or
npm install @mendable/firecrawl-js
```

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page to markdown
result = app.scrape_url("https://example.com/blog/post", {
    "formats": ["markdown", "html"]
})
print(result["markdown"])

# Crawl an entire website
crawl_result = app.crawl_url("https://docs.example.com", {
    "limit": 100,
    "scrapeOptions": {"formats": ["markdown"]}
})
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['markdown'][:200]}...")

# Extract structured data with schema
extract_result = app.scrape_url("https://example.com/product", {
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "features": {"type": "array", "items": {"type": "string"}},
                "rating": {"type": "number"}
            }
        }
    }
})
print(extract_result["extract"])
```

Core Concepts

Firecrawl Capabilities

| Feature | Description | Use Case |
|---|---|---|
| Scrape | Single page to markdown/HTML | Content extraction |
| Crawl | Multi-page with link following | Site-wide indexing |
| Map | Discover all URLs on a site | Sitemap generation |
| Extract | LLM-powered structured extraction | Data collection |
| Screenshot | Capture page screenshots | Visual monitoring |
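Map and Scrape compose naturally: discover URLs first, then scrape only the ones you need. A minimal sketch of the filtering step in between (the `discovered` list is stand-in data; in practice it would come from a Map call):

```python
from urllib.parse import urlparse

def filter_urls(urls, include_prefixes):
    """Keep only URLs whose path starts with one of the given prefixes."""
    kept = []
    for url in urls:
        path = urlparse(url).path
        if any(path.startswith(p) for p in include_prefixes):
            kept.append(url)
    return kept

# Stand-in for the output of a Map call on docs.example.com
discovered = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/guide/setup",
    "https://docs.example.com/legal/terms",
    "https://docs.example.com/login",
]

to_scrape = filter_urls(discovered, ["/guide/"])
print(to_scrape)  # only the two /guide/ pages remain
```

Filtering before scraping keeps credit usage proportional to the pages you actually care about.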

Building a RAG Pipeline

```python
from firecrawl import FirecrawlApp


class FirecrawlRAGPipeline:
    def __init__(self, api_key):
        self.app = FirecrawlApp(api_key=api_key)
        self.documents = []

    def ingest_site(self, url, max_pages=50):
        """Crawl a site and prepare documents for RAG"""
        result = self.app.crawl_url(url, {
            "limit": max_pages,
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True
            }
        })
        for page in result.get("data", []):
            self.documents.append({
                "url": page["metadata"]["url"],
                "title": page["metadata"].get("title", ""),
                "content": page.get("markdown", ""),
                "word_count": len(page.get("markdown", "").split())
            })
        return len(self.documents)

    def chunk_documents(self, chunk_size=1000, overlap=200):
        """Split documents into chunks for embedding"""
        chunks = []
        for doc in self.documents:
            words = doc["content"].split()
            for i in range(0, len(words), chunk_size - overlap):
                chunk = " ".join(words[i:i + chunk_size])
                chunks.append({
                    "text": chunk,
                    "source_url": doc["url"],
                    "source_title": doc["title"]
                })
        return chunks
```
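The sliding-window arithmetic in `chunk_documents` is easy to get wrong, so it helps to check the overlap behavior in isolation. A standalone version of the same loop (the `chunk_words` helper is illustrative, with no API calls):

```python
def chunk_words(words, chunk_size=1000, overlap=200):
    """Split a word list into overlapping chunks, mirroring chunk_documents."""
    chunks = []
    # Step by (chunk_size - overlap) so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(words[i:i + chunk_size])
    return chunks

words = [f"w{i}" for i in range(2000)]
chunks = chunk_words(words, chunk_size=1000, overlap=200)
# Chunks start at 0, 800, 1600, so lengths are 1000, 1000, 400
print([len(c) for c in chunks])  # [1000, 1000, 400]
```

The last 200 words of each chunk reappear at the start of the next, which preserves context for sentences that would otherwise be cut at a chunk boundary.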

Configuration

| Option | Description | Default |
|---|---|---|
| api_key | Firecrawl API key | Required |
| formats | Output formats: markdown, html, extract | ["markdown"] |
| onlyMainContent | Strip navigation, footer, ads | true |
| limit | Maximum pages to crawl | 100 |
| includePaths | URL patterns to include in crawl | [] |
| excludePaths | URL patterns to exclude from crawl | [] |
| maxDepth | Maximum crawl depth from start URL | 5 |
| waitFor | Wait time for JS rendering (ms) | 0 |
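Combining the options above, a targeted crawl configuration might look like the following sketch. Option names follow the table; the exact nesting can differ between SDK versions, so verify against the Firecrawl docs for the version you install:

```python
crawl_options = {
    "limit": 50,                           # hard cap on pages to protect credits
    "maxDepth": 3,                         # stay close to the start URL
    "includePaths": ["/docs/.*"],          # only crawl documentation pages
    "excludePaths": ["/docs/archive/.*"],  # skip stale archived pages
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": True,           # strip nav/footer boilerplate
        "waitFor": 2000,                   # give JS-heavy pages 2s to render
    },
}

# Passed as the second argument: app.crawl_url(start_url, crawl_options)
print(sorted(crawl_options))
```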

Best Practices

  1. Use onlyMainContent: true to strip navigation, headers, footers, and sidebars — this produces cleaner markdown that is more useful for LLM processing and reduces token usage significantly
  2. Set includePaths for targeted crawling to avoid crawling irrelevant pages like login forms, terms of service, and admin pages that waste API credits and pollute your dataset
  3. Use the extract feature with JSON schemas for structured data collection instead of parsing markdown manually — Firecrawl's LLM-powered extraction handles varied page layouts much better than regex
  4. Implement rate limiting and pagination for large crawls by setting reasonable limits and processing results in batches to stay within API rate limits and memory constraints
  5. Cache crawl results locally with timestamps so you can re-process data without re-crawling, and only re-crawl pages that have changed since the last crawl
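Practice 5 (local caching) can be as simple as a JSON file keyed by URL with a fetched-at timestamp. A minimal sketch (the cache layout and the `max_age_hours` threshold are illustrative choices, not part of Firecrawl):

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("firecrawl_cache.json")

def load_cache():
    """Load the cache from disk, or start fresh if none exists."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    return {}

def save_page(cache, url, markdown):
    """Store scraped markdown along with the time it was fetched."""
    cache[url] = {"markdown": markdown, "fetched_at": time.time()}
    CACHE_FILE.write_text(json.dumps(cache))

def is_fresh(cache, url, max_age_hours=24):
    """True if the URL was scraped recently enough to skip re-crawling."""
    entry = cache.get(url)
    if entry is None:
        return False
    return (time.time() - entry["fetched_at"]) < max_age_hours * 3600

cache = load_cache()
save_page(cache, "https://example.com/page", "# Example\nSome content")
print(is_fresh(cache, "https://example.com/page"))        # True
print(is_fresh(cache, "https://example.com/other-page"))  # False
```

Before each crawl, check `is_fresh` per URL and only pass the stale ones to the API.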

Common Issues

JavaScript-rendered content not captured: Some SPAs load content after initial page load via API calls. Use the waitFor parameter to give the page time to render, or specify actions like scrolling to trigger lazy-loaded content before extraction.
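For a lazy-loading SPA, a scrape request might combine `waitFor` with pre-extraction actions. The action shapes below are an assumption based on Firecrawl's documented action types; check the current API reference before relying on them:

```python
scrape_options = {
    "formats": ["markdown"],
    "waitFor": 3000,  # ms to wait after page load before extracting
    # Actions run in the browser before extraction. The exact schema
    # (action type names and fields) is hedged here -- verify against
    # the Firecrawl API reference for your SDK version.
    "actions": [
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down"},
    ],
}

# Used as: app.scrape_url("https://spa.example.com", scrape_options)
print(scrape_options["waitFor"])
```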

Crawl exceeding API credits unexpectedly: Websites with many internal links can have thousands of pages. Set strict limit values, use includePaths to constrain crawling to specific sections, and monitor credit usage through the Firecrawl dashboard.

Markdown quality varies across pages: Different page layouts produce markdown of varying quality. Filter out pages with very short content (under 100 words), use onlyMainContent to remove boilerplate, and post-process markdown to standardize heading levels and remove broken image references.
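The post-processing pass described above can be a few lines of regex plus a word-count filter. A sketch with illustrative thresholds (the 100-word cutoff and the broken-image pattern are assumptions, not Firecrawl behavior):

```python
import re

def clean_markdown(pages, min_words=100):
    """Drop near-empty pages and strip image references with empty targets."""
    cleaned = []
    for page in pages:
        text = page["markdown"]
        # Remove markdown image links whose URL target is empty, e.g. ![logo]()
        text = re.sub(r"!\[[^\]]*\]\(\s*\)", "", text)
        if len(text.split()) >= min_words:
            cleaned.append({**page, "markdown": text})
    return cleaned

pages = [
    {"url": "https://example.com/a", "markdown": "word " * 150 + "![logo]()"},
    {"url": "https://example.com/b", "markdown": "too short"},
]
result = clean_markdown(pages)
print(len(result))  # 1 -- the near-empty page was dropped
```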
