# Specialist Apify Ally
A web scraping and automation specialist that helps you build, deploy, and manage Apify actors for data extraction, web crawling, and browser automation at scale.
## When to Use This Agent
Choose Specialist Apify Ally when:
- Building web scrapers and crawlers using the Apify platform
- Deploying actors for scheduled data extraction jobs
- Implementing browser automation with Puppeteer or Playwright on Apify
- Managing Apify datasets, key-value stores, and request queues
- Scaling scraping operations with proxy rotation and rate limiting
Consider alternatives when:
- Writing simple one-off scraping scripts (use a general development agent)
- Testing web applications with Playwright (use Specialist Playwright Ally)
- Building API integrations without web scraping (use a backend development agent)
## Quick Start

```yaml
# .claude/agents/specialist-apify-ally.yml
name: Specialist Apify Ally
description: Build and manage Apify web scraping actors
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
```
Example invocation:
```bash
claude "Build an Apify actor that scrapes product listings from an e-commerce site, handles pagination, extracts prices and reviews, and stores results in a dataset"
```
## Core Concepts

### Apify Actor Architecture
| Component | Purpose | Example |
|---|---|---|
| Actor | Serverless function for scraping | Product scraper |
| Dataset | Structured data storage | Extracted product records |
| Key-Value Store | Arbitrary data storage | Screenshots, state, configs |
| Request Queue | URL management | Pages to crawl |
| Proxy | IP rotation and geo-targeting | Residential proxy pool |
### Crawlee-Based Actor

```typescript
// src/main.ts — Apify actor with Crawlee
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 30,
  async requestHandler({ request, $, enqueueLinks, log }) {
    const url = request.url;
    if (url.includes('/products/')) {
      // Product detail page
      const product = {
        url,
        title: $('h1.product-title').text().trim(),
        price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
        rating: parseFloat($('.rating').attr('data-score') || '0'),
        reviews: parseInt($('.review-count').text(), 10) || 0,
        description: $('.description').text().trim(),
        images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
        scrapedAt: new Date().toISOString(),
      };
      await Dataset.pushData(product);
      log.info(`Scraped: ${product.title} — $${product.price}`);
    } else {
      // Listing page — enqueue product links and next pages
      await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
      await enqueueLinks({ selector: 'a.next-page', label: 'LISTING' });
    }
  },
});

const input = await Actor.getInput();
await crawler.run(input?.startUrls || ['https://example.com/products']);
await Actor.exit();
```
### Proxy and Rate Limiting

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
  countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxConcurrency: 5,
  navigationTimeoutSecs: 60,
  maxRequestRetries: 5,
  // Rate limiting
  maxRequestsPerMinute: 30,
  async requestHandler({ page, request, log }) {
    // Wait for dynamic content
    await page.waitForSelector('.product-grid', { timeout: 10000 });
    // Extract data...
  },
  async failedRequestHandler({ request, log }) {
    log.error(`Failed: ${request.url} after ${request.retryCount} retries`);
  },
});
```
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `crawler_type` | Crawler engine (`cheerio`, `playwright`, `puppeteer`) | `cheerio` |
| `max_concurrency` | Maximum parallel requests | `10` |
| `proxy_group` | Proxy pool (`datacenter`, `residential`) | `datacenter` |
| `max_retries` | Retry count for failed requests | `3` |
| `rate_limit` | Maximum requests per minute | `60` |
| `storage_type` | Output storage (`dataset`, `key-value`, `both`) | `dataset` |
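Resolving these parameters against their documented defaults can be sketched as a small helper; a minimal example, where the `ActorInput` shape and camelCase field names are illustrative assumptions mirroring the table rather than part of the Apify SDK:

```typescript
// Hypothetical input shape mirroring the configuration table above.
interface ActorInput {
  crawlerType?: 'cheerio' | 'playwright' | 'puppeteer';
  maxConcurrency?: number;
  proxyGroup?: 'datacenter' | 'residential';
  maxRetries?: number;
  rateLimit?: number;
  storageType?: 'dataset' | 'key-value' | 'both';
}

// Fill in the documented defaults for any field the caller omitted.
function withDefaults(input: ActorInput = {}): Required<ActorInput> {
  return {
    crawlerType: input.crawlerType ?? 'cheerio',
    maxConcurrency: input.maxConcurrency ?? 10,
    proxyGroup: input.proxyGroup ?? 'datacenter',
    maxRetries: input.maxRetries ?? 3,
    rateLimit: input.rateLimit ?? 60,
    storageType: input.storageType ?? 'dataset',
  };
}
```

Calling `withDefaults(await Actor.getInput())` at startup keeps default handling in one place instead of scattering `??` fallbacks through the crawler code.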
## Best Practices

- **Start with CheerioCrawler and upgrade to PlaywrightCrawler only when needed.** CheerioCrawler (HTML parsing without a browser) is 10-50x faster and uses 10x less memory than browser-based crawlers. Switch to PlaywrightCrawler only when the target site requires JavaScript rendering, client-side data loading, or interaction (clicking, scrolling). Many "JavaScript-rendered" sites have API endpoints that return the data directly.
- **Implement request deduplication and state persistence for large crawls.** Crawling millions of pages over hours or days requires handling actor restarts. Use Apify's built-in request queue for URL deduplication and the key-value store to checkpoint progress. Design the actor so it can resume from where it stopped without re-crawling already-processed pages.
- **Validate extracted data before storing it.** Scraping is fragile: HTML structure changes break selectors silently, returning empty or incorrect data. Validate every extracted field: check that prices are positive numbers, URLs are valid, and required fields are non-empty. Log warnings for validation failures and skip invalid records rather than storing garbage data.
- **Respect robots.txt and implement polite crawling patterns.** Set appropriate delays between requests, respect rate limits, and honor robots.txt directives. Aggressive crawling gets IP addresses blocked, APIs rate-limited, and accounts banned. A sustainable crawling speed that runs indefinitely is better than a fast crawl that gets blocked after 1,000 pages.
- **Use proxy rotation for targets with anti-bot protection.** Rotate IP addresses across requests using proxy pools. Use residential proxies for heavily protected sites and datacenter proxies for lenient targets. Rotate user agents alongside IP addresses. Configure session-based proxy assignment, where the same IP handles all requests for one session, to avoid triggering suspicious-activity detection.
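The validation practice above can be sketched as a guard that runs before `Dataset.pushData`; a minimal, standalone example in which the `Product` shape and the specific field rules are illustrative assumptions, not part of the Apify SDK:

```typescript
// Illustrative record shape for a scraped product.
interface Product {
  url: string;
  title: string;
  price: number;
  reviews: number;
}

// Return a list of validation problems; an empty list means the record is safe to store.
function validateProduct(p: Product): string[] {
  const errors: string[] = [];
  if (!p.title.trim()) errors.push('title is empty');
  if (!Number.isFinite(p.price) || p.price <= 0) errors.push('price is not a positive number');
  if (!Number.isInteger(p.reviews) || p.reviews < 0) errors.push('reviews is not a non-negative integer');
  try {
    new URL(p.url); // the URL constructor throws on malformed URLs
  } catch {
    errors.push('url is invalid');
  }
  return errors;
}
```

In the request handler, push the record only when `validateProduct` returns an empty list, and log the errors (with the offending URL) otherwise, so selector breakage shows up in the logs instead of in the dataset.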
## Common Issues

**Scraped data is empty because content is loaded via JavaScript after page render.** CheerioCrawler only sees the initial HTML response, which may be an empty shell whose JavaScript loads the actual content. Switch to PlaywrightCrawler and add `await page.waitForSelector('.data-container')` to wait for dynamic content. Alternatively, inspect the network tab for XHR/fetch requests that return the data as JSON; calling the API directly is faster than rendering the page.
**Actor runs out of memory on large crawls.** Storing millions of URL entries in the request queue or accumulating large datasets in memory causes out-of-memory crashes. Use Apify's persistent request queue (stored externally, not in memory). Push data to the dataset frequently instead of accumulating it. Set `maxRequestsPerCrawl` to limit batch size and chain multiple actor runs for very large crawls.
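Chaining runs means partitioning the URL list so each run stays within its limits; a minimal sketch of the batching step (the batch size is an illustrative choice, and handing each batch to a run is left to the caller):

```typescript
// Split a large URL list into fixed-size batches, one batch per chained actor run.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```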
**Selectors break when the target site updates its HTML structure.** Web scraping is inherently fragile. Minimize breakage by using stable selectors (data attributes, ARIA roles, semantic elements) rather than CSS class names that change with each deploy. Implement monitoring that alerts when extraction yields zero results or unusual patterns. Keep selector definitions separate from crawling logic so they can be updated without modifying the crawler architecture.
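Separating selectors from crawl logic, plus the zero-result monitoring mentioned above, can look like a plain config object and a small health check; a minimal sketch in which the selector strings and the 10% yield threshold are illustrative assumptions:

```typescript
// Selector definitions live in one place, away from the crawler architecture,
// so a site redesign only requires editing this object.
const SELECTORS = {
  title: '[data-testid="product-title"]',
  price: '[data-testid="product-price"]',
  description: '[data-testid="product-description"]',
} as const;

// Flag a run whose extraction yield suggests selectors are silently failing.
function runLooksBroken(recordsExtracted: number, pagesCrawled: number): boolean {
  if (pagesCrawled === 0) return false; // nothing crawled yet, nothing to judge
  return recordsExtracted === 0 || recordsExtracted / pagesCrawled < 0.1;
}
```

A scheduled check that calls `runLooksBroken` on each finished run's stats and sends an alert turns silent selector breakage into a visible failure.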