
Specialist Apify Ally

A battle-tested agent for web scraping and automation with Apify. Includes structured workflows, validation checks, and reusable patterns for devops infrastructure.



A web scraping and automation specialist that helps you build, deploy, and manage Apify actors for data extraction, web crawling, and browser automation at scale.

When to Use This Agent

Choose Specialist Apify Ally when:

  • Building web scrapers and crawlers using the Apify platform
  • Deploying actors for scheduled data extraction jobs
  • Implementing browser automation with Puppeteer or Playwright on Apify
  • Managing Apify datasets, key-value stores, and request queues
  • Scaling scraping operations with proxy rotation and rate limiting

Consider alternatives when:

  • Writing simple one-off scraping scripts (use a general development agent)
  • Testing web applications with Playwright (use Specialist Playwright Ally)
  • Building API integrations without web scraping (use a backend development agent)

Quick Start

```yaml
# .claude/agents/specialist-apify-ally.yml
name: Specialist Apify Ally
description: Build and manage Apify web scraping actors
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
```

Example invocation:

```bash
claude "Build an Apify actor that scrapes product listings from an e-commerce site, handles pagination, extracts prices and reviews, and stores results in a dataset"
```

Core Concepts

Apify Actor Architecture

| Component | Purpose | Example |
|---|---|---|
| Actor | Serverless function for scraping | Product scraper |
| Dataset | Structured data storage | Extracted product records |
| Key-Value Store | Arbitrary data storage | Screenshots, state, configs |
| Request Queue | URL management | Pages to crawl |
| Proxy | IP rotation and geo-targeting | Residential proxy pool |

Crawlee-Based Actor

```typescript
// src/main.ts — Apify actor with Crawlee
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 30,
  async requestHandler({ request, $, enqueueLinks, log }) {
    const url = request.url;
    if (url.includes('/products/')) {
      // Product detail page
      const product = {
        url,
        title: $('h1.product-title').text().trim(),
        price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
        rating: parseFloat($('.rating').attr('data-score') || '0'),
        reviews: parseInt($('.review-count').text()) || 0,
        description: $('.description').text().trim(),
        images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
        scrapedAt: new Date().toISOString(),
      };
      await Dataset.pushData(product);
      log.info(`Scraped: ${product.title} — $${product.price}`);
    } else {
      // Listing page — enqueue product links and next pages
      await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
      await enqueueLinks({ selector: 'a.next-page', label: 'LISTING' });
    }
  },
});

const input = await Actor.getInput();
await crawler.run(input?.startUrls || ['https://example.com/products']);
await Actor.exit();
```

Proxy and Rate Limiting

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
  countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxConcurrency: 5,
  navigationTimeoutSecs: 60,
  maxRequestRetries: 5,
  // Rate limiting
  maxRequestsPerMinute: 30,
  async requestHandler({ page, request, log }) {
    // Wait for dynamic content
    await page.waitForSelector('.product-grid', { timeout: 10000 });
    // Extract data...
  },
  async failedRequestHandler({ request, log }) {
    log.error(`Failed: ${request.url} after ${request.retryCount} retries`);
  },
});
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `crawler_type` | Crawler engine (`cheerio`, `playwright`, `puppeteer`) | `cheerio` |
| `max_concurrency` | Maximum parallel requests | 10 |
| `proxy_group` | Proxy pool (`datacenter`, `residential`) | `datacenter` |
| `max_retries` | Retry count for failed requests | 3 |
| `rate_limit` | Maximum requests per minute | 60 |
| `storage_type` | Output storage (`dataset`, `key-value`, `both`) | `dataset` |
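One way to wire these parameters into an actor is a typed input with defaults. The field names below mirror the table; the `resolveInput` helper is an illustrative sketch, not part of the Apify SDK.

```typescript
// Typed actor input matching the configuration table above.
interface ActorInput {
  crawler_type?: 'cheerio' | 'playwright' | 'puppeteer';
  max_concurrency?: number;
  proxy_group?: 'datacenter' | 'residential';
  max_retries?: number;
  rate_limit?: number;
  storage_type?: 'dataset' | 'key-value' | 'both';
}

// Defaults from the table.
const DEFAULTS: Required<ActorInput> = {
  crawler_type: 'cheerio',
  max_concurrency: 10,
  proxy_group: 'datacenter',
  max_retries: 3,
  rate_limit: 60,
  storage_type: 'dataset',
};

// Merge user input over defaults, ignoring undefined fields,
// so a partial input (e.g. from Actor.getInput()) is always usable.
function resolveInput(input: ActorInput | null): Required<ActorInput> {
  const provided = Object.fromEntries(
    Object.entries(input ?? {}).filter(([, v]) => v !== undefined),
  );
  return { ...DEFAULTS, ...provided } as Required<ActorInput>;
}
```

In the actor, this would typically be called as `resolveInput(await Actor.getInput())`.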

Best Practices

  1. Start with CheerioCrawler and only upgrade to PlaywrightCrawler when needed. CheerioCrawler (HTML parsing without a browser) is 10-50x faster and uses 10x less memory than browser-based crawlers. Only switch to PlaywrightCrawler when the target site requires JavaScript rendering, client-side data loading, or interaction (clicking, scrolling). Many "JavaScript-rendered" sites have API endpoints that return the data directly.

  2. Implement request deduplication and state persistence for large crawls. Crawling millions of pages over hours or days requires handling actor restarts. Use Apify's built-in request queue for URL deduplication and the key-value store to checkpoint progress. Design the actor so it can resume from where it stopped without re-crawling already-processed pages.

  3. Validate extracted data before storing it. Scraping is fragile — HTML structure changes break selectors silently, returning empty or incorrect data. Validate every extracted field: check that prices are positive numbers, URLs are valid, and required fields are non-empty. Log warnings for validation failures and skip invalid records rather than storing garbage data.

  4. Respect robots.txt and implement polite crawling patterns. Set appropriate delays between requests, respect rate limits, and honor robots.txt directives. Aggressive crawling gets IP addresses blocked, APIs rate-limited, and accounts banned. A sustainable crawling speed that runs indefinitely is better than a fast crawl that gets blocked after 1,000 pages.

  5. Use proxy rotation for targets with anti-bot protection. Rotate IP addresses across requests using proxy pools. Use residential proxies for heavily protected sites and datacenter proxies for lenient targets. Rotate user agents alongside IP addresses. Configure session-based proxy assignment where the same IP handles all requests for one session to avoid triggering suspicious activity detection.
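Practice 2 (resumable crawls) can be sketched as a small checkpointing helper. In a real actor the store functions would wrap `Actor.getValue` / `Actor.setValue` on the key-value store; here they are injected so the resume logic stands alone. The state shape and key name are illustrative.

```typescript
// Minimal checkpoint/resume sketch for long-running crawls.
type KVStore = {
  get: (key: string) => Promise<unknown>;
  set: (key: string, value: unknown) => Promise<void>;
};

interface CrawlState {
  processedUrls: string[];
  lastPage: number;
}

const STATE_KEY = 'CRAWL_STATE';

// On startup, resume from the last checkpoint if one exists.
async function loadState(store: KVStore): Promise<CrawlState> {
  const saved = (await store.get(STATE_KEY)) as CrawlState | null;
  return saved ?? { processedUrls: [], lastPage: 0 };
}

// After each processed page, persist progress so a restarted
// run skips already-crawled URLs instead of re-crawling them.
async function checkpoint(
  store: KVStore,
  state: CrawlState,
  url: string,
  page: number,
): Promise<CrawlState> {
  const next = { processedUrls: [...state.processedUrls, url], lastPage: page };
  await store.set(STATE_KEY, next);
  return next;
}
```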
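Practice 3 (validate before storing) might look like the following. The field names follow the crawler example earlier; the checks themselves are a minimal sketch to extend per site.

```typescript
// Validate a scraped record; return a list of problems (empty = valid).
interface Product {
  url: string;
  title: string;
  price: number;
}

function validateProduct(p: Product): string[] {
  const errors: string[] = [];
  if (!p.title.trim()) errors.push('title is empty');
  if (!Number.isFinite(p.price) || p.price <= 0) {
    errors.push('price is not a positive number');
  }
  try {
    new URL(p.url); // throws on malformed URLs
  } catch {
    errors.push('url is invalid');
  }
  return errors;
}

// In the request handler, log and skip rather than store garbage:
// const errors = validateProduct(product);
// if (errors.length) { log.warning(`Skipping ${product.url}: ${errors.join(', ')}`); return; }
```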
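Practice 5 (session-based proxy assignment) maps onto Crawlee's session pool: one session keeps the same IP and cookies across its requests. The option names below are from Crawlee; the specific values are illustrative, not recommendations.

```typescript
// Sketch: session-based proxies so one IP handles one "identity".
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  useSessionPool: true,           // tie proxy IPs to sessions
  persistCookiesPerSession: true, // cookies travel with the same IP
  sessionPoolOptions: {
    maxPoolSize: 20,              // number of concurrent identities
    sessionOptions: {
      maxUsageCount: 50,          // retire a session before it accumulates suspicion
    },
  },
  async requestHandler({ page }) {
    // ... extract data
  },
});
```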

Common Issues

Scraped data is empty because content is loaded via JavaScript after page render. CheerioCrawler only sees the initial HTML response, which may be an empty shell with JavaScript that loads the actual content. Switch to PlaywrightCrawler and add await page.waitForSelector('.data-container') to wait for dynamic content. Alternatively, inspect the network tab for XHR/fetch requests that return the data as JSON — calling the API directly is faster than rendering the page.
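When the network tab reveals a JSON endpoint, the rendering step disappears entirely and only parsing remains. The endpoint path and response shape below are hypothetical; adapt them to whatever the target page actually calls.

```typescript
// Parse a hypothetical product-listing API response into records.
interface ApiProduct {
  id: number;
  name: string;
  price_cents: number;
}

function parseApiProducts(json: string): { id: number; title: string; price: number }[] {
  const items = JSON.parse(json) as ApiProduct[];
  return items.map((p) => ({
    id: p.id,
    title: p.name,
    price: p.price_cents / 100, // convert cents to a decimal price
  }));
}

// In the crawler, fetch the endpoint the page itself calls:
// const res = await fetch('https://example.com/api/products?page=1');
// const records = parseApiProducts(await res.text());
```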

Actor runs out of memory on large crawls. Storing millions of URL entries in the request queue or accumulating large datasets in memory causes out-of-memory crashes. Use Apify's persistent request queue (stored externally, not in memory). Push data to the dataset frequently instead of accumulating it. Set maxRequestsPerCrawl to limit batch size and chain multiple actor runs for very large crawls.
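A memory-safe configuration along these lines combines a per-run cap with immediate data pushes (the limit value is illustrative):

```typescript
// Sketch: cap one run and push records immediately instead of accumulating.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100_000, // stop here; chain actor runs for larger crawls
  async requestHandler({ $, pushData }) {
    // One record per page, written to the dataset right away
    await pushData({ title: $('h1').text().trim() });
  },
});
```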

Selectors break when the target site updates its HTML structure. Web scraping is inherently fragile. Minimize breakage by using stable selectors (data attributes, ARIA roles, semantic elements) rather than CSS class names that change with each deploy. Implement monitoring that alerts when extraction yields zero results or unusual patterns. Keep selector definitions separate from crawling logic so they can be updated without modifying the crawler architecture.
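One way to keep selectors separate from crawling logic is a single selector map plus an extraction function that takes an injected lookup. The selector strings are illustrative; the point is that a site redesign means editing one object, and extraction can be unit-tested without running a crawler.

```typescript
// All selectors in one place — the only file touched when the site changes.
const SELECTORS = {
  title: '[data-testid="product-title"]',
  price: '[data-testid="product-price"]',
} as const;

// The lookup is injected, e.g. (sel) => $(sel).text() with Cheerio,
// or (sel) => page.locator(sel).innerText() with Playwright.
type TextLookup = (selector: string) => string;

function extractProduct(text: TextLookup) {
  return {
    title: text(SELECTORS.title).trim(),
    price: parseFloat(text(SELECTORS.price).replace(/[^0-9.]/g, '')),
  };
}
```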
