
Specialist Apify Ally

A battle-tested agent for web scraping and automation with Apify. Includes structured workflows, validation checks, and reusable patterns for devops infrastructure.



A web scraping and automation specialist that helps you build, deploy, and manage Apify actors for data extraction, web crawling, and browser automation at scale.

When to Use This Agent

Choose Specialist Apify Ally when:

  • Building web scrapers and crawlers using the Apify platform
  • Deploying actors for scheduled data extraction jobs
  • Implementing browser automation with Puppeteer or Playwright on Apify
  • Managing Apify datasets, key-value stores, and request queues
  • Scaling scraping operations with proxy rotation and rate limiting

Consider alternatives when:

  • Writing simple one-off scraping scripts (use a general development agent)
  • Testing web applications with Playwright (use Specialist Playwright Ally)
  • Building API integrations without web scraping (use a backend development agent)

Quick Start

```yaml
# .claude/agents/specialist-apify-ally.yml
name: Specialist Apify Ally
description: Build and manage Apify web scraping actors
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
```

Example invocation:

```bash
claude "Build an Apify actor that scrapes product listings from an e-commerce site, handles pagination, extracts prices and reviews, and stores results in a dataset"
```

Core Concepts

Apify Actor Architecture

| Component | Purpose | Example |
|---|---|---|
| Actor | Serverless function for scraping | Product scraper |
| Dataset | Structured data storage | Extracted product records |
| Key-Value Store | Arbitrary data storage | Screenshots, state, configs |
| Request Queue | URL management | Pages to crawl |
| Proxy | IP rotation and geo-targeting | Residential proxy pool |

Crawlee-Based Actor

```typescript
// src/main.ts — Apify actor with Crawlee
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 30,
  async requestHandler({ request, $, enqueueLinks, log }) {
    const url = request.url;
    if (url.includes('/products/')) {
      // Product detail page
      const product = {
        url,
        title: $('h1.product-title').text().trim(),
        price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
        rating: parseFloat($('.rating').attr('data-score') || '0'),
        reviews: parseInt($('.review-count').text()) || 0,
        description: $('.description').text().trim(),
        images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
        scrapedAt: new Date().toISOString(),
      };
      await Dataset.pushData(product);
      log.info(`Scraped: ${product.title} — $${product.price}`);
    } else {
      // Listing page — enqueue product links and next pages
      await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
      await enqueueLinks({ selector: 'a.next-page', label: 'LISTING' });
    }
  },
});

const input = await Actor.getInput();
await crawler.run(input?.startUrls || ['https://example.com/products']);
await Actor.exit();
```

Proxy and Rate Limiting

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
  countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxConcurrency: 5,
  navigationTimeoutSecs: 60,
  maxRequestRetries: 5,
  // Rate limiting
  maxRequestsPerMinute: 30,
  async requestHandler({ page, request, log }) {
    // Wait for dynamic content
    await page.waitForSelector('.product-grid', { timeout: 10000 });
    // Extract data...
  },
  async failedRequestHandler({ request, log }) {
    log.error(`Failed: ${request.url} after ${request.retryCount} retries`);
  },
});
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `crawler_type` | Crawler engine (`cheerio`, `playwright`, `puppeteer`) | `cheerio` |
| `max_concurrency` | Maximum parallel requests | 10 |
| `proxy_group` | Proxy pool (`datacenter`, `residential`) | `datacenter` |
| `max_retries` | Retry count for failed requests | 3 |
| `rate_limit` | Maximum requests per minute | 60 |
| `storage_type` | Output storage (`dataset`, `key-value`, `both`) | `dataset` |
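One way to wire these parameters into an actor is a typed input with defaults. The field names below mirror the table; the `resolveInput` helper is an illustrative sketch, not part of the Apify SDK.

```typescript
// Typed actor input matching the configuration table above.
interface ActorInput {
  crawler_type?: 'cheerio' | 'playwright' | 'puppeteer';
  max_concurrency?: number;
  proxy_group?: 'datacenter' | 'residential';
  max_retries?: number;
  rate_limit?: number;
  storage_type?: 'dataset' | 'key-value' | 'both';
}

// Defaults from the table.
const DEFAULTS: Required<ActorInput> = {
  crawler_type: 'cheerio',
  max_concurrency: 10,
  proxy_group: 'datacenter',
  max_retries: 3,
  rate_limit: 60,
  storage_type: 'dataset',
};

// Merge user input over defaults, ignoring undefined fields,
// so a partial input (e.g. from Actor.getInput()) is always usable.
function resolveInput(input: ActorInput | null): Required<ActorInput> {
  const provided = Object.fromEntries(
    Object.entries(input ?? {}).filter(([, v]) => v !== undefined),
  );
  return { ...DEFAULTS, ...provided } as Required<ActorInput>;
}
```

In the actor, this would typically be called as `resolveInput(await Actor.getInput())`.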

Best Practices

  1. Start with CheerioCrawler and only upgrade to PlaywrightCrawler when needed. CheerioCrawler (HTML parsing without a browser) is 10-50x faster and uses 10x less memory than browser-based crawlers. Only switch to PlaywrightCrawler when the target site requires JavaScript rendering, client-side data loading, or interaction (clicking, scrolling). Many "JavaScript-rendered" sites have API endpoints that return the data directly.

  2. Implement request deduplication and state persistence for large crawls. Crawling millions of pages over hours or days requires handling actor restarts. Use Apify's built-in request queue for URL deduplication and the key-value store to checkpoint progress. Design the actor so it can resume from where it stopped without re-crawling already-processed pages.

  3. Validate extracted data before storing it. Scraping is fragile — HTML structure changes break selectors silently, returning empty or incorrect data. Validate every extracted field: check that prices are positive numbers, URLs are valid, and required fields are non-empty. Log warnings for validation failures and skip invalid records rather than storing garbage data.

  4. Respect robots.txt and implement polite crawling patterns. Set appropriate delays between requests, respect rate limits, and honor robots.txt directives. Aggressive crawling gets IP addresses blocked, APIs rate-limited, and accounts banned. A sustainable crawling speed that runs indefinitely is better than a fast crawl that gets blocked after 1,000 pages.

  5. Use proxy rotation for targets with anti-bot protection. Rotate IP addresses across requests using proxy pools. Use residential proxies for heavily protected sites and datacenter proxies for lenient targets. Rotate user agents alongside IP addresses. Configure session-based proxy assignment where the same IP handles all requests for one session to avoid triggering suspicious activity detection.
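Practice 2 (resumable crawls) can be sketched as a small checkpointing helper. In a real actor the store functions would wrap `Actor.getValue` / `Actor.setValue` on the key-value store; here they are injected so the resume logic stands alone. The state shape and key name are illustrative.

```typescript
// Minimal checkpoint/resume sketch for long-running crawls.
type KVStore = {
  get: (key: string) => Promise<unknown>;
  set: (key: string, value: unknown) => Promise<void>;
};

interface CrawlState {
  processedUrls: string[];
  lastPage: number;
}

const STATE_KEY = 'CRAWL_STATE';

// On startup, resume from the last checkpoint if one exists.
async function loadState(store: KVStore): Promise<CrawlState> {
  const saved = (await store.get(STATE_KEY)) as CrawlState | null;
  return saved ?? { processedUrls: [], lastPage: 0 };
}

// After each processed page, persist progress so a restarted
// run skips already-crawled URLs instead of re-crawling them.
async function checkpoint(
  store: KVStore,
  state: CrawlState,
  url: string,
  page: number,
): Promise<CrawlState> {
  const next = { processedUrls: [...state.processedUrls, url], lastPage: page };
  await store.set(STATE_KEY, next);
  return next;
}
```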
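Practice 3 (validate before storing) might look like the following. The field names follow the crawler example earlier; the checks themselves are a minimal sketch to extend per site.

```typescript
// Validate a scraped record; return a list of problems (empty = valid).
interface Product {
  url: string;
  title: string;
  price: number;
}

function validateProduct(p: Product): string[] {
  const errors: string[] = [];
  if (!p.title.trim()) errors.push('title is empty');
  if (!Number.isFinite(p.price) || p.price <= 0) {
    errors.push('price is not a positive number');
  }
  try {
    new URL(p.url); // throws on malformed URLs
  } catch {
    errors.push('url is invalid');
  }
  return errors;
}

// In the request handler, log and skip rather than store garbage:
// const errors = validateProduct(product);
// if (errors.length) { log.warning(`Skipping ${product.url}: ${errors.join(', ')}`); return; }
```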
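Practice 5 (session-based proxy assignment) maps onto Crawlee's session pool: one session keeps the same IP and cookies across its requests. The option names below are from Crawlee; the specific values are illustrative, not recommendations.

```typescript
// Sketch: session-based proxies so one IP handles one "identity".
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  useSessionPool: true,           // tie proxy IPs to sessions
  persistCookiesPerSession: true, // cookies travel with the same IP
  sessionPoolOptions: {
    maxPoolSize: 20,              // number of concurrent identities
    sessionOptions: {
      maxUsageCount: 50,          // retire a session before it accumulates suspicion
    },
  },
  async requestHandler({ page }) {
    // ... extract data
  },
});
```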

Common Issues

Scraped data is empty because content is loaded via JavaScript after page render. CheerioCrawler only sees the initial HTML response, which may be an empty shell with JavaScript that loads the actual content. Switch to PlaywrightCrawler and add await page.waitForSelector('.data-container') to wait for dynamic content. Alternatively, inspect the network tab for XHR/fetch requests that return the data as JSON — calling the API directly is faster than rendering the page.
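When the network tab reveals a JSON endpoint, the rendering step disappears entirely and only parsing remains. The endpoint path and response shape below are hypothetical; adapt them to whatever the target page actually calls.

```typescript
// Parse a hypothetical product-listing API response into records.
interface ApiProduct {
  id: number;
  name: string;
  price_cents: number;
}

function parseApiProducts(json: string): { id: number; title: string; price: number }[] {
  const items = JSON.parse(json) as ApiProduct[];
  return items.map((p) => ({
    id: p.id,
    title: p.name,
    price: p.price_cents / 100, // convert cents to a decimal price
  }));
}

// In the crawler, fetch the endpoint the page itself calls:
// const res = await fetch('https://example.com/api/products?page=1');
// const records = parseApiProducts(await res.text());
```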

Actor runs out of memory on large crawls. Storing millions of URL entries in the request queue or accumulating large datasets in memory causes out-of-memory crashes. Use Apify's persistent request queue (stored externally, not in memory). Push data to the dataset frequently instead of accumulating it. Set maxRequestsPerCrawl to limit batch size and chain multiple actor runs for very large crawls.
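A memory-safe configuration along these lines combines a per-run cap with immediate data pushes (the limit value is illustrative):

```typescript
// Sketch: cap one run and push records immediately instead of accumulating.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100_000, // stop here; chain actor runs for larger crawls
  async requestHandler({ $, pushData }) {
    // One record per page, written to the dataset right away
    await pushData({ title: $('h1').text().trim() });
  },
});
```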

Selectors break when the target site updates its HTML structure. Web scraping is inherently fragile. Minimize breakage by using stable selectors (data attributes, ARIA roles, semantic elements) rather than CSS class names that change with each deploy. Implement monitoring that alerts when extraction yields zero results or unusual patterns. Keep selector definitions separate from crawling logic so they can be updated without modifying the crawler architecture.
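One way to keep selectors separate from crawling logic is a single selector map plus an extraction function that takes an injected lookup. The selector strings are illustrative; the point is that a site redesign means editing one object, and extraction can be unit-tested without running a crawler.

```typescript
// All selectors in one place — the only file touched when the site changes.
const SELECTORS = {
  title: '[data-testid="product-title"]',
  price: '[data-testid="product-price"]',
} as const;

// The lookup is injected, e.g. (sel) => $(sel).text() with Cheerio,
// or (sel) => page.locator(sel).innerText() with Playwright.
type TextLookup = (selector: string) => string;

function extractProduct(text: TextLookup) {
  return {
    title: text(SELECTORS.title).trim(),
    price: parseFloat(text(SELECTORS.price).replace(/[^0-9.]/g, '')),
  };
}
```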
