
Precision Web Scraping MCP Bridge

All-in-one MCP server for scraping, crawling, and mapping websites. Built for Claude Code with best practices and real-world patterns.


Precision Web Scraping MCP Bridge is an MCP server designed for targeted, high-accuracy web data extraction. It gives AI assistants advanced scraping capabilities that handle complex page structures, dynamic content, and structured data extraction patterns. The bridge goes beyond basic fetching by offering CSS selector targeting, XPath queries, pagination handling, and data normalization, enabling precise extraction of specific data points from web pages.

When to Use This MCP Server

Connect this server when...

  • You need to extract specific data points from web pages using CSS selectors or XPath expressions with high precision
  • Your workflow involves scraping structured data like product listings, price tables, or directory entries from websites
  • You want to handle multi-page scraping with automatic pagination detection and following
  • You need to extract data from pages with complex layouts where simple text extraction produces messy results
  • You are building data collection pipelines that need consistent, structured output from diverse web sources

Consider alternatives when...

  • You only need to read article content or documentation (use a content reading MCP server)
  • Your scraping target has an official API that provides the data in structured format
  • You need real-time data streaming rather than on-demand page scraping

Quick Start

# .mcp.json configuration
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": ["-y", "@mcp/precision-web-scraper"],
      "env": {
        "HEADLESS": "true",
        "RESPECT_ROBOTS": "true"
      }
    }
  }
}

Connection setup:

  1. Ensure Node.js 18+ is installed on your system
  2. The server requires Chromium for JavaScript-rendered page scraping
  3. Add the configuration above to your .mcp.json file
  4. Restart your MCP client to activate the web scraper

Example tool usage:

# Extract product data
> Scrape all product names and prices from the category page at https://example.com/products

# Use CSS selectors
> Extract all elements matching ".review-card .rating" from the reviews page

# Handle pagination
> Scrape all job listings from the careers page, following pagination to get all results

Core Concepts

| Concept | Purpose | Details |
| --- | --- | --- |
| CSS Selectors | Element targeting | Precise targeting of page elements using CSS selector syntax for clean data extraction |
| XPath Queries | Advanced selection | XML path expressions for complex element selection, including parent, sibling, and attribute traversal |
| Pagination Handling | Multi-page extraction | Automatic detection and following of pagination links to scrape data spanning multiple pages |
| Data Normalization | Output consistency | Cleaning and standardizing extracted data into consistent formats (JSON, CSV, Markdown tables) |
| Rate Control | Responsible scraping | Configurable request delays and concurrent request limits to avoid overloading target servers |

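To illustrate the difference between broad text extraction and targeted selection, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited XPath subset. The markup and selectors are hypothetical; the MCP server performs selection internally, so this only demonstrates the concept:

```python
import xml.etree.ElementTree as ET

# Hypothetical product-listing fragment (well-formed, XHTML-style markup).
html = """
<ul class="products">
  <li class="product"><span class="name">Widget A</span><span class="price">$9.99</span></li>
  <li class="product"><span class="name">Widget B</span><span class="price">$14.50</span></li>
</ul>
"""

root = ET.fromstring(html)

# Targeted selection: ElementTree's XPath subset supports tag names,
# // descendant searches, and [@attr='value'] predicates.
names = [el.text for el in root.findall(".//span[@class='name']")]
prices = [el.text for el in root.findall(".//span[@class='price']")]

print(list(zip(names, prices)))
# [('Widget A', '$9.99'), ('Widget B', '$14.50')]
```

Asking for `.name` and `.price` separately yields clean, paired fields, whereas extracting the whole list's text would fuse names and prices into one string.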
Architecture:

+------------------+       +------------------+       +------------------+
|  Target          |       |  Scraper MCP     |       |  AI Assistant    |
|  Websites        |<----->|  Bridge (npx)    |<----->|  (Claude, etc.)  |
|  (Internet)      | HTTP  |  + Headless      | stdio |                  |
|                  |       |  Browser+Parser  |       |                  |
+------------------+       +------------------+       +------------------+
        |
        v
+------------------------------------------------------+
|  Fetch > Render > Select > Extract > Normalize        |
+------------------------------------------------------+
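The Fetch > Render > Select > Extract > Normalize pipeline can be sketched as plain functions. This is a conceptual sketch, not the server's actual implementation; every function body here is a stand-in (the real server issues HTTP requests, runs a headless browser, and evaluates real selectors):

```python
import re

def fetch(url: str) -> str:
    # Stand-in for an HTTP request; returns a canned fragment.
    return "<span class='price'>  $1,299.00 </span>"

def render(html: str) -> str:
    # The headless browser would execute JavaScript here; no-op in this sketch.
    return html

def select(dom: str, pattern: str) -> list[str]:
    # Stand-in for CSS/XPath selection: pull inner text with a regex.
    return re.findall(pattern, dom)

def extract(nodes: list[str]) -> list[str]:
    return [n.strip() for n in nodes]

def normalize(values: list[str]) -> list[float]:
    # Standardize price strings into floats.
    return [float(v.lstrip("$").replace(",", "")) for v in values]

dom = render(fetch("https://example.com/products"))
raw = select(dom, r">([^<]+)<")
print(normalize(extract(raw)))  # [1299.0]
```

Each stage has a single responsibility, which is why the same pipeline handles static HTML (render is a no-op) and JavaScript-heavy pages (render does the work) without changing the downstream stages.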

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| HEADLESS | boolean | true | Run the browser engine in headless mode for server environments |
| RESPECT_ROBOTS | boolean | true | Honor robots.txt directives when scraping target websites |
| request_delay | integer | 1000 | Delay in milliseconds between consecutive requests to the same domain |
| max_pages | integer | 50 | Maximum number of pages to follow during paginated scraping operations |
| output_format | string | json | Default output format for extracted data (json, csv, markdown) |

Best Practices

  1. Start with specific selectors rather than broad extraction. Define precise CSS selectors or XPath expressions for the data you need rather than scraping entire pages. Targeted extraction produces cleaner data and reduces the amount of post-processing needed to isolate useful information.

  2. Test selectors on a single page before pagination. Before enabling multi-page scraping, verify your extraction selectors work correctly on a single page. Incorrect selectors applied across dozens of pages waste time and API resources while producing unusable data.

  3. Respect rate limits and robots.txt. Keep RESPECT_ROBOTS enabled and configure appropriate request_delay values. Responsible scraping maintains your IP reputation and avoids legal issues. Most websites allow reasonable automated access but block aggressive scrapers.

  4. Handle dynamic content with wait strategies. Pages that load data asynchronously need time for content to render before extraction. Configure appropriate wait conditions to ensure the target elements are present in the DOM before attempting to extract their content.

  5. Normalize extracted data for consistent downstream processing. Use the server's data normalization features to standardize formats, clean whitespace, and structure extracted data consistently. This is especially important when scraping from multiple sources that use different formatting conventions.

Common Issues

Selectors return empty results on JavaScript-rendered pages. If the target content is loaded dynamically through JavaScript, the HTML source may not contain the elements you are targeting. Ensure the headless browser is enabled and allow sufficient render time for the JavaScript to execute and populate the DOM.
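A render-wait strategy boils down to polling the DOM until the target selector matches or a timeout elapses. A sketch of that loop, with a fake DOM query standing in for the browser (names are illustrative, not this server's API):

```python
import time

def wait_for(check, timeout_s: float = 10.0, poll_s: float = 0.25):
    """Poll `check` until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(poll_s)
    raise TimeoutError("element did not appear before the timeout")

# Example: a fake DOM query that only succeeds on the third poll,
# simulating content that JavaScript injects after a delay.
calls = {"n": 0}
def query_dom():
    calls["n"] += 1
    return ["<div class='price'>"] if calls["n"] >= 3 else []

print(wait_for(query_dom, timeout_s=2.0, poll_s=0.01))
```

The timeout matters: without it, a selector that never matches (a typo, or an element behind a login) would hang the scrape indefinitely instead of failing with a diagnosable error.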

Pagination scraping stops prematurely. The server may not detect the pagination pattern automatically. Check whether the pagination uses standard next/previous links, numbered pages, or infinite scrolling. For non-standard pagination, provide explicit pagination selectors to guide the scraper.
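A pagination-following loop with a max_pages guard can be sketched like this; the in-memory page map stands in for real next-link detection, and the URLs are hypothetical:

```python
# Hypothetical site: each page maps to (items, next-page URL or None).
PAGES = {
    "/jobs?page=1": (["Engineer", "Designer"], "/jobs?page=2"),
    "/jobs?page=2": (["Analyst"], "/jobs?page=3"),
    "/jobs?page=3": (["Writer"], None),  # no next link: pagination ends
}

def scrape_all(start: str, max_pages: int = 50) -> list[str]:
    """Follow next-page links until none remain or max_pages is reached."""
    items: list[str] = []
    url, visited = start, 0
    while url is not None and visited < max_pages:
        page_items, next_url = PAGES[url]
        items.extend(page_items)
        url = next_url
        visited += 1
    return items

print(scrape_all("/jobs?page=1"))
# ['Engineer', 'Designer', 'Analyst', 'Writer']
```

The max_pages cap is what prevents runaway scrapes when a site's "next" link loops back on itself or an infinite-scroll feed never terminates.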

Extracted data contains HTML artifacts or noise. Some elements include hidden text, aria labels, or nested elements that appear in extracted text. Refine your selectors to target more specific child elements, or apply post-extraction cleaning to remove unwanted HTML artifacts from the output.
