Precision Web Scraping MCP Bridge
An all-in-one MCP server for scraping, crawling, and mapping websites. Built for Claude Code with best practices and real-world patterns.
Precision Web Scraping MCP Bridge is an MCP server designed for targeted, high-accuracy web data extraction. It provides AI assistants with advanced scraping capabilities that handle complex page structures, dynamic content, and structured data extraction. The bridge goes beyond basic fetching by offering CSS selector targeting, XPath queries, pagination handling, and data normalization features that enable precise extraction of specific data points from web pages.
When to Use This MCP Server
Connect this server when...
- You need to extract specific data points from web pages using CSS selectors or XPath expressions with high precision
- Your workflow involves scraping structured data like product listings, price tables, or directory entries from websites
- You want to handle multi-page scraping with automatic pagination detection and following
- You need to extract data from pages with complex layouts where simple text extraction produces messy results
- You are building data collection pipelines that need consistent, structured output from diverse web sources
Consider alternatives when...
- You only need to read article content or documentation (use a content reading MCP server)
- Your scraping target has an official API that provides the data in structured format
- You need real-time data streaming rather than on-demand page scraping
Quick Start
`.mcp.json` configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": ["-y", "@mcp/precision-web-scraper"],
      "env": {
        "HEADLESS": "true",
        "RESPECT_ROBOTS": "true"
      }
    }
  }
}
```
Connection setup:
- Ensure Node.js 18+ is installed on your system
- The server requires Chromium for JavaScript-rendered page scraping
- Add the configuration above to your `.mcp.json` file
- Restart your MCP client to activate the web scraper
Example tool usage:
```
# Extract product data
> Scrape all product names and prices from the category page at https://example.com/products

# Use CSS selectors
> Extract all elements matching ".review-card .rating" from the reviews page

# Handle pagination
> Scrape all job listings from the careers page, following pagination to get all results
```
Core Concepts
| Concept | Purpose | Details |
|---|---|---|
| CSS Selectors | Element targeting | Precise targeting of page elements using CSS selector syntax for clean data extraction |
| XPath Queries | Advanced selection | XML path expressions for complex element selection including parent, sibling, and attribute traversal |
| Pagination Handling | Multi-page extraction | Automatic detection and following of pagination links to scrape data spanning multiple pages |
| Data Normalization | Output consistency | Cleaning and standardizing extracted data into consistent formats (JSON, CSV, Markdown tables) |
| Rate Control | Responsible scraping | Configurable request delays and concurrent request limits to avoid overloading target servers |
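The rate-control concept above can be sketched as a simple per-domain throttle. This is a minimal illustration, not the server's internal implementation; the class name, `delay_ms` parameter, and 100 ms test value are all hypothetical:

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between consecutive requests to the same domain."""

    def __init__(self, delay_ms: int = 1000):
        self.delay = delay_ms / 1000.0
        self.last_request: dict[str, float] = {}  # domain -> last request time

    def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)  # pause until the delay has passed
        self.last_request[domain] = time.monotonic()

# Three requests to the same domain: the second and third are delayed.
throttle = DomainThrottle(delay_ms=100)
start = time.monotonic()
for _ in range(3):
    throttle.wait("https://example.com/products")
elapsed = time.monotonic() - start
```

Requests to different domains are tracked independently, so a crawl spanning several sites is not slowed down by a single domain's delay.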
Architecture:
```
+------------------+       +------------------+       +------------------+
|     Target       |       |   Scraper MCP    |       |   AI Assistant   |
|    Websites      |<----->|   Bridge (npx)   |<----->|  (Claude, etc.)  |
|   (Internet)     | HTTP  |   + Headless     | stdio |                  |
|                  |       | Browser + Parser |       |                  |
+------------------+       +------------------+       +------------------+
                                    |
                                    v
        +------------------------------------------------------+
        |    Fetch > Render > Select > Extract > Normalize     |
        +------------------------------------------------------+
```
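The Select > Extract > Normalize stages of the pipeline can be illustrated with Python's standard-library HTML parser. The HTML fixture and the `extract` helper below are hypothetical stand-ins for what the server does after fetching and rendering a page; the sketch only handles class-based selection and ignores void elements:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text content of every element carrying a target class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0               # >0 while inside a matching element
        self.results: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")  # start a new result for this element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

def extract(html: str, target_class: str) -> list[str]:
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    # Normalize: collapse runs of whitespace in each extracted text
    return [" ".join(text.split()) for text in parser.results]

page = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$19.50</span></li>
</ul>
"""
```

Selecting `"product"` yields one combined text per listing, while selecting `"price"` pulls only the price cells, which shows why narrow selectors produce cleaner output.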
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| HEADLESS | boolean | true | Run the browser engine in headless mode for server environments |
| RESPECT_ROBOTS | boolean | true | Honor robots.txt directives when scraping target websites |
| request_delay | integer | 1000 | Delay in milliseconds between consecutive requests to the same domain |
| max_pages | integer | 50 | Maximum number of pages to follow during paginated scraping operations |
| output_format | string | json | Default output format for extracted data (json, csv, markdown) |
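Applied to the Quick Start setup, a tuned configuration might look like the sketch below. Whether `request_delay`, `max_pages`, and `output_format` are read from uppercase environment variables is an assumption; check the server's documentation for the exact names:

```json
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": ["-y", "@mcp/precision-web-scraper"],
      "env": {
        "HEADLESS": "true",
        "RESPECT_ROBOTS": "true",
        "REQUEST_DELAY": "2000",
        "MAX_PAGES": "20",
        "OUTPUT_FORMAT": "markdown"
      }
    }
  }
}
```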
Best Practices
- Start with specific selectors rather than broad extraction. Define precise CSS selectors or XPath expressions for the data you need rather than scraping entire pages. Targeted extraction produces cleaner data and reduces the amount of post-processing needed to isolate useful information.
- Test selectors on a single page before pagination. Before enabling multi-page scraping, verify your extraction selectors work correctly on a single page. Incorrect selectors applied across dozens of pages waste time and API resources while producing unusable data.
- Respect rate limits and robots.txt. Keep `RESPECT_ROBOTS` enabled and configure appropriate `request_delay` values. Responsible scraping maintains your IP reputation and avoids legal issues. Most websites allow reasonable automated access but block aggressive scrapers.
- Handle dynamic content with wait strategies. Pages that load data asynchronously need time for content to render before extraction. Configure appropriate wait conditions to ensure the target elements are present in the DOM before attempting to extract their content.
- Normalize extracted data for consistent downstream processing. Use the server's data normalization features to standardize formats, clean whitespace, and structure extracted data consistently. This is especially important when scraping from multiple sources that use different formatting conventions.
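As an illustration of the normalization practice, the sketch below standardizes scraped name/price strings into structured records. The field names, the naive currency check, and the sample rows are all illustrative, not part of the server's output format:

```python
import re

def normalize_record(raw_name: str, raw_price: str) -> dict:
    """Clean whitespace and coerce a price string into a numeric value."""
    name = " ".join(raw_name.split())               # collapse runs of whitespace
    match = re.search(r"([\d.,]+)", raw_price)      # pull out the numeric part
    price = float(match.group(1).replace(",", "")) if match else None
    currency = "USD" if "$" in raw_price else None  # naive currency detection
    return {"name": name, "price": price, "currency": currency}

# Raw values as they might come off differently formatted pages
rows = [
    ("  Widget\n Deluxe ", "$1,299.00"),
    ("Gadget", "Price: $19.50"),
]
records = [normalize_record(name, price) for name, price in rows]
```

Pushing this cleanup into one place means downstream consumers see a single schema regardless of how each source site formatted its listings.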
Common Issues
Selectors return empty results on JavaScript-rendered pages. If the target content is loaded dynamically through JavaScript, the HTML source may not contain the elements you are targeting. Ensure the headless browser is enabled and allow sufficient render time for the JavaScript to execute and populate the DOM.
Pagination scraping stops prematurely. The server may not detect the pagination pattern automatically. Check whether the pagination uses standard next/previous links, numbered pages, or infinite scrolling. For non-standard pagination, provide explicit pagination selectors to guide the scraper.
Extracted data contains HTML artifacts or noise. Some elements include hidden text, aria labels, or nested elements that appear in extracted text. Refine your selectors to target more specific child elements, or apply post-extraction cleaning to remove unwanted HTML artifacts from the output.
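A hedged sketch of the post-extraction cleaning mentioned above: stripping leftover tags and script/style content from a captured fragment. This is a simplified cleaner, not the server's built-in normalizer, and the fragment is a made-up example:

```python
from html.parser import HTMLParser

class TextCleaner(HTMLParser):
    """Strip tags from an HTML fragment, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # >0 while inside a script/style element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def clean_fragment(fragment: str) -> str:
    cleaner = TextCleaner()
    cleaner.feed(fragment)
    return " ".join("".join(cleaner.parts).split())  # collapse whitespace

fragment = '<div>4.5 stars<script>trackView()</script> <span>(213 reviews)</span></div>'
```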