Building Agentic Web Scrapers That Scale to Millions of Records

Traditional scrapers break. They break when a CSS selector changes, when JavaScript renders content dynamically, and when anti-bot systems upgrade. After building scraping pipelines that have collectively processed millions of records for clients, I moved to an agentic architecture, and it changed what is practical to scrape.

Traditional vs Agentic: What's the Difference?

A traditional scraper is a rigid script: navigate to URL, find selector, extract value, repeat. An agentic scraper is goal-directed: you give it an objective and it determines the steps.

Scenario	Traditional Scraper	Agentic Scraper
CSS selector changes	Breaks immediately	Adapts via semantic extraction
JavaScript-rendered content	Needs Playwright workaround	Handled natively
Novel page structure	Fails	LLM reads it
Pagination changes	Breaks	Agent discovers new pattern
Scale to 1M+ records	Manual monitoring required	Checkpoint + auto-resume

Architecture: Three Agents, One Pipeline

Here's the architecture I use across client projects:

Orchestrator Agent (LangChain + Claude/GPT)
    ├── Browser Agent (Playwright)
    │   ├── Navigation
    │   ├── Interaction (clicks, forms, infinite scroll)
    │   └── DOM snapshot for extraction
    └── Extraction Agent (LLM)
        ├── Semantic field extraction
        ├── Schema validation (Pydantic)
        └── Deduplication
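
The Deduplication step in the diagram is not spelled out elsewhere in this post. One minimal way to approach it, assuming records are plain dicts and that a couple of fields identify a company (the `name` and `website` defaults and both function names are illustrative), is to hash a normalized key and keep the first record per key:

```python
import hashlib


def record_key(record: dict, key_fields: tuple[str, ...] = ("name", "website")) -> str:
    """Build a stable fingerprint from the normalized identifying fields."""
    parts = []
    for field in key_fields:
        value = record.get(field) or ""
        # Lowercase and collapse whitespace so trivial variations match.
        parts.append(" ".join(str(value).lower().split()))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each fingerprint, preserving order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For runs too large to hold in memory, the same fingerprint could be stored in a database column with a unique index instead of a Python set.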

The Orchestrator Agent

The orchestrator is a LangChain agent that decides what to do next. It has tools it can call — each tool wraps a browser action or data operation:

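from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools import tool
from pydantic import create_model

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

# `page` is assumed to be an async Playwright Page opened from the
# browser context built in the next section.

@tool
async def navigate(url: str) -> str:
    """Navigate browser to a URL. Returns page title."""
    await page.goto(url, wait_until="networkidle")
    return await page.title()

@tool
async def get_links(contains: str) -> list[str]:
    """Get all href links containing a substring."""
    els = await page.query_selector_all(f'a[href*="{contains}"]')
    hrefs = [await el.get_attribute("href") for el in els]
    return [h for h in hrefs if h]

@tool
async def extract_data(fields: list[str]) -> dict:
    """Extract named fields from the current page using LLM."""
    html = await page.content()  # full HTML; semantic_extract cleans it to text
    # Build a throwaway schema with one optional string per requested field.
    schema = create_model("Extracted", **{f: (str | None, None) for f in fields})
    return await semantic_extract(html, schema)

@tool
async def find_next_page() -> str | None:
    """Find next pagination URL, "__click__" for a button, or None if last page."""
    for sel in ['a[aria-label="Next"]', 'a:has-text("Next")',
                'button:has-text("Load more")']:
        el = await page.query_selector(sel)
        if el:
            return (await el.get_attribute("href")) or "__click__"
    return None

# Wiring sketch: the prompt wording and iteration cap are illustrative.
tools = [navigate, get_links, extract_data, find_next_page]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a web scraping agent. Use the tools to reach the goal."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, max_iterations=25)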
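
The agent chooses the order of tool calls on its own, but it helps to see the loop it usually settles into on a paginated listing. Below is a minimal sketch with the tools passed in as plain async callables; the names `crawl_listing`, `click_next`, and `max_pages` are illustrative and not part of the code above. It also shows one way the `"__click__"` sentinel returned by `find_next_page` might be handled:

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def crawl_listing(
    start_url: str,
    navigate: Callable[[str], Awaitable[str]],
    extract: Callable[[], Awaitable[dict]],
    find_next: Callable[[], Awaitable[Optional[str]]],
    click_next: Callable[[], Awaitable[None]],
    max_pages: int = 1000,
) -> list[dict]:
    """Walk a paginated listing until there is no next page."""
    results = []
    await navigate(start_url)
    for _ in range(max_pages):  # hard cap so a pagination cycle cannot run forever
        results.append(await extract())
        nxt = await find_next()
        if nxt is None:
            break
        if nxt == "__click__":
            # No href to follow: the control is a button, so click it instead.
            await click_next()
        else:
            await navigate(nxt)
    return results
```

Note that an `href` can be relative, so in practice the value would likely need to be resolved against the current page URL (for example with `urllib.parse.urljoin`) before navigating.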

Browser Agent with Playwright

The browser context needs to look like a real user to avoid bot detection:

from playwright.async_api import async_playwright

async def make_browser():
    pw = await async_playwright().start()
    browser = await pw.chromium.launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled"],
    )
    ctx = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Stealth: hide navigator.webdriver
    await ctx.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return ctx

Semantic Extraction — No Selectors Required

This is the part that makes agentic scrapers durable. Instead of CSS selectors that break when a site redesigns, you pass the page text to an LLM and ask it to find the data:

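import json

import trafilatura
from pydantic import BaseModel

class CompanyProfile(BaseModel):
    name: str
    website: str | None = None
    email: str | None = None
    employee_count: str | None = None
    description: str | None = None

async def semantic_extract(html: str, schema: type[BaseModel]) -> dict:
    # Clean HTML to text first (removes nav, ads, boilerplate);
    # fall back to the raw input if trafilatura finds no main content.
    text = trafilatura.extract(html, include_tables=True) or html

    prompt = f"""Extract the following data from this webpage.
Return valid JSON matching this schema exactly.
If a field is not found, return null for it.

Schema fields: {list(schema.model_fields.keys())}

Webpage content:
{text[:8000]}"""

    # `llm` is the chat model defined in the orchestrator section.
    response = await llm.ainvoke(prompt)
    # Models sometimes wrap JSON in a Markdown code fence; strip it before parsing.
    fence = "`" * 3
    raw = response.content.strip()
    raw = raw.removeprefix(fence + "json").removeprefix(fence).removesuffix(fence)
    # Validate against the schema so malformed output fails loudly.
    return schema.model_validate(json.loads(raw)).model_dump()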

When the site redesigns its layout next month, this should keep working as long as the data still appears in the page text. The selector-based scraper breaks.
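
Parsing the model's reply is the most fragile step in `semantic_extract`, since replies can arrive with a sentence of preamble or wrapped in a code fence. A defensive parser, sketched here with only the standard library (the helper name `parse_llm_json` is an assumption, not part of any library), pulls out the first JSON object it can decode and fills missing fields with `None`:

```python
import json


def parse_llm_json(raw: str, fields: list[str]) -> dict:
    """Extract the first JSON object from an LLM reply and keep only `fields`."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = raw.find("{", start + 1)
            continue
        # Unknown keys are dropped; missing ones become None.
        return {name: obj.get(name) for name in fields}
    raise ValueError("no JSON object found in model reply")
```

The resulting dict can then be handed to the Pydantic schema for type validation, so a bad reply raises an error instead of silently writing garbage to the dataset.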

Fault Tolerance at Scale

The real engineering challenge isn't scraping 10 pages — it's scraping 500,000 reliably. Three things you must have:

1. Checkpoint / Resume

import sqlite3, json

class Checkpoint:
    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(url TEXT PRIMARY KEY, status TEXT, data TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
        )

    def done(self, url: str, data: dict):
        self.db.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?,?,?,CURRENT_TIMESTAMP)",
            (url, "done", json.dumps(data))
        )
        self.db.commit()

    def pending(self, urls: list[str]) -> list[str]:
        done = {r[0] for r in self.db.execute(
            "SELECT url FROM jobs WHERE status='done'"
        )}
        return [u for u in urls if u not in done]
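
The class above only records successes, so a URL that always fails comes back as pending on every resume. One possible extension is to count failed attempts and stop retrying after a cutoff. This is a sketch under assumptions: the `attempts` column, the `failed()` method, and the `max_attempts` default are additions for illustration, and the upsert syntax requires SQLite 3.24 or newer.

```python
import json
import sqlite3


class RetryCheckpoint:
    """Checkpoint that also counts failed attempts per URL (illustrative)."""

    def __init__(self, path: str, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(url TEXT PRIMARY KEY, status TEXT, data TEXT, "
            "attempts INTEGER DEFAULT 0)"
        )

    def done(self, url: str, data: dict) -> None:
        self.db.execute(
            "INSERT INTO jobs (url, status, data) VALUES (?, 'done', ?) "
            "ON CONFLICT(url) DO UPDATE SET status='done', data=excluded.data",
            (url, json.dumps(data)),
        )
        self.db.commit()

    def failed(self, url: str) -> None:
        # Insert with attempts=1, or bump the counter if the row already exists.
        self.db.execute(
            "INSERT INTO jobs (url, status, attempts) VALUES (?, 'failed', 1) "
            "ON CONFLICT(url) DO UPDATE SET status='failed', attempts=attempts+1",
            (url,),
        )
        self.db.commit()

    def pending(self, urls: list[str]) -> list[str]:
        # Skip URLs that are done or have used up their attempts.
        skip = {r[0] for r in self.db.execute(
            "SELECT url FROM jobs WHERE status='done' OR attempts >= ?",
            (self.max_attempts,),
        )}
        return [u for u in urls if u not in skip]
```

URLs that exhaust their attempts stay in the table with status `failed`, which makes it easy to query them afterwards and decide whether they deserve manual attention.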

2. Proxy Rotation

import random

class ProxyPool:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.failures: dict[str, int] = {}

    def get(self) -> str:
        available = [p for p in self.proxies
                     if self.failures.get(p, 0) < 3]
        if not available:
            raise RuntimeError("All proxies exhausted")
        return random.choice(available)

    def fail(self, proxy: str):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
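
How the pool plugs into a request loop is not shown above. Here is a rough sketch, assuming a pool object with the same `get()` and `fail()` methods and a caller-supplied `fetch(url, proxy)` coroutine; both `fetch_with_rotation` and `fetch` are hypothetical names:

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_rotation(
    url: str,
    pool,  # any object exposing get() -> str and fail(proxy) -> None
    fetch: Callable[[str, str], Awaitable[str]],
    max_tries: int = 5,
) -> str:
    """Try a URL through different proxies until one succeeds."""
    last_error = None
    for _ in range(max_tries):
        proxy = pool.get()  # raises RuntimeError once every proxy is benched
        try:
            return await fetch(url, proxy)
        except Exception as exc:  # narrow this to network errors in real code
            pool.fail(proxy)
            last_error = exc
    raise RuntimeError(f"{url} failed after {max_tries} tries") from last_error
```

Because `ProxyPool.get()` picks at random, the same proxy can be chosen twice in a row; the failure counter still benches it after three strikes.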

3. Respectful Rate Limiting

import asyncio, random

async def human_delay(min_s=1.5, max_s=4.0):
    """Mimics human reading time between requests."""
    await asyncio.sleep(random.uniform(min_s, max_s))

# For heavier sites, scale the delay with page length and add jitter
async def adaptive_delay(page_text_len: int):
    base = 1.5 + (page_text_len / 10000)  # longer pages = longer "read" time
    await asyncio.sleep(min(base, 5.0) + random.uniform(0, 0.5))
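
Random sleeps pace a single worker, but with several workers the same host can still be hit in bursts. One possible complement is a per-host minimum interval. The sketch below is an assumption about how that could look (the `HostThrottle` class and its 2-second default are illustrative); it takes the current time as an argument so the logic stays easy to test:

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Track the next allowed request time per host."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self.next_slot: dict[str, float] = {}

    def wait_for(self, url: str, now: float) -> float:
        """Return how many seconds to sleep before requesting `url`."""
        host = urlsplit(url).netloc
        slot = max(now, self.next_slot.get(host, now))
        # Reserve the slot so later callers queue up behind it.
        self.next_slot[host] = slot + self.min_interval
        return slot - now
```

In an async worker this would be used as `await asyncio.sleep(throttle.wait_for(url, time.monotonic()))` before each request.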

Real-World Results

On a B2B directory scraping project:

- 2.4 million records extracted over 6 days
- 99.2% success rate with checkpoint/resume
- $47 total proxy cost for the full run
- Zero manual intervention after initial launch
- Client had been paying a data vendor $800/month for a fraction of this data

When to Use Agentic vs Traditional

Scenario	Best Approach
Static HTML, stable structure	Traditional (BeautifulSoup)
JavaScript-heavy, stable structure	Playwright + selectors
Frequently changing layout	Agentic (LLM extraction)
Millions of records, long-running	Agentic + checkpointing
Novel pages, unknown structure	Agentic (only option)

Frequently Asked Questions

Is web scraping legal?

Legality varies by jurisdiction and the site's Terms of Service. Always check robots.txt, respect rate limits, and only collect publicly available data. I build scrapers that target publicly accessible data and are designed to avoid overloading target servers.
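
The robots.txt check can be automated with Python's standard `urllib.robotparser`. The sketch below parses rules from a string so it runs offline; in practice the file would be downloaded from each host's `/robots.txt` first. The user-agent string `my-scraper`, the sample rules, and the `build_rules` helper are placeholders:

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("my-scraper", "https://example.com/companies/1")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/x")
delay = rules.crawl_delay("my-scraper")  # None when no Crawl-delay is set
```

A crawl delay reported this way is a natural lower bound for whatever rate limiting the scraper applies to that host.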

Which LLM works best for extraction?

I use Claude 3.5 Sonnet for accuracy-critical extraction and GPT-4o-mini for high-volume, cost-sensitive runs. For structured extraction with a clear Pydantic schema, both perform similarly, because the schema does most of the work.

How do you handle CAPTCHAs?

For human-solvable CAPTCHAs I use 2captcha or anti-captcha API integrations. For Cloudflare Turnstile, undetected-chromedriver + proper browser fingerprinting handles most cases. Some targets are simply not worth the engineering effort.

Can this run on a cheap VPS?

Yes. Small jobs run fine on a $6/month VPS. For large distributed scrapes I use Celery + Redis for task queuing, with multiple workers spread across different IP ranges.