Traditional scrapers break. They break when a CSS selector changes, when JavaScript renders content dynamically, and when anti-bot systems upgrade. After building scraping pipelines that have collectively processed millions of records for clients, I moved to an agentic architecture — and it changed what's possible.
Traditional vs Agentic: What's the Difference?
A traditional scraper is a rigid script: navigate to URL, find selector, extract value, repeat. An agentic scraper is goal-directed: you give it an objective and it determines the steps.
| Scenario | Traditional Scraper | Agentic Scraper |
|---|---|---|
| CSS selector changes | Breaks immediately | Adapts via semantic extraction |
| JavaScript-rendered content | Needs Playwright workaround | Handled natively |
| Novel page structure | Fails | LLM reads it |
| Pagination changes | Breaks | Agent discovers new pattern |
| Scale to 1M+ records | Manual monitoring required | Checkpoint + auto-resume |
Architecture: Three Agents, One Pipeline
Here's the architecture I use across client projects:
Orchestrator Agent (LangChain + Claude/GPT)
├── Browser Agent (Playwright)
│   ├── Navigation
│   ├── Interaction (clicks, forms, infinite scroll)
│   └── DOM snapshot for extraction
└── Extraction Agent (LLM)
    ├── Semantic field extraction
    ├── Schema validation (Pydantic)
    └── Deduplication

The Orchestrator Agent
The orchestrator is a LangChain agent that decides what to do next. It has tools it can call — each tool wraps a browser action or data operation:
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import tool
from langchain_anthropic import ChatAnthropic
from pydantic import create_model

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

# `page` is assumed to be an async Playwright Page opened from the browser
# context built in the next section; `semantic_extract` is defined further down.

@tool
async def navigate(url: str) -> str:
    """Navigate browser to a URL. Returns page title."""
    await page.goto(url, wait_until="networkidle")
    return await page.title()

@tool
async def get_links(contains: str) -> list[str]:
    """Get all href links containing a substring."""
    els = await page.query_selector_all(f'a[href*="{contains}"]')
    return [await el.get_attribute("href") for el in els]

@tool
async def extract_data(fields: list[str]) -> dict:
    """Extract named fields from the current page using LLM."""
    html = await page.content()  # raw HTML; semantic_extract reduces it to text
    # Build a throwaway schema in which every requested field is optional text
    schema = create_model("Extracted", **{f: (str | None, None) for f in fields})
    return await semantic_extract(html, schema)

@tool
async def find_next_page() -> str | None:
    """Find next pagination URL, or None if last page."""
    for sel in ['a[aria-label="Next"]', 'a:has-text("Next")',
                'button:has-text("Load more")']:
        el = await page.query_selector(sel)
        if el:
            # "__click__" signals a control with no href that must be clicked
            return await el.get_attribute("href") or "__click__"
    return None

Browser Agent with Playwright
The browser context needs to look like a real user to avoid bot detection:
from playwright.async_api import async_playwright

async def make_browser():
    pw = await async_playwright().start()
    browser = await pw.chromium.launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled"],
    )
    ctx = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Stealth: hide navigator.webdriver
    await ctx.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return ctx

Semantic Extraction: No Selectors Required
This is the part that makes agentic scrapers durable. Instead of CSS selectors that break when a site redesigns, you pass the page text to an LLM and ask it to find the data:
import json

import trafilatura
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")  # same model as the orchestrator

class CompanyProfile(BaseModel):
    name: str
    # Optional fields default to None so a key the model omits still validates
    website: str | None = None
    email: str | None = None
    employee_count: str | None = None
    description: str | None = None

async def semantic_extract(html: str, schema: type[BaseModel]) -> dict:
    # Clean HTML to text first (removes nav, ads, boilerplate)
    text = trafilatura.extract(html, include_tables=True) or html[:6000]
    prompt = f"""Extract the following data from this webpage.
Return valid JSON matching this schema exactly.
If a field is not found, return null for it.
Schema fields: {list(schema.model_fields.keys())}
Webpage content:
{text[:8000]}"""
    response = await llm.ainvoke(prompt)
    raw = response.content
    # Keep only the outermost JSON object in case the model adds surrounding text
    data = json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
    return schema(**data).model_dump()

When the site redesigns its layout next month, this still works as long as the data remains in the page text. The selector-based scraper breaks.
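
The architecture diagram lists deduplication under the extraction agent, but no code for it appears in this post. Here is a minimal sketch of one way to do it, assuming records are plain dicts shaped like `CompanyProfile.model_dump()` output and that name plus website identify a company. The key choice and the `record_key` and `deduplicate` helpers are assumptions made for illustration, not part of the pipeline shown here.

```python
import hashlib
import json

def record_key(record: dict, key_fields: tuple[str, ...] = ("name", "website")) -> str:
    """Hash the normalized identifying fields of a record."""
    # Normalize case, surrounding whitespace, and trailing slashes so that
    # cosmetic differences between two scrapes do not create duplicates.
    normalized = [(record.get(f) or "").strip().lower().rstrip("/") for f in key_fields]
    return hashlib.sha256(json.dumps(normalized).encode("utf-8")).hexdigest()

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"name": "Acme Corp", "website": "https://acme.example/"},
    {"name": " acme corp ", "website": "https://acme.example"},
    {"name": "Globex", "website": None},
]
print(len(deduplicate(records)))  # prints 2
```

For runs too large to hold in memory, the same key could be stored in a database column with a uniqueness constraint instead of a Python set.
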
Fault Tolerance at Scale
The real engineering challenge isn't scraping 10 pages — it's scraping 500,000 reliably. Three things you must have:
1. Checkpoint / Resume
import json
import sqlite3

class Checkpoint:
    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(url TEXT PRIMARY KEY, status TEXT, data TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
        )

    def done(self, url: str, data: dict):
        # Upsert so that re-scraping a URL overwrites its earlier row
        self.db.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?,?,?,CURRENT_TIMESTAMP)",
            (url, "done", json.dumps(data))
        )
        self.db.commit()

    def pending(self, urls: list[str]) -> list[str]:
        # URLs not yet marked done; after a restart only these are scraped
        done = {r[0] for r in self.db.execute(
            "SELECT url FROM jobs WHERE status='done'"
        )}
        return [u for u in urls if u not in done]

2. Proxy Rotation
import random

class ProxyPool:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.failures: dict[str, int] = {}

    def get(self) -> str:
        available = [p for p in self.proxies
                     if self.failures.get(p, 0) < 3]
        if not available:
            raise RuntimeError("All proxies exhausted")
        return random.choice(available)

    def fail(self, proxy: str):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1

3. Respectful Rate Limiting
import asyncio
import random

async def human_delay(min_s: float = 1.5, max_s: float = 4.0):
    """Mimics human reading time between requests."""
    await asyncio.sleep(random.uniform(min_s, max_s))

# For heavier sites, scale the delay with page length (capped at 5 seconds)
async def adaptive_delay(page_text_len: int):
    base = 1.5 + (page_text_len / 10000)  # longer pages = longer "read" time
    await asyncio.sleep(min(base, 5.0))

Real-World Results
On a B2B directory scraping project:
- 2.4 million records extracted over 6 days
- 99.2% success rate with checkpoint/resume
- $47 total proxy cost for the full run
- Zero manual intervention after initial launch
- Client had been paying a data vendor $800/month for a fraction of this data
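
A quick back-of-the-envelope check shows what these figures imply for throughput and unit cost. The record count, duration, and proxy cost come from the list above. The concurrency estimate rests on an assumption the post does not state: one record per request, each followed by `human_delay()` with its default bounds.

```python
RECORDS = 2_400_000
DAYS = 6
PROXY_COST_USD = 47.0

# Sustained throughput needed to finish in the stated time
records_per_second = RECORDS / (DAYS * 24 * 60 * 60)

# Proxy spend per thousand records
cost_per_1k_records = PROXY_COST_USD / (RECORDS / 1000)

# Assumption (not stated in the post): one record per request, each followed by
# human_delay() with its defaults, whose mean is (1.5 + 4.0) / 2 seconds.
mean_delay_s = (1.5 + 4.0) / 2
min_concurrent_sessions = records_per_second * mean_delay_s

print(f"{records_per_second:.2f} records/s")             # 4.63 records/s
print(f"${cost_per_1k_records:.4f} per 1,000 records")   # $0.0196 per 1,000 records
print(f"{min_concurrent_sessions:.1f} concurrent sessions")  # 12.7 concurrent sessions
```

Under that assumption, roughly a dozen browser sessions would need to run in parallel. Listing pages that yield many records per request would lower the number, and page load time, which is ignored here, would raise it.
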
When to Use Agentic vs Traditional
| Scenario | Best Approach |
|---|---|
| Static HTML, stable structure | Traditional (BeautifulSoup) |
| JavaScript-heavy, stable structure | Playwright + selectors |
| Frequently changing layout | Agentic (LLM extraction) |
| Millions of records, long-running | Agentic + checkpointing |
| Novel pages, unknown structure | Agentic (only option) |
Frequently Asked Questions
Is web scraping legal?
Legality varies by jurisdiction and the site's Terms of Service. Always check robots.txt, respect rate limits, and only collect publicly available data. I build scrapers that target publicly accessible data and are designed to avoid overloading target servers.
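
The robots.txt check can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses a hypothetical robots.txt from a string so it runs without network access. The rules, the `example.com` URLs, and the `my-scraper` user-agent name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string into a parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
agent = "my-scraper"  # hypothetical user-agent name

print(parser.can_fetch(agent, "https://example.com/companies/acme"))  # True
print(parser.can_fetch(agent, "https://example.com/private/admin"))   # False
print(parser.crawl_delay(agent))                                      # 2
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` fetches the real file instead. A declared crawl delay is a sensible lower bound for the delays used between requests.
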
Which LLM works best for extraction?
Claude 3.5 Sonnet for accuracy-critical extraction. GPT-4o-mini for high-volume cost efficiency. For structured extraction with a clear Pydantic schema, both perform similarly — the schema does most of the work.
How do you handle CAPTCHAs?
For human-solvable CAPTCHAs I use 2captcha or anti-captcha API integrations. For Cloudflare Turnstile, undetected-chromedriver + proper browser fingerprinting handles most cases. Some targets are simply not worth the engineering effort.
Can this run on a cheap VPS?
Yes. Small jobs run fine on a $6/month VPS. For large distributed scrapes I use Celery + Redis for task queuing, with multiple workers spread across different IP ranges.
