Scraping & Automation · 9 min read

Stealth Web Scraping in 2026: Bypassing Bot Detection with Playwright

A technical deep-dive into modern bot detection systems and how to build Playwright scrapers that reliably pass them - browser fingerprinting, TLS signatures, behavioral patterns, and proxy architecture.

Muhammad Farhan

AI Engineer · Founder of Datraxa

Bot detection has gotten dramatically better in the past two years. Cloudflare, DataDome, PerimeterX, and Akamai Bot Manager now fingerprint browsers at a level that breaks naive Playwright scrapers within seconds. This post covers exactly how modern detection works and how to build scrapers that pass it reliably at production scale.

How Modern Bot Detection Actually Works

Most developers think bot detection is about user-agent strings and IP reputation. Those checks matter, but they are the easiest to pass. The hard part is the multi-layer fingerprint that runs in the background of most enterprise anti-bot systems.

  • Browser fingerprint - Canvas rendering, WebGL renderer strings, audio context behavior, font enumeration, and screen resolution ratios. Headless Chrome has well-known signatures in all of these.
  • TLS fingerprint (JA3/JA4) - A hash of the TLS handshake your HTTP client sends. Python requests produces a different fingerprint from real Chrome, and many CDN-level systems reject mismatched signatures before any JavaScript runs.
  • Behavioral signals - Mouse movement curves, scroll velocity, click timing, time-on-page distributions. Real users have statistical patterns that bots do not naturally replicate.
  • Bot-specific leaks - A Playwright-driven browser reports navigator.webdriver as true and exposes specific Chrome runtime flags that detection scripts check explicitly.
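
To make the TLS layer concrete, here is a minimal sketch of how a JA3 fingerprint is derived. JA3 joins five handshake fields (TLS version, cipher suites, extensions, elliptic curves, point formats) as decimal values and takes the MD5 of the resulting string. The field values in the usage note are illustrative only, not a real browser's handshake, and the function names are my own.

```python
import hashlib

# GREASE placeholder values (0x0a0a, 0x1a1a, ... 0xfafa) are excluded from JA3.
GREASE = {(b << 8) | b for b in range(0x0A, 0x100, 0x10)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma/dash separated JA3 string from handshake fields."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join(
        "-".join(str(value) for value in field if value not in GREASE)
        for field in fields
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3 string: the value detection systems compare."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()
```

Calling ja3_string(769, [47, 53], [0, 10, 11], [23, 24], [0]) yields "769,47-53,0-10-11,23-24,0". Changing the cipher order alone changes the hash, which is why an HTTP library with its own TLS stack cannot pass as Chrome by editing headers.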

Patching the Obvious Playwright Leaks

The first thing every detection system checks is navigator.webdriver, which is true in any Playwright-controlled browser by default. Patch it in two steps: pass --disable-blink-features=AutomationControlled in the browser launch args, then override the property in an init script that runs before any page JavaScript. The same init script should inject a fake window.chrome.runtime object and spoof the plugins array, since headless Chrome has neither. Finally, set a realistic viewport, a user agent matching a current Chrome version, a locale, and a timezone so the context looks like a plausible real user profile.
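
A minimal sketch of those patches using Playwright's sync API. The viewport, locale, and timezone values are placeholders I picked for illustration, the user agent must be supplied by you, and the fake plugins array is a crude stand-in that a thorough detection script could still tell apart from a real PluginArray.

```python
# Launch flag named in the text above.
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]

# Runs before any page script. The plugin entries are illustrative stand-ins.
INIT_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
window.chrome = window.chrome || {};
window.chrome.runtime = window.chrome.runtime || {};
Object.defineProperty(navigator, 'plugins', {
  get: () => [{ name: 'PDF Viewer' }, { name: 'Chrome PDF Viewer' }],
});
"""


def context_options(user_agent, locale="en-US", timezone_id="America/New_York"):
    """Context settings for a plausible desktop profile (assumed values)."""
    return {
        "viewport": {"width": 1366, "height": 768},
        "user_agent": user_agent,
        "locale": locale,
        "timezone_id": timezone_id,
    }


def fetch_html(url, user_agent):
    """Open url in a patched context and return the rendered HTML."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            context = browser.new_context(**context_options(user_agent))
            context.add_init_script(INIT_SCRIPT)
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()
```

Keep the locale and timezone consistent with the exit location of whatever proxy you route through; a US user agent with a mismatched timezone is itself a signal.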

Fixing the TLS Fingerprint

A perfectly patched browser does not help if part of your pipeline talks to the site through a plain Python HTTP client. Playwright drives a real Chromium build, so its own handshake looks like Chrome, but a request sent with a library such as requests is flagged at the CDN layer because its TLS handshake identifies it as Python. The solution is curl_cffi (published on PyPI as curl-cffi), a Python library that impersonates the TLS fingerprint of a real Chrome browser. For pages that do not require JavaScript execution, curl_cffi with impersonate set to a current Chrome version passes TLS-level checks without spinning up a full browser context. Reserve Playwright for pages that need actual rendering or interaction.
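
A sketch of that split, assuming curl_cffi is installed. The routing helper needs_browser is a hypothetical name of mine, and impersonate="chrome" relies on curl_cffi's alias for its most recent supported Chrome profile; pin an explicit version if your installed release does not accept the alias.

```python
def needs_browser(requires_js, requires_interaction):
    """Route to Playwright only when the page must render or be driven."""
    return requires_js or requires_interaction


def fetch_static(url, proxy=None, timeout=30):
    """Fetch a non-JavaScript page with a Chrome-like TLS fingerprint."""
    # Imported lazily so this module loads without curl_cffi installed.
    from curl_cffi import requests as cffi_requests

    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = cffi_requests.get(
        url,
        impersonate="chrome",  # assumed alias for the latest Chrome profile
        proxies=proxies,
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text
```

In practice the decision is usually made per URL pattern rather than per request: listing pages and JSON endpoints often go through fetch_static, while login flows and infinite-scroll pages go through the browser.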

Proxy Architecture

IP reputation is the cheapest signal for detection systems to act on. A freshly rented datacenter IP that hits a site 500 times in an hour gets blocked regardless of how well the browser fingerprint is patched.

Proxy type | Cost | Trust score | Best for
Datacenter | $0.01-0.10/GB | Low - flagged fast | Internal tools, unprotected sites
Residential | $2-8/GB | High | Most e-commerce and media sites
ISP/Static residential | $5-15/GB | Very high | Highly protected targets, long sessions
Mobile | $15-30/GB | Highest | Financial, social, highest-security targets
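
Per-GB pricing is easier to reason about as a job budget. The sketch below turns the price ranges from the table into a low and high estimate; the page count and average page weight are numbers you have to measure yourself, and the 1 GB = 1000 MB conversion is an assumption, so check how your provider meters traffic.

```python
# Price ranges in USD per GB, taken from the table above.
PRICE_PER_GB = {
    "datacenter": (0.01, 0.10),
    "residential": (2.0, 8.0),
    "isp": (5.0, 15.0),
    "mobile": (15.0, 30.0),
}


def estimate_cost(proxy_type, pages, avg_page_mb):
    """Return (low, high) USD cost for a job of `pages` requests."""
    gigabytes = pages * avg_page_mb / 1000  # assumes decimal GB billing
    low, high = PRICE_PER_GB[proxy_type]
    return gigabytes * low, gigabytes * high
```

For example, 100,000 pages at an assumed 2 MB each is 200 GB, so estimate_cost("residential", 100_000, 2) returns (400.0, 1600.0). Blocking images and fonts in the browser context reduces page weight and therefore the bill.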

For most production scrapers, residential proxies with session stickiness - the same IP held for the full duration of a user session - give the best trust-to-cost ratio. Rotate sessions between distinct scraping targets, not between individual requests within a single session.
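
One way to implement session stickiness is to key the proxy session on the target host. This is a sketch under a loud assumption: the "-session-<id>" username suffix is a hypothetical convention, since every provider has its own syntax for pinning an IP. The returned dict uses the server, username, and password keys that Playwright's proxy option accepts.

```python
import uuid
from urllib.parse import urlsplit


class StickySessions:
    """Hold one proxy session per target host until explicitly rotated."""

    def __init__(self, server, username, password):
        self.server = server
        self.username = username
        self.password = password
        self._sessions = {}  # host -> session id

    def proxy_for(self, url):
        """Return proxy settings that reuse the same session for this host."""
        host = urlsplit(url).hostname
        if host not in self._sessions:
            self._sessions[host] = uuid.uuid4().hex[:12]
        return {
            "server": self.server,
            # Hypothetical provider syntax; replace with your provider's format.
            "username": f"{self.username}-session-{self._sessions[host]}",
            "password": self.password,
        }

    def rotate(self, url):
        """Drop the session for this host so the next call gets a new one."""
        self._sessions.pop(urlsplit(url).hostname, None)
```

Call rotate only when a session is finished or has been blocked. Rotating on every request defeats the purpose, because a single visitor whose IP changes between page loads is itself an anomaly.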

Behavioral Humanization

Detection systems model the statistical distribution of real user behavior. A scraper that loads a page and immediately extracts data looks nothing like a human. You need randomized delays between 800ms and 2400ms after page load, scroll events that move down the page in irregular increments rather than jumping to the bottom instantly, and occasional mouse movements before clicking interactive elements. The goal is not perfect human mimicry but staying within the typical range of real user timing patterns that detection thresholds are tuned against.
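
A minimal sketch of those three behaviors. The 800-2400ms delay comes from the paragraph above; the scroll step sizes, pause lengths, and mouse coordinates are arbitrary values I chose for illustration. The page argument is assumed to be a Playwright sync Page.

```python
import random


def human_delay_ms(rng=random):
    """Post-load pause in milliseconds, within the 800-2400ms range."""
    return rng.uniform(800, 2400)


def scroll_steps(total_px, rng=random):
    """Split a scroll distance into irregular increments (assumed 120-480px)."""
    steps, remaining = [], total_px
    while remaining > 0:
        step = min(remaining, rng.randint(120, 480))
        steps.append(step)
        remaining -= step
    return steps


def browse_like_a_person(page, total_scroll_px=3000, rng=random):
    """Pause, scroll unevenly, then move the mouse on a Playwright page."""
    page.wait_for_timeout(human_delay_ms(rng))
    for step in scroll_steps(total_scroll_px, rng):
        page.mouse.wheel(0, step)
        page.wait_for_timeout(rng.uniform(80, 350))
    # Move in several intermediate steps instead of teleporting the cursor.
    page.mouse.move(
        rng.randint(100, 900),
        rng.randint(100, 600),
        steps=rng.randint(8, 20),
    )
```

Passing a seeded random.Random as rng makes a run reproducible while debugging; leave the default in production so timings differ between sessions.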

Checkpoint and Resume at Scale

At scale, scrapers get blocked partially - some pages succeed, some return CAPTCHAs or 403s. A scraper without checkpointing restarts from zero on every failure. The pattern is simple: maintain a persistent set of completed URLs (a JSON file or a lightweight SQLite table), check membership before each request, write to it immediately after each successful scrape. On restart, skip everything already in the set. This turns a catastrophic failure into a resumable job with no duplicate work.
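
The pattern fits in a few lines with the standard library's sqlite3 module. This is a sketch: the table name, file name, and the run helper are my own choices, and scrape stands in for whatever function fetches and stores a single page.

```python
import sqlite3


class Checkpoint:
    """Persistent set of completed URLs backed by a SQLite table."""

    def __init__(self, path="checkpoint.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def is_done(self, url):
        row = self.conn.execute(
            "SELECT 1 FROM done WHERE url = ?", (url,)
        ).fetchone()
        return row is not None

    def mark_done(self, url):
        # Commit immediately so a crash never loses a finished URL.
        self.conn.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
        self.conn.commit()


def run(urls, scrape, checkpoint):
    """Scrape every URL not yet completed; failures stay pending."""
    for url in urls:
        if checkpoint.is_done(url):
            continue
        try:
            scrape(url)
        except Exception:
            continue  # blocked or failed: retry on the next run
        checkpoint.mark_done(url)
```

Running the same job twice only retries the URLs that failed the first time. SQLite is the safer default over a JSON file once the set grows, because each insert is a small durable write rather than a rewrite of the whole file.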

Frequently Asked Questions

Does playwright-stealth still work in 2026?

The original playwright-stealth package is outdated and no longer maintained. Apply the navigator.webdriver patch, browser launch args, and init script overrides manually as described above. Manual patching gives more control over exactly what gets overridden and is easier to keep current as detection methods evolve.

How do I handle Cloudflare Turnstile CAPTCHAs?

Turnstile is browser-challenge based - it runs JavaScript and checks the output. With a properly patched Playwright context and residential proxy, Turnstile auto-solves without user interaction in most cases. If it still fires, CAPTCHA solving services like 2captcha and CapSolver both offer Turnstile-specific APIs with reasonable solve rates.

What is the ethical boundary for web scraping?

Scrape publicly accessible data, respect robots.txt for non-commercial research, do not scrape personal data without legal basis, and rate-limit requests to avoid degrading performance for real users. For commercial use, check the site terms of service and consider whether a data licensing agreement exists before building a production pipeline against the target.
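
Two of those rules are easy to enforce in code. The sketch below checks robots.txt with the standard library's urllib.robotparser and spaces requests with a simple limiter; the class name, the "my-bot" agent string, and the interval are illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Block until at least min_interval_s has passed since the last call."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval_s - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A typical loop calls rules.can_fetch("my-bot", url) before each request and limiter.wait() before sending it. If the site publishes a Crawl-delay, rules.crawl_delay("my-bot") returns it, and using that value as min_interval_s is a reasonable floor.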
