Handling CAPTCHA and Anti-Bot Measures on State Portals

This guide is part of the Headless Browser Fallback Strategies cluster within the broader Secretary of State Portal & API Ingestion discipline. Where the parent cluster specifies how to drive a browser as a last-resort ingestion tier, this page covers the one boundary that tier is never allowed to cross: an anti-bot challenge.

Scope

This page covers how an ingestion pipeline detects and classifies CAPTCHA, WAF, and behavioral anti-bot signals on state business registries, how it aligns its request fingerprint to avoid provoking those signals on portals that permit automated reads, and how it performs a deterministic hard stop with audit-grade escalation when a challenge is unavoidable. It deliberately excludes any technique for solving, bypassing, or defeating a CAPTCHA — token farms, automated solvers, residential-proxy rotation to evade reputation scoring. Those defeat an access control and are out of bounds for a compliance system. Backoff scheduling for the transient 429/503 signals that often accompany challenges lives in Async Polling & Rate Limiting; the categorization of each terminal outcome lives in Error Categorization & Retry Logic.

Why a CAPTCHA is a hard stop, not a puzzle

A CAPTCHA is an explicit access-control signal that the operator does not want the request automated. Programmatically defeating it pushes the read across the authorization line the Computer Fraud and Abuse Act (18 U.S.C. § 1030) draws around “exceeding authorized access,” and it typically violates the portal’s terms of service. That legal exposure is not worth the data, especially when the same status can be obtained through a sanctioned API tier or a human-in-the-loop fallback. The countervailing pressure is real: statutory filing windows such as Delaware’s annual report deadline (Del. Code tit. 8 § 502) and California’s annual Statement of Information for stock corporations (Cal. Corp. Code § 1502) mean a silently stale good-standing record can lead to administrative dissolution. The architecture resolves that tension by treating a challenge as a terminal, logged, escalated state: never a failure to retry into oblivion, and never a barrier to brute-force.

Anti-bot signal taxonomy

Detection is a first-class compliance control, not an afterthought. Four signal classes drive the routing decision:

HTTP status and header triggers. 403 Forbidden and 429 Too Many Requests frequently pair with Retry-After, cf-mitigated, or x-waf-challenge headers. A 302 to /captcha, /verify, or /challenge with randomized query parameters signals immediate session invalidation.
DOM-injected challenges. reCAPTCHA v2/v3, hCaptcha, and proprietary image grids are injected by JavaScript after load. Static selectors miss them; watch for iframe[src*="recaptcha"], .h-captcha, or #challenge-running appearing once document.readyState === "complete".
Fingerprint validation. WAFs (Cloudflare, Akamai, Imperva) compare the TLS client hello against Sec-CH-UA, Accept-Language, and User-Agent. A mismatch between transport fingerprint and HTTP headers triggers silent challenge injection before the first form submission — the single most common cause of “it works in my browser but not the script.”
Behavioral telemetry. Mouse trajectory, scroll depth, and navigator.webdriver presence are scored. A headless context that advertises automation is flagged on contact.

The first three classes are about not provoking a challenge on portals that tolerate automation; the fourth is the line the pipeline refuses to cross by faking. The taxonomy below maps the behavior onto the jurisdictions an entity-management team meets most often.

Jurisdiction	Portal	Dominant anti-bot mechanism	Pipeline disposition
Delaware (DE)	Division of Corporations	Session-token rotation + `429` under burst	Header-aligned request, then async backoff
California (CA)	bizfileOnline	Cloudflare managed challenge + `cf-mitigated`	Headless read; hard stop on interactive challenge
New York (NY)	DOS Corporation Search	Akamai sensor + behavioral scoring	Single-intent headless; no input spoofing
Texas (TX)	SOSDirect	Authenticated session + reCAPTCHA on login	Human-in-the-loop; never auto-solved
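
The HTTP status and header triggers in the taxonomy above can be checked with a pure function that looks only at the status code, the response headers, and the redirect target. The following is a minimal sketch rather than the module's own classifier: the Signal labels and the classify_http_signal helper are hypothetical names, and the header and path lists simply restate the examples given in this section.

```python
from enum import Enum


class Signal(str, Enum):
    # Hypothetical labels for illustration; not part of the module below.
    NONE = "none"
    THROTTLE = "throttle"    # 403/429 that carries Retry-After
    CHALLENGE = "challenge"  # WAF header or redirect to a challenge page
    BLOCK = "block"          # bare 403/429 with no retry hint


_CHALLENGE_HEADERS = ("cf-mitigated", "x-waf-challenge")
_CHALLENGE_PATHS = ("/captcha", "/verify", "/challenge")


def classify_http_signal(status: int, headers: dict[str, str]) -> Signal:
    """Classify a response from its status line and headers alone."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if any(name in h for name in _CHALLENGE_HEADERS):
        return Signal.CHALLENGE
    location = h.get("location", "").lower()
    if status == 302 and any(path in location for path in _CHALLENGE_PATHS):
        return Signal.CHALLENGE
    if status in (403, 429):
        return Signal.THROTTLE if "retry-after" in h else Signal.BLOCK
    return Signal.NONE
```

Keeping this check free of network and browser dependencies makes it easy to unit-test against captured response headers from each portal.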

Prerequisites

Python 3.10+ (structural pattern matching via match, X | Y union types).
httpx>=0.27 with HTTP/2 for the direct tier; playwright>=1.44 (Chromium) for the headless tier.
An append-only audit sink (DynamoDB with a conditional write, or S3 with Object Lock) for AuditRecord persistence.
A durable human-review queue (Redis, SQS) the pipeline can escalate terminal challenges to.
Outbound IP space that is not rotated to evade reputation scoring — one stable, attributable egress per jurisdiction.

Implementation: detect, align, hard-stop

The module below runs the direct tier with a fingerprint-consistent client, classifies any anti-bot signal, escalates to a single-intent headless read, and — critically — raises a terminal AntiBotChallenge the moment an interactive challenge survives into the rendered DOM rather than attempting to solve it. Every transition emits structured JSON, and every record carries a SHA-256 hash over its canonical payload. Compliance-critical lines are commented inline.

import hashlib
import json
import logging
import time
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Callable

import httpx
from playwright.sync_api import Page, sync_playwright

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("captcha_handler")


class ChallengeType(str, Enum):
    NONE = "none"
    HTTP_403 = "http_403"
    HTTP_429 = "http_429"
    DOM_RECAPTCHA = "dom_recaptcha"
    DOM_HCAPTCHA = "dom_hcaptcha"
    BEHAVIORAL_BLOCK = "behavioral_block"


class AntiBotChallenge(Exception):
    """Terminal: an interactive challenge was reached. Never solved in-band."""

    def __init__(self, entity_id: str, kind: ChallengeType) -> None:
        self.entity_id, self.kind = entity_id, kind
        super().__init__(f"{kind.value} on {entity_id}")


@dataclass(frozen=True)
class AuditRecord:
    entity_id: str
    jurisdiction: str
    timestamp: float
    challenge_type: ChallengeType
    resolution_method: str   # direct | headless | escalated_human_review | cache
    response_hash: str
    status_code: int


# Fingerprint consistency: the TLS client and these headers must agree, or a WAF
# injects a silent challenge before the first byte of HTML is parsed.
_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-CH-UA": '"Chromium";v="147", "Not_A Brand";v="8", "Google Chrome";v="147"',
    "Sec-CH-UA-Platform": '"macOS"',
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36"
    ),
}


def _hash(entity_id: str, jurisdiction: str, payload: bytes) -> str:
    # Canonical, tamper-evident provenance for every browser- or API-sourced read.
    return hashlib.sha256(
        f"{entity_id}:{jurisdiction}:{payload.hex()}".encode()
    ).hexdigest()


def _classify(resp: httpx.Response) -> ChallengeType:
    if resp.status_code == 403:
        return ChallengeType.HTTP_403
    if resp.status_code == 429:
        return ChallengeType.HTTP_429
    body = resp.text.lower()
    if "g-recaptcha" in body or "recaptcha" in body:
        return ChallengeType.DOM_RECAPTCHA
    if "h-captcha" in body or "hcaptcha" in body:
        return ChallengeType.DOM_HCAPTCHA
    return ChallengeType.NONE


class CaptchaAwareClient:
    def __init__(
        self,
        jurisdiction: str,
        base_url: str,
        escalate: Callable[[str, ChallengeType], None],
        cache_ttl: int = 300,
    ) -> None:
        self.jurisdiction = jurisdiction
        self.base_url = base_url.rstrip("/")
        self.escalate = escalate          # hand-off to the human-review queue
        self.cache_ttl = cache_ttl
        self._http = httpx.Client(http2=True, timeout=15.0, headers=_HEADERS)
        self._cache: dict[str, tuple[float, bytes]] = {}

    def _log(self, event: str, **fields: object) -> None:
        logger.info(json.dumps({"event": event, "jurisdiction": self.jurisdiction, **fields}))

    def _headless_read(self, url: str, entity_id: str) -> bytes:
        with sync_playwright() as p:
            # Surface automation honestly: launch without stealth flags such as
            # --disable-blink-features=AutomationControlled, which hides navigator.webdriver,
            # and do NOT spoof input telemetry to pass scoring.
            browser = p.chromium.launch(headless=True)
            ctx = browser.new_context(
                viewport={"width": 1280, "height": 800},
                user_agent=_HEADERS["User-Agent"],
                locale="en-US",
                timezone_id="America/New_York",
                ignore_https_errors=False,   # invalid government TLS must fail loudly
            )
            page: Page = ctx.new_page()
            try:
                page.goto(url, wait_until="networkidle", timeout=20_000)
                page.wait_for_timeout(2_000)  # allow deferred challenge injection
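                # If an interactive challenge rendered, this is terminal: stop here and
                # record which kind it was so the audit trail carries the precise type.
                if page.query_selector("iframe[src*='recaptcha']"):
                    raise AntiBotChallenge(entity_id, ChallengeType.DOM_RECAPTCHA)
                if page.query_selector(".h-captcha"):
                    raise AntiBotChallenge(entity_id, ChallengeType.DOM_HCAPTCHA)
                if page.query_selector("#challenge-running"):
                    raise AntiBotChallenge(entity_id, ChallengeType.BEHAVIORAL_BLOCK)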
                return page.content().encode("utf-8")
            finally:
                ctx.close()
                browser.close()

    def fetch(self, entity_id: str) -> AuditRecord:
        url = f"{self.base_url}/entity/{entity_id}"
        cached = self._cache.get(entity_id)
        if cached and time.time() - cached[0] < self.cache_ttl:
            self._log("cache_hit", entity_id=entity_id)
            return AuditRecord(entity_id, self.jurisdiction, time.time(),
                               ChallengeType.NONE, "cache",
                               _hash(entity_id, self.jurisdiction, cached[1]), 200)

        resp = self._http.get(url)
        kind = _classify(resp)

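        if kind is ChallengeType.NONE:
            # A non-challenge error (404, 5xx, unfollowed redirect) must fail loudly
            # rather than be hashed and cached as a good read.
            resp.raise_for_status()
            payload, method = resp.content, "direct"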
        else:
            # Any anti-bot signal invalidates a cached read and escalates the tier.
            self._cache.pop(entity_id, None)
            self._log("challenge_detected", entity_id=entity_id,
                      challenge_type=kind.value, status_code=resp.status_code)
            try:
                payload, method = self._headless_read(url, entity_id), "headless"
            except AntiBotChallenge as exc:
                # Terminal: log, escalate to human review, never solve in-band.
                self._log("captcha_terminal", entity_id=entity_id, challenge_type=exc.kind.value)
                self.escalate(entity_id, exc.kind)
                return AuditRecord(entity_id, self.jurisdiction, time.time(),
                                   exc.kind, "escalated_human_review",
                                   _hash(entity_id, self.jurisdiction, b""), 511)

        self._cache[entity_id] = (time.time(), payload)
        record = AuditRecord(entity_id, self.jurisdiction, time.time(), kind, method,
                             _hash(entity_id, self.jurisdiction, payload), 200)
        self._log("audit_record", **{k: (v.value if isinstance(v, Enum) else v)
                                      for k, v in asdict(record).items()})
        return record
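
The module leaves the escalate callable open. As a minimal sketch of what could be passed in, the factory below builds a callable with the escalate(entity_id, kind) signature and writes one JSON message per terminal challenge. The make_escalator name and the message fields are assumptions for illustration, and the in-process queue.Queue is only a stand-in for the durable Redis or SQS queue listed in the prerequisites.

```python
import json
import queue
import time
from enum import Enum

# Stand-in for a durable queue (Redis, SQS). An in-process queue is NOT durable;
# it is used here only so the sketch runs without external services.
review_queue: "queue.Queue[str]" = queue.Queue()


def make_escalator(jurisdiction: str, sink: "queue.Queue[str]" = review_queue):
    """Build a callable matching the escalate(entity_id, kind) signature."""

    def escalate(entity_id: str, kind: Enum) -> None:
        # One JSON message per terminal challenge, for a human operator to consume.
        sink.put(json.dumps({
            "entity_id": entity_id,
            "jurisdiction": jurisdiction,
            "challenge_type": kind.value,
            "queued_at": time.time(),
        }))

    return escalate
```

It would be wired in as CaptchaAwareClient("CA", base_url, escalate=make_escalator("CA")), where base_url is whatever registry endpoint the deployment targets.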

Configuration reference

Parameter	Default	Justification
`cache_ttl`	`300` s	Bounds staleness of a good-standing read against statutory accuracy duties; short enough that a status change is reflected before a downstream filing decision.
`timeout` (httpx)	`15.0` s	Slow government portals need headroom; longer values mask a hung session that should escalate.
`wait_until`	`networkidle`	Anti-bot scripts inject after initial paint; idle-network is the earliest safe point to test for a challenge frame.
challenge settle	`2_000` ms	Window for deferred reCAPTCHA/hCaptcha injection before classification; too short yields false negatives.
HTTP `511` code	n/a (synthetic)	Assigned by the pipeline, not returned by the portal; marks `escalated_human_review` records as “Network Authentication Required” (RFC 6585), a defensible status for an unbypassed challenge.
egress IP per jurisdiction	`1` stable	Attributable, non-rotating egress; rotating to dodge reputation scoring is an evasion technique this pipeline forbids.

Failure modes and fallback routing

Each outcome maps onto a category in the parent cluster’s Error Categorization & Retry Logic taxonomy, which decides whether to retry, escalate, or shelve.

403/429 with Retry-After. Transient throttling, not a wall. Categorize as retryable-rate; do not open a browser. Hand the entity to the backoff scheduler in Async Polling & Rate Limiting and purge any cached read so the next attempt is fresh.
Interactive challenge survives into the DOM. Terminal. _headless_read raises AntiBotChallenge; the handler writes an escalated_human_review record with status 511, queues the entity for an operator, and never retries automatically — a second machine attempt only deepens the reputation penalty.
Silent WAF block (empty body, no challenge frame). Almost always a fingerprint mismatch between the TLS client and the Sec-CH-UA/User-Agent headers. Categorize as configuration, align the headers to one known-good browser identity, and replay once; if it recurs, escalate rather than mutate the fingerprint further.
Session-token rotation mid-flow. __cf_bm or JSESSIONID rotates between request and render, voiding the read. Discard the context, reinitialize a single-intent session, and cap to one re-init — repeated rotation is itself a challenge signal that should escalate.
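
The first three failure modes above reduce to a small routing decision. Note that the fetch method shown earlier sends every non-NONE signal to the headless tier; the sketch below illustrates one way to honor Retry-After first, as the list describes. The retry_after_seconds and route helpers and the disposition labels are hypothetical, and session-token rotation is left out because it needs state across attempts.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse Retry-After (delta-seconds or HTTP-date) into a non-negative delay."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: the caller falls back to its own schedule
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def route(status: int, retry_after: Optional[str],
          challenge_in_dom: bool, body_empty: bool) -> str:
    """Map an outcome to a disposition label (labels are illustrative)."""
    if challenge_in_dom:
        return "escalate_human_review"        # terminal, never retried automatically
    if status in (403, 429) and retry_after_seconds(retry_after) is not None:
        return "backoff"                      # retryable-rate: do not open a browser
    if body_empty:
        return "fix_fingerprint_replay_once"  # configuration
    return "proceed"
```

Checking the rendered-challenge flag first keeps the terminal case from ever being downgraded to a retry.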

Frequently Asked Questions

Why not just call a CAPTCHA-solving service to keep the read going?

Because it defeats an access control the portal deliberately put in place, which moves the request across the authorization line the Computer Fraud and Abuse Act draws and typically breaches the site’s terms of service. For a compliance system the legal exposure dwarfs the value of one status field. The pipeline logs captcha_terminal, escalates to a human-review queue, and obtains the record through a sanctioned path instead.

How do I tell a real CAPTCHA apart from a transient rate-limit block?

Classify on signal, not status code alone. A 429 or 403 carrying Retry-After is transient throttling: route it to async backoff. A 302 to /challenge, a cf-mitigated header, or a recaptcha/hcaptcha iframe surviving into the rendered DOM is an interactive challenge: that is terminal and escalates. In the module above, _classify inspects only the status code and body markers and the headless DOM probe catches rendered challenges, so the Retry-After, cf-mitigated, and redirect checks need to run alongside them before any routing decision is made.

Most of my blocks are silent — empty bodies, no visible challenge. What now?

That is the signature of a fingerprint mismatch. The WAF compares your TLS client hello against the Sec-CH-UA, Accept-Language, and User-Agent headers; if they disagree it returns a stripped page with no challenge frame. Align all three to a single, current browser identity (the _HEADERS block) and confirm the same identity is reused by the Playwright context. Do not paper over it by rotating IPs.
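
Header-level agreement can be checked before any request is sent. The sketch below is a hedged illustration: identity_mismatches is a hypothetical helper, the platform map covers only the three platforms named in it, and it says nothing about the TLS client hello, which cannot be verified from headers alone.

```python
import re


def identity_mismatches(headers: dict[str, str]) -> list[str]:
    """Return readable inconsistencies between User-Agent and the client hints."""
    ua = headers.get("User-Agent", "")
    ch_ua = headers.get("Sec-CH-UA", "")
    platform = headers.get("Sec-CH-UA-Platform", "").strip('"')
    problems: list[str] = []

    # The Chrome major version must match in both places.
    ua_major = re.search(r"Chrome/(\d+)", ua)
    hint_major = re.search(r'"Google Chrome";v="(\d+)"', ch_ua)
    if not ua_major or not hint_major:
        problems.append("missing Chrome version in User-Agent or Sec-CH-UA")
    elif ua_major.group(1) != hint_major.group(1):
        problems.append(f"version {ua_major.group(1)} != {hint_major.group(1)}")

    # Only the platforms named here are checked; extend the map as needed.
    ua_tokens = {"macOS": "Macintosh", "Windows": "Windows NT", "Linux": "Linux"}
    token = ua_tokens.get(platform)
    if token is None or token not in ua:
        problems.append(f"platform {platform!r} not reflected in User-Agent")

    if not headers.get("Accept-Language"):
        problems.append("Accept-Language is missing")
    return problems
```

Running it against the _HEADERS block at start-up, and failing fast on a non-empty result, catches the common case where one header is updated to a new browser version and the others are forgotten.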

How is an escalated record kept audit-defensible if no data was retrieved?

The AuditRecord for an escalation still carries the entity id, jurisdiction, timestamp, the precise challenge_type, and resolution_method="escalated_human_review", hashed and written to the append-only sink with status 511. A regulator or examiner can replay the structured log to prove the system detected the boundary, stopped lawfully, and handed the obligation to a human — which is itself evidence of due diligence.
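
Replaying a record includes recomputing its hash. As a minimal sketch, the helper below restates the canonical form used by the module's _hash function and compares it against a stored value; record_hash and verify_record are hypothetical names introduced here.

```python
import hashlib
import hmac


def record_hash(entity_id: str, jurisdiction: str, payload: bytes) -> str:
    # Same canonical form as the module's _hash helper.
    return hashlib.sha256(
        f"{entity_id}:{jurisdiction}:{payload.hex()}".encode()
    ).hexdigest()


def verify_record(entity_id: str, jurisdiction: str,
                  payload: bytes, stored_hash: str) -> bool:
    """Recompute the hash and compare it to the stored value in constant time."""
    return hmac.compare_digest(record_hash(entity_id, jurisdiction, payload), stored_hash)
```

For an escalated record the payload is empty, so the hash binds only the entity id and jurisdiction. The timestamp and challenge_type are not covered by it, and their integrity rests on the append-only sink.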
