Parsing Inconsistent HTML Tables From Legacy State Portals

This guide is part of the Pagination Handling for Bulk Records area within the Secretary of State Portal & API Ingestion framework: each page you sweep returns a block of HTML, and before any of those entities can be counted toward a completeness gate they have to be parsed into structured rows — reliably, across portals that never agreed on a table schema.

Scope of This Page

This page covers turning one page of legacy-portal HTML into a list of typed compliance records: how to reconstruct a malformed DOM, how to resolve rowspan/colspan into a rectangular grid, how to detect the header row when the markup never uses <th>, and how to bind each extracted record to a tamper-evident hash. It deliberately excludes the surrounding machinery documented elsewhere — the cursor-state and completeness logic of the parent area that decides which pages to fetch and when you have them all, the headless browser fallback chain that takes over when a table is rendered by client-side JavaScript rather than served as HTML, and the deterministic Error Categorization & Retry Logic taxonomy that classifies why a fetch failed. Here we assume the HTML is already in hand and focus only on extracting trustworthy rows from it.

The Constraint That Forces Deterministic Parsing

Misreading a table is not a cosmetic bug — it manufactures a false compliance fact. Under the Model Business Corporation Act § 16.01 and each state’s foreign-qualification statute, legal operations must accurately capture an entity’s standing, registered agent, and next report date. A parser that lets a rowspan="3" merged “Active / Good Standing” cell bleed into the next entity’s row will silently report a delinquent company as active, suppressing the deadline that would have driven its State Filing Deadline Calendars entry. In Delaware that error costs a $200 penalty plus 1.5% monthly interest under 8 Del. C. § 510 and, eventually, administrative dissolution; in California it triggers suspension of corporate powers under Cal. Corp. Code § 2205. Because the downstream cost is a missed statutory filing, the parser must fail loud — raising on a structurally impossible table — rather than emit a plausible-but-wrong row that flows unchecked into the Compliance Metadata Schemas of record.

Prerequisites

Python 3.10+ — for X | Y unions, match statements, and modern typing.
beautifulsoup4 4.12+ with the html5lib parser installed. html5lib applies the WHATWG tree-construction rules, so it rebuilds rows whose closing </tr> tags are missing, along with stray markup, the way a browser would, instead of truncating the table the way the stdlib parser does.
Standard library only beyond that: hashlib, json, logging, re, dataclasses, datetime.
An append-only audit sink (write-once Postgres table, or object storage with Object Lock) to hold the raw-HTML snapshot alongside each extracted record.
The raw HTML and a known entity_id for the page — produced upstream by the parent area’s fetch loop.

Implementation: A Deterministic Table Parser

The module below normalises the DOM with html5lib, walks every <tr> into a virtual grid that carries merged cells downward, detects the header row by content rather than tag name, and binds the extracted payload to a SHA-256 chain so the record is tamper-evident. The design rule throughout is resolve structure before reading values: never index a fixed column, because the column that holds “Status” in Delaware holds “Agent” in Texas. Comments mark the compliance-critical lines.

from __future__ import annotations

import hashlib
import json
import logging
import re
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

from bs4 import BeautifulSoup, Tag

# Structured JSON logging — every line is a parseable audit/observability event.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ingestion.table_parser")

# Header synonyms vary by portal; we match canonical fields against any known alias.
HEADER_ALIASES: dict[str, frozenset[str]] = {
    "status": frozenset({"status", "standing", "entity status", "status detail"}),
    "registered_agent": frozenset({"registered agent", "agent", "agent name", "ra"}),
    "next_filing_date": frozenset({"next report", "due date", "next filing date", "report due"}),
}


@dataclass(frozen=True)
class ComplianceRecord:
    entity_id: str
    jurisdiction: str
    status: str
    registered_agent: str
    next_filing_date: Optional[str]
    extraction_timestamp: str
    raw_html_hash: str          # hash of the exact bytes parsed — pins the record to its source
    audit_signature: str        # binds payload + raw hash + time into one tamper-evident digest


class PortalTableParser:
    """Deterministic extractor for structurally inconsistent legacy registry tables."""

    def __init__(self, jurisdiction: str) -> None:
        self.jurisdiction = jurisdiction

    @staticmethod
    def _resolve_span_matrix(table: Tag) -> list[list[str]]:
        """Flatten a table into a rectangular grid, carrying rowspan/colspan correctly.

        Legacy portals merge 'Active / Good Standing / Tax Paid' under one rowspan cell;
        if that value is not carried DOWN into later rows, every column after it shifts
        left and a delinquent entity reads as active. This is the core correctness step.
        """
        grid: list[list[str]] = []
        # col_index -> (value, rows_still_to_fill) for cells spilling down from above.
        carry: dict[int, tuple[str, int]] = {}

        for tr in table.find_all("tr"):
            row: list[str] = []
            col = 0
            for cell in tr.find_all(["td", "th"]):
                # Emit any cells still spilling down from a rowspan above this position.
                while col in carry:
                    value, remaining = carry[col]
                    row.append(value)
                    if remaining <= 1:
                        carry.pop(col, None)
                    else:
                        carry[col] = (value, remaining - 1)
                    col += 1

                text = re.sub(r"\s+", " ", cell.get_text(separator=" ", strip=True))
                rowspan = max(1, int(cell.get("rowspan", 1)))
                colspan = max(1, int(cell.get("colspan", 1)))
                for c in range(colspan):
                    row.append(text)
                    if rowspan > 1:
                        carry[col + c] = (text, rowspan - 1)
                col += colspan

            # Drain trailing rowspans that fall to the right of the last real cell.
            while col in carry:
                value, remaining = carry[col]
                row.append(value)
                if remaining <= 1:
                    carry.pop(col, None)
                else:
                    carry[col] = (value, remaining - 1)
                col += 1
            grid.append(row)
        return grid

    @staticmethod
    def _detect_header(grid: list[list[str]]) -> tuple[int, dict[str, int]]:
        """Find the header row by CONTENT, not by <th>: legacy portals rarely use <th>."""
        for idx, row in enumerate(grid):
            mapping: dict[str, int] = {}
            for col, cell in enumerate(row):
                key = cell.strip().lower()
                for field, aliases in HEADER_ALIASES.items():
                    if key in aliases and field not in mapping:
                        mapping[field] = col
            # A real header row resolves at least the status column.
            if "status" in mapping:
                return idx, mapping
        raise ValueError("no header row detected — schema drift or wrong table")

    def _audit(self, raw_html: str, payload: dict[str, object], timestamp: str) -> tuple[str, str]:
        """Return (raw_hash, signature) binding raw HTML, payload, and timestamp.

        The caller supplies the timestamp it also stores on the record, so the
        signature can be recomputed later from the record and the raw snapshot.
        """
        raw_hash = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
        body = json.dumps({"raw": raw_hash, "payload": payload}, sort_keys=True)
        signature = hashlib.sha256(f"{body}:{timestamp}".encode("utf-8")).hexdigest()
        return raw_hash, signature

    def parse(self, raw_html: str, entity_id: str) -> ComplianceRecord:
        # html5lib rebuilds rows whose closing </tr> tags are missing, the way a browser
        # does, instead of truncating the table the way the stdlib parser silently would.
        soup = BeautifulSoup(raw_html, "html5lib")
        table = soup.find("table")
        if not isinstance(table, Tag):
            raise ValueError("no <table> in response; likely a JS-rendered portal; fall back")

        grid = self._resolve_span_matrix(table)
        header_idx, cols = self._detect_header(grid)
        data_rows = grid[header_idx + 1:]
        if not data_rows:
            raise ValueError("header present but no data rows after span resolution")
        # Only the first data row is read: one page carries one known entity_id.
        row = data_rows[0]

        # FAIL LOUD: a ragged row means a span resolved wrong; emitting it would shift columns.
        if any(idx >= len(row) for idx in cols.values()):
            raise ValueError(f"ragged row {len(row)} cols vs header {max(cols.values()) + 1}")

        def col(field: str) -> Optional[str]:
            return row[cols[field]] if field in cols else None

        payload = {
            "entity_id": entity_id,
            "jurisdiction": self.jurisdiction,
            "status": col("status") or "UNKNOWN",
            "registered_agent": col("registered_agent") or "UNKNOWN",
            "next_filing_date": col("next_filing_date"),
        }
        # One timestamp feeds both the signature and the stored record; two separate
        # now() calls would leave a signature that can never be recomputed.
        extracted_at = datetime.now(timezone.utc).isoformat()
        raw_hash, signature = self._audit(raw_html, payload, extracted_at)
        logger.info(json.dumps({
            "event": "table_extracted", "entity_id": entity_id,
            "jurisdiction": self.jurisdiction, "status": payload["status"],
            "audit_sig": signature[:12],
        }))
        return ComplianceRecord(
            **payload,
            extraction_timestamp=extracted_at,
            raw_html_hash=raw_hash,
            audit_signature=signature,
        )

The parser is deliberately a leaf: it makes no fetch, no retry, and no fallback decision. It consumes the raw HTML string plus an entity id, returns one typed ComplianceRecord built from the first data row after the header, and raises a precise ValueError the moment the structure is impossible, handing the verdict to the surrounding orchestration rather than guessing. Matching headers against an alias set (rather than fixed indices) is what lets the same code read a Delaware table whose status column is third and a Texas table whose status column is fifth.

Configuration Reference

The fields that vary between portals are data, not branches in the parse loop, because they are dictated by each registry’s markup rather than by your code.

| Parameter | Suggested value | Operational justification |
| --- | --- | --- |
| `html5lib` parser | required | Applies WHATWG tree construction; rebuilds rows whose `</tr>`/`</td>` closing tags are missing instead of truncating, the failure mode of the stdlib parser. |
| `HEADER_ALIASES["status"]` | content match | Header row is found by text, not `<th>`; legacy ASP.NET/ColdFusion tables emit bold `<td>` headers with no `<th>` at all. |
| ragged-row guard | raise | A row narrower than the header means a span resolved wrong; emitting it shifts every later column and falsifies status. |
| whitespace normalisation | `\s+` → single space | California injects `<br>` and stray newlines inside cells; collapse before comparison or status strings never match. |
| `raw_html_hash` retention | per record | Pins each record to the exact HTML parsed so counsel can replay the extraction during a statutory review. |
| empty-status default | `"UNKNOWN"` | Never coerce a missing status to “Active”; an unknown must surface for human review, not pass a compliance gate. |
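
The `raw_html_hash` retention row can be checked mechanically at review time. The sketch below is an illustration only, not part of the parser: `snapshot_matches` is a hypothetical helper, and it assumes the audit sink returns the stored snapshot as the same string that was parsed and that the digest is SHA-256 over its UTF-8 encoding, as `_audit` computes it.

```python
import hashlib
import hmac


def snapshot_matches(stored_html: str, recorded_hash: str) -> bool:
    """Return True when the stored snapshot still hashes to the recorded digest."""
    recomputed = hashlib.sha256(stored_html.encode("utf-8")).hexdigest()
    # compare_digest runs in constant time, so it does not leak where digests differ.
    return hmac.compare_digest(recomputed, recorded_hash)


# Invented sample markup standing in for a stored portal snapshot.
html = "<table><tr><td>Status</td></tr><tr><td>Active</td></tr></table>"
digest = hashlib.sha256(html.encode("utf-8")).hexdigest()

print(snapshot_matches(html, digest))
print(snapshot_matches(html.replace("Active", "Delinquent"), digest))
```

A mismatch says only that the snapshot and the record no longer agree; it does not say which of the two was altered, which is why the sink itself needs to be append-only.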

Jurisdiction-specific table quirks

| Jurisdiction | Portal | Table quirk | Handling |
| --- | --- | --- | --- |
| Delaware | Division of Corporations | `rowspan="3"` merges status / standing / franchise-tax into one cell | Span matrix carries the value down into all three rows |
| California | BizFile | `<br>` tags and `display:none` spans inside `<td>` | `get_text(separator=" ")` + `\s+` collapse before matching |
| New York | DOS Entity Search | missing `</tr>` closing tags; relies on browser tolerance | html5lib reconstructs the row boundaries |
| Texas | SOSDirect | status flattened into discrete `<td>`s, no `<th>` header | content-based `HEADER_ALIASES` detection |
Failure Modes and Fallback Routing

Each fault maps onto the four-tier scheme defined in the parent area’s error categorization & retry logic — transient, statutory, data-validation, and system — and the parser responds to each differently.

No <table> in the response (system / structural). Portals increasingly render results with client-side JavaScript, so the HTML carries an empty container. parse raises immediately; the orchestrator does not retry the same transport but routes the entity to the headless browser fallback chain, which executes the script and returns rendered HTML for a second parse attempt.
Ragged matrix after span resolution (data-validation). A row that ends up narrower than the header means a rowspan/colspan was malformed or a cell was dropped. The guard raises rather than emitting a column-shifted row; the entity is quarantined for human review instead of being written to the compliance metadata schemas with a falsified status.
Header row not detected, status alias never matches (system / schema drift). When a portal renames its columns or restructures the page, _detect_header raises. Compare the raw-HTML snapshot against the stored baseline: a hash mismatch shows only that the page changed, so diff the markup itself to confirm a schema change. A confirmed change is an alert-and-patch event (extend HEADER_ALIASES), not a retry.
Status string present but unrecognised vocabulary (statutory). A cell reads “Forfeited – Failure to File” where the model expected “Delinquent”. The record extracts cleanly but the status fails downstream validation against jurisdictional nomenclature; it is routed to review so a real administrative-dissolution risk is never mapped to “Active”.
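
The routing for the first three faults could be sketched as follows. Everything here is hypothetical: the `Route` enum and `route_parse_failure` are illustrative names, the message prefixes are the ones used by the parser's `ValueError`s, and sending "header present but no data rows" to quarantine is an assumption rather than something the scheme above specifies. The fourth fault is not raised by the parser at all, so it is handled by downstream validation and does not appear here.

```python
from enum import Enum


class Route(Enum):
    HEADLESS_FALLBACK = "headless_fallback"
    QUARANTINE = "quarantine"
    ALERT_AND_PATCH = "alert_and_patch"


# Prefixes of the messages the parser raises. Matching on text is brittle;
# dedicated exception subclasses would likely be a sturdier choice.
_PREFIX_ROUTES: tuple[tuple[str, Route], ...] = (
    ("no <table>", Route.HEADLESS_FALLBACK),
    ("ragged row", Route.QUARANTINE),
    ("header present but no data rows", Route.QUARANTINE),  # assumed routing
    ("no header row detected", Route.ALERT_AND_PATCH),
)


def route_parse_failure(error: ValueError) -> Route:
    """Map a parser ValueError to the action the orchestrator should take."""
    message = str(error)
    for prefix, route in _PREFIX_ROUTES:
        if message.startswith(prefix):
            return route
    # Unknown structural failure: hold for review rather than guess.
    return Route.QUARANTINE


print(route_parse_failure(ValueError("ragged row 3 cols vs header 4")))
```

None of the routes is a retry, which matches the text: each of these faults is structural, so fetching the same bytes again would fail the same way.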

Frequently Asked Questions

Why match headers against an alias set instead of just reading fixed column positions?

Because no two legacy portals put the same field in the same column. Delaware’s status is the third cell; Texas flattens it to the fifth; California buries it behind a display:none span. A fixed index that works on one portal silently reads the wrong field on another and emits a confident, wrong status. Resolving columns by header content — matched against a synonym set per canonical field — lets one parser cover every jurisdiction and surfaces a schema change as a clean “header not detected” error instead of corrupt data.
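
A small illustration of the point, using a trimmed alias table and two invented header rows (the real portals' column names may differ). `map_columns` is a hypothetical stand-alone version of the matching loop in `_detect_header`.

```python
HEADER_ALIASES = {
    "status": frozenset({"status", "standing", "entity status"}),
    "registered_agent": frozenset({"registered agent", "agent"}),
}


def map_columns(header_row: list[str]) -> dict[str, int]:
    """Map each canonical field to the index of the first header cell naming it."""
    mapping: dict[str, int] = {}
    for index, cell in enumerate(header_row):
        key = cell.strip().lower()
        for field, aliases in HEADER_ALIASES.items():
            if key in aliases and field not in mapping:
                mapping[field] = index
    return mapping


# Invented layouts: status is the third cell in one, the fifth in the other.
delaware = ["File Number", "Entity Name", "Status", "Registered Agent"]
texas = ["Filing No.", "Name", "Type", "Agent", "Standing"]

print(map_columns(delaware))  # {'status': 2, 'registered_agent': 3}
print(map_columns(texas))     # {'registered_agent': 3, 'status': 4}
```

A hard-coded index of 2 would read "Type" as the status on the second layout; the alias lookup returns 4 instead, and returns no `status` key at all if the column is renamed to something unknown.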

Why html5lib rather than the faster lxml or the stdlib parser?

Legacy ASP.NET and ColdFusion tables routinely omit closing </tr> and </td> tags and rely on browser tolerance to render anyway. The stdlib parser truncates such a table at the first malformed boundary, silently dropping rows; lxml recovers some but not all. html5lib implements the full WHATWG tree-construction algorithm — the same rules a browser uses — so it reconstructs the row structure the portal’s authors actually intended. For correctness on government legacy markup, the slower parse is worth it.

A merged cell spans three rows — how do I keep it from shifting the columns below it?

The _resolve_span_matrix step tracks every rowspan in a carry map keyed by column index and emits the carried value into each subsequent row before reading that row’s own cells. Without this, a rowspan="3" “Good Standing” cell occupies one physical <td> but three logical rows; the two rows beneath it are short one column, so every field to its right shifts left and a different entity inherits the standing. The carry map is the single most correctness-critical part of the parser.
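
The carry map can be traced without any HTML. The sketch below is a simplified stand-in, not the parser's own method: `resolve_spans` is a hypothetical function that takes each cell as a `(text, rowspan, colspan)` tuple, and the three rows are invented data.

```python
Cell = tuple[str, int, int]  # (text, rowspan, colspan)


def resolve_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand rowspan/colspan into a rectangular grid of strings."""
    grid: list[list[str]] = []
    carry: dict[int, tuple[str, int]] = {}  # column -> (value, rows still to fill)

    def drain(row: list[str], col: int) -> int:
        """Emit carried values starting at col; return the next free column."""
        while col in carry:
            value, remaining = carry[col]
            row.append(value)
            if remaining <= 1:
                del carry[col]
            else:
                carry[col] = (value, remaining - 1)
            col += 1
        return col

    for cells in rows:
        row: list[str] = []
        col = 0
        for text, rowspan, colspan in cells:
            col = drain(row, col)          # fill from above before placing this cell
            for offset in range(colspan):
                row.append(text)
                if rowspan > 1:
                    carry[col + offset] = (text, rowspan - 1)
            col += colspan
        drain(row, col)                    # rowspans to the right of the last cell
        grid.append(row)
    return grid


rows = [
    [("Acme LLC", 1, 1), ("Good Standing", 3, 1), ("2025-03-01", 1, 1)],
    [("Beta Inc", 1, 1), ("2025-06-01", 1, 1)],   # physically two cells
    [("Gamma Co", 1, 1), ("2025-09-01", 1, 1)],   # physically two cells
]
for line in resolve_spans(rows):
    print(line)
```

Every output row has three columns, with "Good Standing" in the middle of each. Without the carry, the second and third rows would have only two columns and the date would land in the status position.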

Why raise on a ragged row instead of padding it and moving on?

Because padding hides the only signal that something went wrong. A row narrower than the header means a span resolved incorrectly or a cell was lost — and the values present are no longer aligned to their headers. Padding it produces a record that looks complete but maps “Delinquent” onto the wrong column. Raising sends the entity to quarantine for human review, where a real penalty risk is caught, rather than letting a falsified “Active” pass a compliance gate.
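
The difference is easy to see on a single damaged row. This is an illustrative sketch with invented data; `require_aligned` and `pad` are hypothetical helpers, and the guard mirrors the check in `parse` rather than replacing it.

```python
def require_aligned(row: list[str], columns: dict[str, int]) -> list[str]:
    """Return the row untouched, or raise when it is too short for the header map."""
    needed = max(columns.values()) + 1
    if len(row) < needed:
        raise ValueError(f"ragged row {len(row)} cols vs header {needed}")
    return row


def pad(row: list[str], width: int) -> list[str]:
    """The tempting alternative: right-pad with blanks until the row appears to fit."""
    return row + [""] * (width - len(row))


# Header is [file number, name, status, agent]; the name cell was lost upstream.
columns = {"status": 2, "registered_agent": 3}
damaged = ["1234567", "Delinquent", "CT Corporation"]

padded = pad(damaged, 4)
print(padded[columns["status"]])   # the agent's name now sits in the status column

try:
    require_aligned(damaged, columns)
except ValueError as exc:
    print(f"quarantine: {exc}")
```

The padded row has the right width, so nothing downstream can tell that "Delinquent" has slid out of the status column. The guard refuses the row before that can happen.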
