Classifying Entities Into a Deterministic, Single-Intent Filing Taxonomy

Corporate portfolios rarely conform to a single regulatory template, yet every downstream automation depends on each entity resolving to exactly one filing obligation. This guide is part of the Core Architecture & Regulatory Mapping framework, and it owns the classification stage: the deterministic control point that ingests a raw entity record, normalizes it, and assigns the single statutory obligation that governs its annual filing before any deadline or portal logic runs.

The engineering problem is precise. Given a heterogeneous stream of entity records — Delaware C-Corps, California LLCs, foreign-qualified registrations, pass-through partnerships — produce a classification that is idempotent (the same input always yields the same class), auditable (every decision records the rule that produced it), and unambiguous (a record either resolves to one obligation or it halts for review). Get this wrong and the error propagates: a misclassified entity inherits the wrong deadline calendar, the wrong fee schedule, and the wrong portal submission path.

Statutory and Regulatory Context

Classification is not a convenience layer; the obligation it assigns is dictated by statute and differs by entity type within the same jurisdiction. In Delaware, a stock corporation owes an annual report and franchise tax under DGCL § 502, while a Delaware LLC owes a flat $300 annual tax under 6 Del. C. § 18-1107 and files no annual report at all. In California, an LLC files a Statement of Information under Cal. Corp. Code § 17702.09, a stock corporation files under § 1502, and both face administrative forfeiture for non-filing. A classifier that treats “Delaware entity” or “California entity” as a single class will misroute every entity whose type differs from the one its default rule assumes.

Jurisdiction codes themselves must be normalized against a published standard so that “DE”, “Delaware”, and “US-DE” collapse to one canonical value. This taxonomy standardizes on ISO 3166-2 subdivision codes, and entity-type aliases resolve against a canonical registry aligned with the federal entity classifications the IRS uses when assigning employer identification numbers. The output of this stage is the typed input contract that Compliance Metadata Schemas validate and that State Filing Deadline Calendars turn into dated obligations.

Architecture and Design Model

The classifier is a stateless, idempotent transformation with three stages in strict sequence: normalize → classify → route. Each stage is lossless and pure, so the same record produces the same class on every run, and a failure at any stage routes the record to a tiered fallback rather than corrupting the obligation downstream.

The classification stage as a stateless transform: a raw payload is normalized and classified into one obligation, then the router converts its confidence score into Tier 1 execution, Tier 2 validation, or a Tier 3 halt — only Tier 1 emits the result downstream.

Three design decisions make the stage production-grade:

Schema-first normalization. Raw input is parsed into a typed model before any rule evaluates it. Malformed jurisdiction codes or unknown aliases raise immediately, so classification logic never sees an invalid record.
Single-intent resolution. A record must resolve to exactly one primary obligation. Conflicting attributes — a Delaware formation paired with a California LLC operating agreement — trigger a pre-classification halt rather than a best-guess assignment.
Confidence-tiered routing. Every classification carries a confidence score, and the router uses thresholds to send high-confidence records straight to execution while diverting ambiguous ones to validation or human review.

Prerequisites and Dependencies

Dependency	Minimum	Role in the classifier
Python	3.10+	`match`/union-type syntax, `dataclass(frozen=True)` results
Pydantic	v2	Typed payload parsing and field-level normalization
A canonical jurisdiction registry	ISO 3166-2 subset	Maps free-text state values to `US-DE`, `US-CA`, etc.
An entity-alias map	maintained record	Resolves `c-corp`, `llc`, `lp` to canonical `EntityType`
Structured logging	stdlib `logging` + JSON formatter	Audit-grade event trail for every decision
A message bus (production)	SQS / Kafka	Async validation queue and dead-letter routing

The classifier itself has no network dependency — it is pure CPU and deterministic, which is what makes it cheap to unit-test and safe to run inside multi-entity batch orchestration without rate-limit concerns.

Step-by-Step Implementation

Phase 1 — Deterministic Normalization

Raw entity data arriving from ERP systems, HRIS platforms, or manual intake contains inconsistent casing, malformed jurisdiction codes, and legacy aliases. The normalization layer executes a strict, lossless transformation before any classification logic evaluates the payload, standardizing jurisdiction codes to ISO 3166-2 and resolving entity-type aliases against the canonical registry.

from __future__ import annotations
import re
import logging
from enum import Enum
from typing import Optional
from pydantic import BaseModel, field_validator

logger = logging.getLogger("compliance.taxonomy")

class JurisdictionCode(str, Enum):
    DE = "US-DE"
    CA = "US-CA"
    NY = "US-NY"
    TX = "US-TX"
    # Extend with the full ISO 3166-2 registry in production

class EntityType(str, Enum):
    DOMESTIC_CORP = "domestic_c_corp"
    FOREIGN_QUALIFIED = "foreign_qualified"
    LLC = "limited_liability_company"
    PARTNERSHIP = "partnership"

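# Keys are uppercase because from_raw() uppercases the input before lookup.
# The ISO 3166-2 codes map to themselves so already-canonical input passes through.
_JURISDICTION_MAP = {
    "DE": JurisdictionCode.DE, "DELAWARE": JurisdictionCode.DE, "US-DE": JurisdictionCode.DE,
    "CA": JurisdictionCode.CA, "CALIFORNIA": JurisdictionCode.CA, "US-CA": JurisdictionCode.CA,
    "NY": JurisdictionCode.NY, "NEW YORK": JurisdictionCode.NY, "US-NY": JurisdictionCode.NY,
    "TX": JurisdictionCode.TX, "TEXAS": JurisdictionCode.TX, "US-TX": JurisdictionCode.TX,
}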

_ENTITY_ALIAS_MAP = {
    "c-corp": EntityType.DOMESTIC_CORP,
    "llc": EntityType.LLC,
    "foreign": EntityType.FOREIGN_QUALIFIED,
    "lp": EntityType.PARTNERSHIP,
}

class RawEntityPayload(BaseModel):
    entity_name: str
    formation_state: str
    entity_type_raw: str
    ein_prefix: Optional[str] = None
    fiscal_year_end_month: Optional[int] = None

class NormalizedEntity(BaseModel):
    entity_id: str
    canonical_name: str
    jurisdiction_iso: JurisdictionCode
    entity_type: EntityType
    ein_prefix: Optional[str] = None
    fiscal_year_end_month: Optional[int] = None

    @field_validator("canonical_name")
    @classmethod
    def normalize_name(cls, v: str) -> str:
        return re.sub(r"\s+", " ", v.strip().upper())

    @classmethod
    def from_raw(cls, entity_id: str, raw: RawEntityPayload) -> "NormalizedEntity":
        """Normalize a RawEntityPayload into a typed NormalizedEntity."""
        jkey = re.sub(r"\s+", " ", raw.formation_state.strip().upper())
        jurisdiction = _JURISDICTION_MAP.get(jkey)
        if jurisdiction is None:
            raise ValueError(f"Unsupported jurisdiction code: {jkey!r}")

        etype_key = raw.entity_type_raw.strip().lower()
        entity_type = _ENTITY_ALIAS_MAP.get(etype_key)
        if entity_type is None:
            raise ValueError(f"Unrecognized entity type alias: {etype_key!r}")

        return cls(
            entity_id=entity_id,
            canonical_name=raw.entity_name,
            jurisdiction_iso=jurisdiction,
            entity_type=entity_type,
            ein_prefix=raw.ein_prefix,
            fiscal_year_end_month=raw.fiscal_year_end_month,
        )

Normalization uses a factory classmethod (from_raw) rather than a validator on fields belonging to a different model. Validators on NormalizedEntity only target fields that exist on that class, and any failure raises ValueError before the object is constructed, routing the exception to the validation dead-letter queue.

Phase 2 — Single-Intent Classification

The taxonomy maps structural attributes to regulatory obligations. Domestic corporations, foreign-qualified entities, limited liability companies, and pass-through structures each trigger a distinct obligation. The classifier uses a rule-based decision tree augmented by a lightweight probabilistic fallback for edge-case descriptions. A full rule set evaluates entity type, jurisdiction, EIN prefix, and fiscal year-end to assign a single obligation; the skeleton below branches on entity type alone and sends anything without an explicit rule to the low-confidence fallback.

from dataclasses import dataclass

class ClassificationIntent(str, Enum):
    ANNUAL_REPORT = "annual_report"
    FRANCHISE_TAX = "franchise_tax"
    STATEMENT_OF_INFO = "statement_of_info"
    FOREIGN_QUALIFICATION = "foreign_qualification"

@dataclass(frozen=True)
class ClassificationResult:
    intent: ClassificationIntent
    confidence: float
    rule_applied: str
    metadata: dict

class SingleIntentClassifier:
    def __init__(self, conflict_threshold: float = 0.65) -> None:
        self.conflict_threshold = conflict_threshold

    def evaluate(self, entity: NormalizedEntity) -> ClassificationResult:
        if entity.entity_type == EntityType.DOMESTIC_CORP:
            return ClassificationResult(
                intent=ClassificationIntent.ANNUAL_REPORT,
                confidence=1.0,
                rule_applied="DOMESTIC_CORP_DEFAULT",
                metadata={"filing_template": "corp_annual_report_v3"},
            )
        if entity.entity_type == EntityType.LLC:
            return ClassificationResult(
                intent=ClassificationIntent.STATEMENT_OF_INFO,
                confidence=0.95,
                rule_applied="LLC_SOS_DEFAULT",
                metadata={"filing_template": "llc_soi_v2"},
            )
        if entity.entity_type == EntityType.FOREIGN_QUALIFIED:
            return ClassificationResult(
                intent=ClassificationIntent.FOREIGN_QUALIFICATION,
                confidence=0.90,
                rule_applied="FOREIGN_QUAL_RULE",
                metadata={"requires_registered_agent": True},
            )
        return self._probabilistic_fallback(entity)

    def _probabilistic_fallback(self, entity: NormalizedEntity) -> ClassificationResult:
        # Lightweight heuristic scoring for hybrid/edge entities.
        # Confidence is held below 0.70 to force Tier 3 routing.
        return ClassificationResult(
            intent=ClassificationIntent.ANNUAL_REPORT,
            confidence=0.55,
            rule_applied="HEURISTIC_FALLBACK",
            metadata={"requires_manual_review": True},
        )

Parameterizing these rules per state — rather than hardcoding brittle conditionals — is the subject of How to map LLC vs C-Corp filing requirements across 50 states, which extends this decision tree into a versioned, jurisdiction-aware rule engine.

Phase 3 — Tiered Fallback and Routing

Ambiguity is the primary driver of late filings and administrative penalties. The router converts a confidence score into a routing decision, guaranteeing that no ambiguous record silently advances to a state portal. Every halt produces an immutable event that a regulator can trace.

Tier	Trigger condition	Routing action	Audit requirement
Tier 1	Confidence ≥ 0.95, zero attribute conflicts	Direct pipeline execution	Log rule ID, timestamp, payload hash
Tier 2	0.70 ≤ confidence < 0.95, minor heuristic gaps	Async validation queue (cross-reference SOS registry)	Store confidence delta, retry count, source
Tier 3	Confidence < 0.70 or attribute contradiction	Dead-letter queue for legal-ops review	Full diagnostic payload, conflict vector, SLA timer
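The thresholds in the table can be captured as one pure function, which makes the boundary values easy to test in isolation. This is a minimal sketch rather than the router itself: the `Tier` enum, the `tier_for` name, and the `has_conflict` flag are illustrative assumptions, and only the 0.95 and 0.70 cut-offs come from the table.

```python
from enum import Enum


class Tier(Enum):
    EXECUTE = 1   # Tier 1: direct pipeline execution
    VALIDATE = 2  # Tier 2: async validation queue
    HALT = 3      # Tier 3: dead-letter queue for legal-ops review


# Cut-offs taken from the routing table.
TIER_1_MIN = 0.95
TIER_2_MIN = 0.70


def tier_for(confidence: float, has_conflict: bool = False) -> Tier:
    """Map a confidence score and a conflict flag to a routing tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    # An attribute contradiction halts regardless of the score.
    if has_conflict or confidence < TIER_2_MIN:
        return Tier.HALT
    if confidence >= TIER_1_MIN:
        return Tier.EXECUTE
    return Tier.VALIDATE
```

Keeping the mapping separate from the side effects (logging, queue publishing) means the half-open Tier 2 interval can be checked at exactly 0.70 and just under 0.95 without a message bus.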

class RoutingErrorType(str, Enum):
    CONFLICTING_JURISDICTION = "CONFLICTING_JURISDICTION"
    MISSING_FISCAL_DECLARATION = "MISSING_FISCAL_DECLARATION"
    LOW_CONFIDENCE_THRESHOLD = "LOW_CONFIDENCE_THRESHOLD"
    SCHEMA_MUTATION = "SCHEMA_MUTATION"

class ClassificationRouter:
    def __init__(self, classifier: SingleIntentClassifier) -> None:
        self.classifier = classifier

    def route(self, entity: NormalizedEntity) -> ClassificationResult | None:
        result = self.classifier.evaluate(entity)

        if result.confidence >= 0.95:
            logger.info(
                "tier_1_match",
                extra={"entity_id": entity.entity_id, "intent": result.intent.value},
            )
            return result

        if 0.70 <= result.confidence < 0.95:
            logger.warning(
                "tier_2_heuristic",
                extra={"entity_id": entity.entity_id, "confidence": result.confidence},
            )
            self._enqueue_async_validation(entity, result)
            # Tier 2 is held for validation; only Tier 1 emits a result downstream.
            return None

        self._raise_compliance_alert(entity, RoutingErrorType.LOW_CONFIDENCE_THRESHOLD)
        return None

    def _enqueue_async_validation(
        self, entity: NormalizedEntity, result: ClassificationResult
    ) -> None:
        # Production: publish to SQS/Kafka with a deterministic idempotency key.
        logger.info("async_validation_queued", extra={"entity_id": entity.entity_id})

    def _raise_compliance_alert(
        self, entity: NormalizedEntity, error_type: RoutingErrorType
    ) -> None:
        logger.critical(
            "tier_3_blocked",
            extra={
                "entity_id": entity.entity_id,
                "error_type": error_type.value,
                "requires_legal_review": True,
            },
        )

This error taxonomy maps directly to statutory audit requirements and aligns with the obligation-state model that the Deadline Tracking & Routing Engines consume downstream: every classification halt generates an immutable record explaining exactly why a filing was delayed and what remediation began.
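One way to produce the "rule ID, timestamp, payload hash" record that Tier 1 requires is sketched below. The `AuditEvent` shape and the helper names are assumptions for illustration, not part of the classifier shown above. The timestamp belongs to the audit envelope only and is never fed back into classification, so the classifier stays a pure function.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def payload_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEvent:
    entity_id: str
    rule_applied: str
    payload_sha256: str
    recorded_at: str  # ISO 8601 timestamp in UTC


def build_audit_event(entity_id: str, rule_applied: str, payload: dict) -> AuditEvent:
    """Assemble the audit record for one classification decision."""
    return AuditEvent(
        entity_id=entity_id,
        rule_applied=rule_applied,
        payload_sha256=payload_hash(payload),
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
```

Sorting keys before hashing means two payloads with the same content but different key order produce the same digest, which is what lets an auditor re-derive the hash from the stored record.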

Edge Cases and Jurisdiction-Specific Gotchas

The same EntityType maps to different obligations depending on jurisdiction, and several states carry quirks that defeat a naive classifier.

Jurisdiction	Entity quirk	Classifier handling
Delaware (US-DE)	LLCs owe a flat tax but file no annual report; only corporations file reports (DGCL § 502)	Route DE + LLC to `FRANCHISE_TAX`, never `ANNUAL_REPORT`
California (US-CA)	LLCs and corporations both file a Statement of Information, on different cadences (§ 17702.09 vs § 1502)	Distinct templates per entity type within one jurisdiction
New York (US-NY)	Corporations file a biennial statement, not annual	Set a 2-year cadence flag in result metadata
Texas (US-TX)	No separate annual report; obligation rides on the franchise tax / Public Information Report	Classify TX corp/LLC to `FRANCHISE_TAX` with a PIR sub-template
Foreign-qualified (any)	A foreign registration creates obligations in both home and qualifying states	Emit `FOREIGN_QUALIFICATION` and flag for multi-jurisdiction expansion
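A few rows of the table can be expressed as data keyed on the (jurisdiction, entity type) pair, so that no state falls back to a single default. This is a hedged sketch: the `ObligationRule` shape, the rule IDs, and the `"pir"` sub-template name are hypothetical, and the California cadences (biennial for LLCs, annual for stock corporations) reflect the author's understanding of current practice and should be verified against the statute before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObligationRule:
    intent: str
    rule_id: str                        # hypothetical IDs, for illustration only
    cadence_years: int = 1
    sub_template: Optional[str] = None  # hypothetical template name


# Keyed on (jurisdiction, entity type) so no state collapses to one default.
RULES = {
    ("US-DE", "domestic_c_corp"): ObligationRule("annual_report", "DE_CORP_REPORT"),
    ("US-DE", "limited_liability_company"): ObligationRule("franchise_tax", "DE_LLC_FLAT_TAX"),
    # Assumed cadences: LLC every two years, stock corporation every year.
    ("US-CA", "limited_liability_company"): ObligationRule("statement_of_info", "CA_LLC_SOI", 2),
    ("US-CA", "domestic_c_corp"): ObligationRule("statement_of_info", "CA_CORP_SOI", 1),
    ("US-TX", "domestic_c_corp"): ObligationRule("franchise_tax", "TX_CORP_PIR", 1, "pir"),
    ("US-TX", "limited_liability_company"): ObligationRule("franchise_tax", "TX_LLC_PIR", 1, "pir"),
}


def lookup_rule(jurisdiction: str, entity_type: str) -> Optional[ObligationRule]:
    """Return the explicit rule, or None so the caller can halt for review."""
    return RULES.get((jurisdiction, entity_type))
```

Returning `None` for an unmapped pair, instead of a default obligation, keeps the "halt rather than guess" behavior in the caller's hands.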

A common silent failure is an entity registered in one state but operating under another state’s operating agreement. The single-intent model treats this as a contradiction and forces a Tier 3 halt rather than guessing — the correct behavior, because the resolution is a legal judgement, not a heuristic.
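The contradiction check can run before any rule is evaluated. The sketch below assumes a `governing_law_state` attribute that the `NormalizedEntity` model above does not carry; it and the `AttributeConflict` exception are hypothetical names used only to show the shape of a pre-classification halt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityFacts:
    entity_id: str
    formation_state: str                       # canonical ISO 3166-2 code
    governing_law_state: Optional[str] = None  # hypothetical field


class AttributeConflict(Exception):
    """Raised before classification so the record halts at Tier 3."""


def assert_single_intent(facts: EntityFacts) -> None:
    """Raise if the governing-law state contradicts the formation state."""
    governing = facts.governing_law_state
    if governing is not None and governing != facts.formation_state:
        raise AttributeConflict(
            f"{facts.entity_id}: formed in {facts.formation_state} "
            f"but governed by {governing}"
        )
```

A caller would catch `AttributeConflict` and emit the Tier 3 diagnostic payload; the check itself makes no attempt to pick a winner between the two states.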

Verification and Testing

Because the classifier is pure and deterministic, it is exhaustively testable without a network. Assert that each canonical entity type maps to its expected obligation, that malformed input raises before construction, and that confidence thresholds route correctly.

import pytest

def make_entity(jurisdiction: JurisdictionCode, etype: EntityType) -> NormalizedEntity:
    return NormalizedEntity(
        entity_id="ent_test",
        canonical_name="ACME HOLDINGS",
        jurisdiction_iso=jurisdiction,
        entity_type=etype,
    )

def test_domestic_corp_is_tier_1_annual_report() -> None:
    clf = SingleIntentClassifier()
    result = clf.evaluate(make_entity(JurisdictionCode.DE, EntityType.DOMESTIC_CORP))
    assert result.intent is ClassificationIntent.ANNUAL_REPORT
    assert result.confidence >= 0.95  # Tier 1: direct execution

def test_partnership_falls_back_below_threshold() -> None:
    clf = SingleIntentClassifier()
    result = clf.evaluate(make_entity(JurisdictionCode.CA, EntityType.PARTNERSHIP))
    assert result.confidence < 0.70  # forces Tier 3 review
    assert result.metadata["requires_manual_review"] is True

def test_unknown_jurisdiction_raises() -> None:
    with pytest.raises(ValueError, match="Unsupported jurisdiction"):
        NormalizedEntity.from_raw(
            "ent_x",
            RawEntityPayload(
                entity_name="X", formation_state="Atlantis", entity_type_raw="llc"
            ),
        )

def test_low_confidence_returns_none_from_router() -> None:
    router = ClassificationRouter(SingleIntentClassifier())
    assert router.route(make_entity(JurisdictionCode.TX, EntityType.PARTNERSHIP)) is None

Property-based fixtures (via hypothesis) over the full alias and jurisdiction matrix are the most effective way to surface mapping gaps before they reach production, since they exercise combinations a hand-written suite will miss.
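Where hypothesis is not available, an exhaustive sweep gives a rough approximation of the same idea. The sketch below swaps in `itertools.product` from the standard library as a stand-in for property-based generation; it uses trimmed, string-only copies of the Phase 1 maps, and the `normalize`, `variants`, and `sweep` helpers are illustrative names, not part of the classifier.

```python
import itertools

# Trimmed copies of the Phase 1 maps, using plain strings instead of enums.
JURISDICTIONS = {"DE": "US-DE", "DELAWARE": "US-DE", "CA": "US-CA", "CALIFORNIA": "US-CA"}
ALIASES = {"c-corp": "domestic_c_corp", "llc": "limited_liability_company"}


def normalize(state: str, alias: str) -> tuple[str, str]:
    """Resolve free-text state and entity-type values, or raise ValueError."""
    jkey = " ".join(state.split()).upper()
    akey = alias.strip().lower()
    if jkey not in JURISDICTIONS:
        raise ValueError(f"Unsupported jurisdiction code: {jkey!r}")
    if akey not in ALIASES:
        raise ValueError(f"Unrecognized entity type alias: {akey!r}")
    return JURISDICTIONS[jkey], ALIASES[akey]


def variants(text: str) -> list[str]:
    """Casing and padding variants that must normalize identically."""
    return [text, text.lower(), text.title(), f"  {text}  "]


def sweep() -> int:
    """Check every state/alias pair under every variant; return the case count."""
    checked = 0
    for state, alias in itertools.product(JURISDICTIONS, ALIASES):
        expected = normalize(state, alias)
        for s, a in itertools.product(variants(state), variants(alias)):
            assert normalize(s, a) == expected, (s, a)
            checked += 1
    return checked
```

The sweep only covers the variants it enumerates, so it complements a property-based suite and does not replace one.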

Troubleshooting

Records resolve to the wrong obligation for Delaware LLCs

Root cause: a default rule routing all LLCs to STATEMENT_OF_INFO. Delaware LLCs owe a flat annual tax, which this taxonomy models as FRANCHISE_TAX, and file no annual report. Add a jurisdiction guard so the EntityType.LLC branch checks jurisdiction_iso and routes US-DE to FRANCHISE_TAX. Verify against the Delaware row in the jurisdiction-gotchas table above.

Validation dead-letter queue is filling with "Unsupported jurisdiction"

Root cause: free-text state values that the _JURISDICTION_MAP does not cover (e.g. “Calif.”, “N.Y.”). Normalize aggressively — uppercase, collapse whitespace, strip punctuation — and expand the alias map. Treat the dead-letter queue as a feedback signal: every distinct unmapped value is a missing registry entry, not a defect in the classifier.
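A more aggressive key function might look like the sketch below. The `jurisdiction_key` and `resolve` names and the added `"CALIF"` entry are assumptions for illustration. Only periods and commas are stripped, so hyphenated ISO codes such as `US-DE` survive intact.

```python
import re

# Illustrative alias map; "CALIF" is an example of an entry added from DLQ feedback.
ALIASES = {
    "DE": "US-DE", "DELAWARE": "US-DE",
    "CA": "US-CA", "CALIFORNIA": "US-CA", "CALIF": "US-CA",
    "NY": "US-NY", "NEW YORK": "US-NY",
}


def jurisdiction_key(raw: str) -> str:
    """Uppercase, drop periods and commas, collapse whitespace."""
    no_punct = re.sub(r"[.,]", "", raw.upper())
    return re.sub(r"\s+", " ", no_punct).strip()


def resolve(raw: str) -> str:
    """Return the canonical code, or raise so the record is dead-lettered."""
    key = jurisdiction_key(raw)
    try:
        return ALIASES[key]
    except KeyError:
        raise ValueError(f"Unsupported jurisdiction code: {key!r}") from None
```

Including the normalized key in the error message means the dead-letter queue groups "Calif." and "calif" under one missing entry, not two.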

The same entity gets different classes on re-runs

Root cause: non-determinism in the probabilistic fallback (random seed, dictionary ordering, or wall-clock input). The classifier must be a pure function of the NormalizedEntity. Remove any time- or randomness-dependent inputs from _probabilistic_fallback, and assert idempotency in tests by classifying the same record twice and comparing results.
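An idempotency assertion can be written as a small generic helper. In this sketch the `Result` dataclass and the `fallback` function are simplified stand-ins for ClassificationResult and _probabilistic_fallback, and `assert_idempotent` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    intent: str
    confidence: float
    rule_applied: str


def fallback(entity_type: str) -> Result:
    """Stand-in fallback: no clock, no RNG, no global state."""
    return Result("annual_report", 0.55, "HEURISTIC_FALLBACK")


def assert_idempotent(classify: Callable, record, runs: int = 5) -> None:
    """Classify the same record repeatedly and require equal results."""
    first = classify(record)
    for _ in range(runs - 1):
        again = classify(record)
        assert again == first, f"non-deterministic: {first!r} != {again!r}"
```

Because the result is a frozen dataclass, `==` compares every field, so drift in the confidence score is caught as readily as a changed intent.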

Foreign-qualified entities are missing one jurisdiction's filing

Root cause: treating a foreign registration as a single obligation. A foreign qualification creates obligations in both the home and qualifying states. Emit FOREIGN_QUALIFICATION and a fan-out flag so the obligation expands into per-jurisdiction records before the deadline calendar runs.
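The fan-out can be sketched as a function that turns one entity into one record per jurisdiction. The `Obligation` shape and the `fan_out` signature are assumptions for illustration; the home-state intent is passed in because it comes from the domestic rules, not from the foreign-qualification rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obligation:
    entity_id: str
    jurisdiction: str
    intent: str


def fan_out(
    entity_id: str,
    home_state: str,
    home_intent: str,
    qualified_states: list[str],
) -> list[Obligation]:
    """Expand one foreign-qualified entity into per-jurisdiction records."""
    records = [Obligation(entity_id, home_state, home_intent)]
    # Deduplicate and sort so the output order is deterministic.
    for state in sorted(set(qualified_states) - {home_state}):
        records.append(Obligation(entity_id, state, "foreign_qualification"))
    return records
```

Each emitted record still carries exactly one intent, so the single-intent guarantee holds per record even though the entity owes several filings.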

Operational Checklist

Pre-deployment validation for the classification stage:

Every supported jurisdiction has explicit per-entity-type rules; no jurisdiction relies on a single default obligation.
_JURISDICTION_MAP and _ENTITY_ALIAS_MAP are versioned, and unmapped values raise rather than silently default.
Every ClassificationResult records the rule_applied and a confidence score for audit.
The probabilistic fallback is deterministic and held below the Tier 2 threshold.
Tier 3 halts emit an immutable diagnostic payload with the conflict vector and an SLA timer.
Idempotency is asserted in tests: identical input yields identical output across runs.
The async validation queue uses a deterministic idempotency key to prevent duplicate routing.
Delaware LLC (flat tax, no report) and New York biennial cadence are covered by explicit fixtures.
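For the idempotency-key item, one possible derivation hashes the fields that identify a routing decision. This is a sketch under assumptions: the choice of fields and the `idempotency_key` name are illustrative, and a production key might also include a rule-set version.

```python
import hashlib


def idempotency_key(entity_id: str, intent: str, rule_applied: str) -> str:
    """Derive a stable key from the fields that identify a routing decision."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    material = "\x1f".join((entity_id, intent, rule_applied))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Publishing the same entity, intent, and rule twice yields the same key, which lets the queue consumer discard the duplicate.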

Frequently Asked Questions

Why enforce a single obligation per record instead of attaching all applicable filings?

Single-intent resolution keeps the downstream pipeline auditable and idempotent. Each obligation has its own deadline, fee, and portal path; bundling them hides conflicts and makes a partial failure ambiguous. When an entity genuinely owes multiple filings (a foreign qualification, for instance), the classifier fans those out into separate single-intent records rather than one composite class.

How does the classifier handle a brand-new entity type the rules have never seen?

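It depends on where the gap is. An alias missing from _ENTITY_ALIAS_MAP raises ValueError during normalization and never reaches the classifier. A type that normalizes but has no explicit rule routes to the probabilistic fallback, which returns a confidence below 0.70 and a requires_manual_review flag. That forces a Tier 3 halt and a legal-ops review rather than a guess. The reviewed outcome then becomes a new explicit rule, so the same type classifies deterministically next time.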

Where does this stage sit relative to deadline computation and portal submission?

It runs first. Classification produces the obligation and jurisdictional metadata; State Filing Deadline Calendars turn that into dated obligations, and the Secretary of State Portal API Ingestion layer submits them. A classification error therefore corrupts everything downstream of it, which is why the stage halts on ambiguity instead of guessing.

Can the same Python module classify all fifty states?

The decision-tree structure generalizes, but the per-state obligation mapping must be data, not code. The full parameterization — versioned per-jurisdiction rules with effective-date ranges — is covered in How to map LLC vs C-Corp filing requirements across 50 states.
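To illustrate what "versioned rules with effective-date ranges" could look like as data, here is a minimal sketch. The `VersionedRule` shape and `rule_in_force` helper are hypothetical, and the dates in `RULES` are placeholders chosen for the example, not statutory effective dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class VersionedRule:
    jurisdiction: str
    entity_type: str
    intent: str
    version: str
    effective_from: date
    effective_to: Optional[date] = None  # None means still in force


# Placeholder dates for illustration only.
RULES = [
    VersionedRule("US-DE", "limited_liability_company", "franchise_tax", "v1",
                  date(2020, 1, 1), date(2023, 12, 31)),
    VersionedRule("US-DE", "limited_liability_company", "franchise_tax", "v2",
                  date(2024, 1, 1)),
]


def rule_in_force(rules, jurisdiction: str, entity_type: str, on: date) -> VersionedRule:
    """Return the single rule effective on the given date, or raise."""
    matches = [
        r for r in rules
        if r.jurisdiction == jurisdiction
        and r.entity_type == entity_type
        and r.effective_from <= on
        and (r.effective_to is None or on <= r.effective_to)
    ]
    # Zero matches is a coverage gap; two or more is an overlapping range.
    if len(matches) != 1:
        raise LookupError(f"expected exactly one rule, found {len(matches)}")
    return matches[0]
```

Requiring exactly one match turns both gaps and overlaps in the date ranges into hard failures, in keeping with the halt-on-ambiguity stance of the rest of the stage.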
