Data Dictionary & Ingestion Architecture

Authoritative source: data-dictionary/, backend/app/sql_gen/dictionary_loader.py

Four CSV files define every table, column, join, and KPI in the Asurion EDW. The same files drive creation of the Databricks warehouse tables, prove that the warehouse matches the dictionary, and scope every LLM prompt at query time. The dictionary is the only view the SQL generator has of the warehouse — there is no vector DB, no schema introspection, no embeddings (ADR-PROTO-004).

Why Command Center does this

Generic LLM-to-SQL tools let the model hallucinate joins and reference columns that don't exist. Command Center anchors every generated query to a curated dictionary: the LLM sees only approved joins from join_map.csv, only validated columns from column_dictionary.csv, and only pre-curated KPI formulas from kpi_catalog.csv. The full dictionary fits in Claude's context window, so there's no retrieval step to tune or debug — per ADR-PROTO-004: "Load the entire data dictionary into a Python module, cache in memory at boot, inject the relevant subset into the Bedrock prompt context per request."

The full picture

flowchart LR
    subgraph dict["Data Dictionary (4 CSVs + guidelines)"]
        tc["table_catalog.csv\n52 tables"]
        cd["column_dictionary.csv\n1,664 columns"]
        jm["join_map.csv\n138 approved joins"]
        kc["kpi_catalog.csv\n42 KPIs"]
        guide["ai_query_guidelines.md"]
    end

    loader["dictionary_loader.py\nLRU-cached DataDictionary"]

    subgraph seed["Seeding (make seed-databricks-l3)"]
        tm["type_mapper.py\nstring;varchar → STRING"]
        ddl["render CREATE TABLE\n(on-the-fly from CSVs)"]
        gen["l3_asurion_generators.py\n500 product rows\n5,000 claim rows"]
    end

    dbx[("Databricks SQL Warehouse\nworkspace.l3_asurion")]

    subgraph val["Validation (make validate-dictionary)"]
        s1["Structural pass\nFK resolution\nKPI base tables"]
        s2["Live pass\nDESCRIBE TABLE\ntype equivalence"]
    end

    subgraph runtime["Runtime (POST /v1/widgets/sql/generate)"]
        sub["subset_for_tables()\nscoped slice"]
        prompt["Bedrock prompt\nmetric + subset + guidelines"]
        safety["safety.py\nSELECT-only, allowlist,\nLIMIT injection"]
    end

    dict --> loader
    loader --> tm --> ddl --> dbx
    loader --> gen --> dbx
    dbx --> s2
    dict --> s1
    loader --> sub --> prompt --> safety --> dbx

    style dict fill:#f0f4ff,stroke:#4a6fa5
    style seed fill:#fff4e6,stroke:#c47f17
    style val fill:#e8f5e9,stroke:#2e7d32
    style runtime fill:#fce4ec,stroke:#c62828

The dictionary participates in three distinct phases — each described below.

The dictionary: 4 CSVs, 1 truth

erDiagram
    table_catalog ||--|{ column_dictionary : "schema + table_name"
    table_catalog ||--o{ join_map : "from_table / to_table"
    table_catalog ||--o{ kpi_catalog : "base_table + schema_scope"
    column_dictionary ||--o{ join_map : "from_column / to_column"

    table_catalog {
        string schema "l3_asurion, l3_att, l3_verizon, l3_ubreakifix"
        string table_name "ev_claim, ev_product_catalog, ..."
        string grain "one row per claim"
        int row_count "5000"
        string primary_key_columns "claim_hkey"
        string client_isolation_rule "Do not join across schemas"
    }
    column_dictionary {
        string column_name "cost_avoided_usd"
        string data_type "decimal(15,2)"
        string column_role "metric"
        string foreign_key_reference "l3_asurion.ev_claim_key.claim_id"
        string ai_usage_notes "Good candidate for SUM aggregations"
    }
    join_map {
        string from_table "ev_claim"
        string to_table "ev_claim_key"
        string relationship_type "many_to_one"
        string confidence "high"
        string ai_join_rule "Join only within same schema"
    }
    kpi_catalog {
        string kpi_name "Claim Volume"
        string formula_sql "COUNT(DISTINCT claim_id)"
        string date_column "claim_started_dttm_utc"
        string default_grain "month"
    }

ai_query_guidelines.md is the fifth file — a prose document surfaced verbatim into the Bedrock system prompt as DataDictionary.guidelines. It enforces: joins via join_map.csv only, per-schema data isolation, COUNT(DISTINCT claim_id) for volume KPIs, and aggregated chart-ready output (no raw row dumps).

dictionary_loader.py reads all five files with stdlib csv.DictReader, builds a DataDictionary Pydantic model (data_dictionary.py), and LRU-caches the result on (root, mtime_signature). Edits to the CSVs are picked up on the next call without a container restart.
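
The caching pattern is worth a concrete look. Below is a minimal sketch keyed on (root, mtime signature); helper names such as _mtime_signature and _load are illustrative, not the loader's actual internals:

```python
# Sketch only: the real dictionary_loader.py differs in structure, but the
# caching idea is the same: key the LRU cache on (root, mtime signature).
import csv
from functools import lru_cache
from pathlib import Path

CSV_FILES = ("table_catalog.csv", "column_dictionary.csv",
             "join_map.csv", "kpi_catalog.csv")

def _mtime_signature(root: Path) -> tuple[int, ...]:
    # One mtime per file: editing any CSV changes the key and busts the cache.
    return tuple((root / name).stat().st_mtime_ns for name in CSV_FILES)

@lru_cache(maxsize=8)
def _load(root: str, signature: tuple[int, ...]):
    base = Path(root)
    parsed = {}
    for name in CSV_FILES:
        with open(base / name, newline="", encoding="utf-8") as fh:
            parsed[name] = list(csv.DictReader(fh))
    guidelines = (base / "ai_query_guidelines.md").read_text(encoding="utf-8")
    # The real loader validates all of this into the DataDictionary Pydantic model.
    return parsed, guidelines

def get_default(root: str = "data-dictionary"):
    # Recomputing the signature costs a handful of stat() calls per request,
    # so CSV edits are picked up on the next call without a container restart.
    return _load(root, _mtime_signature(Path(root)))
```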

From dictionary to warehouse

flowchart TB
    csv["table_catalog.csv + column_dictionary.csv\n52 tables, 1,664 columns"]
    loader["dictionary_loader.get_default()"]
    loop["For each FQN in TABLES_TO_SEED"]

    subgraph ddl_path["DDL Generation"]
        mapper["type_mapper.to_databricks_type()\nstring;varchar → STRING\ndecimal(15,2) → DECIMAL(15,2)\ntimestamp;timestamp(3) → TIMESTAMP"]
        create["render_create_table()\nCREATE OR REPLACE TABLE\nfrom dictionary columns"]
    end

    subgraph data_path["Data Generation"]
        gen_prod["generate_product_catalog_rows()\n~500 rows, 20 device models"]
        gen_claim["generate_claim_rows()\n~5,000 rows, FK to products\nweighted status mix:\nRESOLVED 55% | IN_PROGRESS 18%\nPENDING_PARTS 7% | ..."]
    end

    upload["uploader.insert_rows()"]
    dbx_tables[("workspace.l3_asurion\n.ev_product_catalog (500 rows)\n.ev_claim (5,000 rows)")]

    csv --> loader --> loop
    loop --> mapper --> create --> upload
    loop --> gen_prod --> upload
    loop --> gen_claim --> upload
    upload --> dbx_tables

    style ddl_path fill:#fff4e6,stroke:#c47f17
    style data_path fill:#e3f2fd,stroke:#1565c0

make seed-databricks-l3 runs l3_asurion_main.py. There is no static SQL file for table definitions — DDL is rendered on the fly from the dictionary via type_mapper.py. If a column type in the CSV doesn't map to a Databricks type, to_databricks_type() raises ValueError and the seed fails immediately. The --reset flag drops and recreates both tables; the whole run takes ~35s against a Serverless Starter warehouse.
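
For illustration, a hedged sketch of the mapping behaviour described above: the variant list (string;varchar), parameterised decimals, and the fail-fast ValueError. The real type_mapper.py covers more types and edge cases.

```python
# Illustrative only; the real to_databricks_type() handles more source types.
import re

_SIMPLE = {
    "string": "STRING", "varchar": "STRING",
    "int": "INT", "bigint": "BIGINT", "double": "DOUBLE",
    "boolean": "BOOLEAN", "date": "DATE", "timestamp": "TIMESTAMP",
}

def to_databricks_type(raw: str) -> str:
    # The dictionary may list variants separated by ';' (e.g. "string;varchar"
    # or "timestamp;timestamp(3)"); the first variant that maps wins.
    for candidate in (part.strip().lower() for part in raw.split(";")):
        if m := re.fullmatch(r"decimal\((\d+)\s*,\s*(\d+)\)", candidate):
            return f"DECIMAL({m.group(1)},{m.group(2)})"
        base = re.sub(r"\(.*\)$", "", candidate)      # timestamp(3) -> timestamp
        if base in _SIMPLE:
            return _SIMPLE[base]
    # Fail fast: an unmapped type aborts the seed instead of creating a bad table.
    raise ValueError(f"No Databricks mapping for dictionary type {raw!r}")
```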

Proving the dictionary matches reality

flowchart TB
    subgraph pass1["Pass 1: Structural (always runs, offline)"]
        fk["FK resolution\nevery foreign_key_reference\nresolves to a real column"]
        kpi["KPI base tables\nevery kpi_catalog.base_table\nexists in table_catalog"]
        sparse["Sparse schema detection\nschemas in table_catalog\nmissing from column_dictionary"]
    end

    subgraph pass2["Pass 2: Live (when DATABRICKS_HOST is set)"]
        desc["DESCRIBE TABLE\nfor each seeded FQN"]
        col_check["Column presence\nevery dictionary column\nmust appear in DESCRIBE"]
        type_check["Type equivalence\ntype_mapper.equivalent()\ndictionary type == warehouse type"]
    end

    dict["data-dictionary/*.csv"] --> pass1
    dbx[("Databricks\nworkspace.l3_asurion")] --> desc
    desc --> col_check --> type_check

    result{{"Exit 0: all green\nExit non-zero: structural failure\nor live type mismatch"}}

    pass1 --> result
    type_check --> result

    style pass1 fill:#e8f5e9,stroke:#2e7d32
    style pass2 fill:#fff3e0,stroke:#e65100

make validate-dictionary runs validate_dictionary.py. The structural pass works entirely from the CSV files — no Databricks connection needed. The live pass runs DESCRIBE TABLE against each seeded table and asserts every dictionary-declared column is present with an equivalent type (via type_mapper.equivalent()). Tables in the dictionary but outside the seeded subset produce a WARN, not an error — full EDW parity is Phase 2.
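
A sketch of what the structural pass checks, assuming field names that mirror the CSV headers in the ER diagram above; the real validate_dictionary.py also handles sparse-schema detection and produces a per-check report:

```python
# Sketch of the offline structural pass; field names are assumptions based on
# the CSV headers, not the validator's actual attribute names.
def structural_pass(dictionary) -> list[str]:
    errors: list[str] = []

    # Every foreign_key_reference must resolve to a real schema.table.column.
    known_columns = {
        f"{c.schema}.{c.table_name}.{c.column_name}" for c in dictionary.columns
    }
    for col in dictionary.columns:
        ref = col.foreign_key_reference
        if ref and ref not in known_columns:
            errors.append(f"unresolved FK {col.table_name}.{col.column_name} -> {ref}")

    # Every KPI must point at a table that exists in table_catalog.
    known_tables = {f"{t.schema}.{t.table_name}" for t in dictionary.tables}
    for kpi in dictionary.kpis:
        if f"{kpi.schema_scope}.{kpi.base_table}" not in known_tables:
            errors.append(f"KPI {kpi.kpi_name!r} has unknown base table {kpi.base_table}")

    return errors  # a non-empty list means exit non-zero
```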

At query time: scoped context, not the whole catalog

sequenceDiagram
    participant FE as Frontend
    participant API as Data Resolver
    participant Route as metric_routing.yaml
    participant Dict as DataDictionary
    participant LLM as Bedrock (Sonnet 4.6)
    participant Safe as safety.py
    participant DBX as Databricks

    FE->>API: POST /v1/widgets/{id}/data
    API->>Route: lookup metric_name
    Route-->>API: backend: databricks, cache: 300s

    API->>Dict: subset_for_tables(["l3_asurion.ev_claim"])
    Dict-->>API: scoped tables + joins + KPIs + guidelines

    API->>LLM: system prompt (subset + guidelines)<br/>+ metric + few-shot examples
    LLM-->>API: {sql, tables_used, explanation}

    API->>Safe: validate(sql, allowlist)
    Note over Safe: SELECT-only check<br/>table allowlist check<br/>LIMIT injection/cap<br/>no DDL/DML

    Safe-->>API: SafetyResult(is_safe=true, modified_sql)
    API->>DBX: execute(modified_sql)
    DBX-->>API: rows + columns
    API-->>FE: {data, source: "bedrock", generated_sql, ...}

The 1,664-column dictionary never hits the LLM as a whole. DataDictionary.subset_for_tables() (data_dictionary.py) returns only the tables, joins, and KPIs relevant to the requested metric. prompts.py splices this subset plus verbatim ai_query_guidelines.md into the Bedrock system prompt. After generation, safety.py re-derives the table set from the SQL AST (via sqlglot, dialect databricks) — it does not trust the LLM's tables_used claim.
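
A compact sketch of that safety step, assuming sqlglot and a pre-built table allowlist; the real safety.py runs five checks and returns richer diagnostics:

```python
# Sketch, not the real safety.py: SELECT-only check, allowlist re-derived from
# the AST, and LIMIT injection/cap. Assumes sqlglot is installed.
from dataclasses import dataclass

import sqlglot
from sqlglot import exp

@dataclass
class SafetyResult:
    is_safe: bool
    modified_sql: str | None = None
    reason: str | None = None

def validate(sql: str, allowlist: set[str], max_rows: int = 1000) -> SafetyResult:
    parsed = sqlglot.parse_one(sql, read="databricks")

    # SELECT-only: anything else (DDL, DML, multi-statement tricks) is rejected.
    if not isinstance(parsed, exp.Select):
        return SafetyResult(False, reason="only SELECT statements are allowed")

    # Re-derive the referenced tables from the AST; never trust the model's
    # own tables_used claim.
    tables = {
        ".".join(part for part in (t.db, t.name) if part)
        for t in parsed.find_all(exp.Table)
    }
    if not tables <= allowlist:
        return SafetyResult(False, reason=f"tables outside allowlist: {tables - allowlist}")

    # LIMIT injection/cap so a generated query can never drag back a full table.
    if parsed.args.get("limit") is None:
        parsed = parsed.limit(max_rows)

    return SafetyResult(True, modified_sql=parsed.sql(dialect="databricks"))
```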

Components

| Component | File | Role |
| --- | --- | --- |
| Dictionary loader | backend/app/sql_gen/dictionary_loader.py | Parses CSVs into DataDictionary, LRU-cached |
| Data dictionary model | backend/app/sql_gen/data_dictionary.py | DataDictionary, TableSchema, ColumnSchema, JoinKey, KpiDefinition |
| Type mapper | backend/app/sql_gen/type_mapper.py | to_databricks_type(), equivalent() |
| Seeder | backend/scripts/databricks_mock_data/l3_asurion_main.py | Renders DDL from dictionary, generates rows, uploads |
| Generators | backend/scripts/databricks_mock_data/l3_asurion_generators.py | Deterministic synthetic data (product catalog, claims) |
| Validator | backend/scripts/validate_dictionary.py | Two-pass: structural + live warehouse check |
| SQL generator | backend/app/sql_gen/generator.py | Orchestrates metric resolution, prompt build, LLM call, safety, execution |
| Safety layer | backend/app/sql_gen/safety.py | 5-check validation via sqlglot (SELECT-only, table allowlist, LIMIT, no DDL/DML) |
| Routing config | config/metric_routing.yaml | Maps each metric to postgres or databricks backend |
| Databricks client | backend/app/databricks/client.py | Pooled SQL Warehouse client with typed exceptions |
| Data resolver | backend/app/widgets/data_resolver.py | Dispatches to postgres or databricks based on routing |

Interface contracts

| Boundary | Schema | File |
| --- | --- | --- |
| CSV → Python | DataDictionary (Pydantic) | data_dictionary.py |
| Python → Databricks DDL | to_databricks_type() return | type_mapper.py |
| LLM → Safety | {sql, tables_used, explanation} tool output | generator.py |
| Safety → Execution | SafetyResult with modified_sql | safety.py |
| Resolver → Frontend | DataResolverResponse | data_resolver.py |

Out of scope

  • Metrics catalog and boot validation — covered in data-dictionary.md and sql-generator.md.
  • Cache layer — cache.py handles Redis TTL caching; not part of the dictionary flow.
  • Frontend rendering — SourceBadge, MetricInfoBadge, and the Generated SQL tab consume the response but don't interact with the dictionary.
  • Cross-schema queries — the dictionary describes 4 schemas but only l3_asurion is seeded today; cross-schema UNION queries are Phase 2.