# Data Dictionary

Authoritative reference for the `data-dictionary/` folder, the seeders that consume it, and the validator that proves the dictionary still matches the live warehouse. Pair with sql-generator.md (the consumer) and ADR-PROTO-004 (in-prompt context, not a vector DB).

## What's in data-dictionary/

Five files. Edit ONLY through pull requests reviewed by the data engineering owner.

| File | Schema | Used by |
| --- | --- | --- |
| `table_catalog.csv` | One row per (schema, table): grain, row counts, primary/business/partition keys, generation order. | `dictionary_loader._load_table_catalog` |
| `column_dictionary.csv` | One row per (schema, table, column): type (the `'string; varchar'` style; see sql-generator.md § Type mapper), role, FK reference, AI usage notes. | `dictionary_loader._load_column_dictionary` |
| `join_map.csv` | One row per approved join. The SQL generator MUST NOT invent joins outside this set (per ai_query_guidelines.md rule #1). | `dictionary_loader._load_join_map` |
| `kpi_catalog.csv` | Pre-curated KPI definitions with ready-to-template `formula_sql`. | `dictionary_loader._load_kpi_catalog` |
| `ai_query_guidelines.md` | Human-curated SQL generation rules: joins via join_map only, prefer claim-level KPIs from ev_claim, use `COUNT(DISTINCT claim_id)` for volume, etc. | Surfaced as `DataDictionary.guidelines` for Prompt 3's Bedrock system prompt. |

Today the dictionary describes 4 schemas (`l3_asurion`, `l3_att`, `l3_ubreakifix`, `l3_verizon`), 52 tables, 1,663 columns, 138 joins, and 42 KPIs.
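
A minimal consumption sketch. `get_default()`, `guidelines`, and `subset_for_tables` are named elsewhere in this doc; the `tables` and `joins` attributes are illustrative assumptions:

```python
from app.sql_gen import dictionary_loader

# Load all five files from DATA_DICTIONARY_ROOT.
dictionary = dictionary_loader.get_default()

# Attribute names below are assumptions for illustration.
print(len(dictionary.tables))       # 52 today
print(len(dictionary.joins))        # 138 today
print(dictionary.guidelines[:120])  # verbatim ai_query_guidelines.md
```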

## Source-of-truth discipline

The dictionary is the single source of truth for column shape, types, and approved joins. This is non-negotiable, and it is rule #6 in CLAUDE.md § Build discipline call-outs:

> Adding a metric or a table means adding a row in the CSVs, not hardcoding in Python.

Concretely:

- The l3_asurion seeder (`backend/scripts/databricks_mock_data/l3_asurion_main.py`) renders DDL on the fly from column_dictionary.csv via `app.sql_gen.type_mapper.to_databricks_type`; there is no static 02_tables.sql for l3_asurion. If a column in the dictionary disagrees with what the seeder would build, the seeder fails, so drift cannot survive. (A rendering sketch follows this list.)
- The validator (`backend/scripts/validate_dictionary.py`) checks the dictionary against itself (FK targets resolve, KPI base_tables exist) AND against the live warehouse (every dictionary column appears in DESCRIBE TABLE with an equivalent type).
- The SQL generator (Prompt 3, not yet in the code) will scope Bedrock prompts via `DataDictionary.subset_for_tables(...)` and reject any join outside join_map.csv.
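
A minimal sketch of the on-the-fly DDL rendering, assuming each dictionary row carries `column` and `type` keys (the row shape is an assumption; `to_databricks_type` is the real mapper named above):

```python
from app.sql_gen.type_mapper import to_databricks_type

def render_create_table(fqn: str, columns: list[dict]) -> str:
    """Render CREATE OR REPLACE TABLE DDL from column_dictionary rows (sketch)."""
    col_defs = ",\n  ".join(
        f"{row['column']} {to_databricks_type(row['type'])}" for row in columns
    )
    return f"CREATE OR REPLACE TABLE {fqn} (\n  {col_defs}\n)"
```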

## How the dictionary is mounted

docker-compose.yml lines 79-87 mount the folder read-only:

```yaml
volumes:
  - ./data-dictionary:/app/data-dictionary:ro
```

`DATA_DICTIONARY_ROOT=/app/data-dictionary` is exported to the api container so `backend/app/sql_gen/dictionary_loader.py` can find it. The `:ro` flag means a routing typo or a CSV edit cannot be quietly patched inside the container instead of the repo; every change goes through git.
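
How the loader might resolve that mount, a sketch assuming only the env var named above (the repo-relative fallback is an assumption for runs outside Docker):

```python
import os
from pathlib import Path

# DATA_DICTIONARY_ROOT is set in the api container; the default here is
# an illustrative fallback for local runs.
ROOT = Path(os.environ.get("DATA_DICTIONARY_ROOT", "data-dictionary"))
TABLE_CATALOG = ROOT / "table_catalog.csv"
```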

## Workflow: seed l3_asurion into Databricks

`make seed-databricks-l3` invokes `backend/scripts/databricks_mock_data/l3_asurion_main.py`.

```mermaid
flowchart TB
    a["data-dictionary/table_catalog.csv<br/>+ column_dictionary.csv"]
    b["dictionary_loader.get_default()"]
    c["For each FQN in TABLES_TO_SEED:<br/>render CREATE OR REPLACE TABLE"]
    d["app.sql_gen.type_mapper.to_databricks_type<br/>(per column)"]
    e["l3_asurion_generators<br/>generate_product_catalog_rows (500)<br/>generate_claim_rows (5000)"]
    f["uploader.insert_rows<br/>(reused from v1 seeder)"]
    dbx[("workspace.l3_asurion.ev_product_catalog<br/>workspace.l3_asurion.ev_claim")]

    a --> b --> c
    c --> d
    c --> e
    d --> f
    e --> f
    f --> dbx
```

Currently seeded tables (in TABLES_TO_SEED):

- `l3_asurion.ev_product_catalog` (500 rows). The catalog is the parent; its rows are generated first so `ev_claim.product_catalog_id` can FK to a real value. Names are drawn from a fixed pool of ~20 device models so grouping queries return readable buckets.
- `l3_asurion.ev_claim` (5000 rows). The status mix is weighted (RESOLVED 55%, IN_PROGRESS 18%, PENDING_PARTS 7%, …) so the demo bar chart isn't pathologically uniform, and timestamps fall within the last 90 days so the time_window_start filter on the Databricks-routed metrics returns non-empty results. (A generator sketch follows this list.)
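
A hedged sketch of the weighted claim generator. The three named weights come from this doc; the remainder bucket, field names, and function shape are assumptions:

```python
import random
from datetime import datetime, timedelta, timezone

# RESOLVED/IN_PROGRESS/PENDING_PARTS weights are from this doc; "OTHER"
# stands in for the elided remaining statuses.
STATUSES = ["RESOLVED", "IN_PROGRESS", "PENDING_PARTS", "OTHER"]
WEIGHTS = [0.55, 0.18, 0.07, 0.20]

def generate_claim_rows(n: int = 5000) -> list[dict]:
    now = datetime.now(timezone.utc)
    rows = []
    for i in range(n):
        rows.append({
            "claim_id": i + 1,  # field names are illustrative
            "status": random.choices(STATUSES, weights=WEIGHTS)[0],
            # Timestamps within the last 90 days, per the bullet above.
            "created_at": now - timedelta(seconds=random.uniform(0, 90 * 86400)),
        })
    return rows
```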

`--reset` is idempotent: it drops and recreates both tables. The whole run takes ~35s against a Serverless Starter warehouse.

Adding a third seeded table is one line in TABLES_TO_SEED plus a generator function. The DDL writes itself.
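
What "one line plus a generator" might look like. The pairing of an FQN with its generator is an assumed shape; the module and the first two function names come from the flowchart above, and the commented third entry is hypothetical:

```python
from l3_asurion_generators import generate_claim_rows, generate_product_catalog_rows

# Hypothetical shape of TABLES_TO_SEED; only the first two entries exist today.
TABLES_TO_SEED = [
    ("workspace.l3_asurion.ev_product_catalog", generate_product_catalog_rows),
    ("workspace.l3_asurion.ev_claim", generate_claim_rows),
    # ("workspace.l3_asurion.ev_new_table", generate_new_table_rows),  # the "one line"
]
```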

## Workflow: validate the dictionary

`make validate-dictionary` invokes `backend/scripts/validate_dictionary.py`.

### Structural pass (always runs)

1. Every `column_dictionary.foreign_key_reference` resolves to a real `schema.table.column` triple in the dictionary itself (sketched after this list).
2. Every `kpi_catalog.base_table` exists in table_catalog.
3. Schemas listed in table_catalog but absent from column_dictionary print `WARN: schema sparse`.
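
A minimal sketch of check #1, assuming each column row is a dict with the key names used in this doc (the row shape is an assumption):

```python
def check_fk_targets(columns: list[dict]) -> list[str]:
    """Report foreign_key_reference values that don't resolve (sketch)."""
    known = {(c["schema"], c["table"], c["column"]) for c in columns}
    errors = []
    for c in columns:
        ref = c.get("foreign_key_reference")
        if not ref:
            continue
        if tuple(ref.split(".")) not in known:
            errors.append(f"{c['schema']}.{c['table']}.{c['column']} -> {ref} does not resolve")
    return errors
```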

Today: 0 errors, 0 warnings against the real dictionary.

### Live pass (runs when DATABRICKS_HOST is set)

For each FQN in `DATABRICKS_VALIDATE_TABLES` (default: `l3_asurion.ev_claim,l3_asurion.ev_product_catalog`):

1. Run `DESCRIBE TABLE` via `backend/app/databricks/client.py`.
2. Assert every dictionary-declared column is present.
3. Assert types are equivalent per `app.sql_gen.type_mapper.equivalent` (a comparison sketch follows this list).
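
A sketch of steps 2-3. Only `equivalent` is a real name from this doc; the name-to-type mapping shape is an assumption:

```python
from app.sql_gen.type_mapper import equivalent

def check_live_columns(fqn: str, declared: dict[str, str], described: dict[str, str]) -> list[str]:
    """Compare dictionary columns against DESCRIBE TABLE output (sketch).

    Both args map column name -> type string.
    """
    errors = []
    for name, want in declared.items():
        if name not in described:
            errors.append(f"{fqn}.{name} missing from warehouse")
        elif not equivalent(want, described[name]):
            errors.append(f"{fqn}.{name}: dictionary says {want}, warehouse says {described[name]}")
    return errors
```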

Tables described in the dictionary but not in the seeded subset print `WARN: described but not seeded`. That is not a failure; full Asurion EDW parity is a Phase-2 follow-up. Override the seeded subset via `DATABRICKS_VALIDATE_TABLES=schema.table,...`.

### Exit semantics

Exit 0 if the structural pass is green and the live pass is green or warn-only. Non-zero on any structural failure or on a live mismatch within the seeded subset.
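
The semantics reduce to a few lines; a sketch, with all names assumed:

```python
import sys

def finish(structural_errors: list[str], live_errors: list[str], warnings: list[str]) -> None:
    """Warn-only output keeps exit 0; any error in either pass is fatal (sketch)."""
    for w in warnings:
        print(f"WARN: {w}")
    for e in structural_errors + live_errors:
        print(f"ERROR: {e}")
    sys.exit(1 if (structural_errors or live_errors) else 0)
```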

If the live pass fails because the seeded subset isn't populated (a run without `make seed-databricks-l3`), the validator reports each error with "Did you run `make seed-databricks-l3` first?" rather than crashing; see the hardened error handler in `backend/scripts/validate_dictionary.py`.

## How the dictionary feeds the SQL generator

The plan in `docs/plans/active/part-c-databricks-prototype.md` (Prompt 3) wires this:

1. The data resolver receives a DataIntent from the frontend.
2. It loads `metrics_catalog.source_query` for the requested metric.
3. It calls `DataDictionary.subset_for_tables(["l3_asurion.ev_claim", ...])` to scope the dictionary to only the tables the metric touches.
4. It splices `subset.guidelines` (verbatim ai_query_guidelines.md) into the Bedrock system prompt so the generator gets the human-curated SQL rules (see the sketch after this list).
5. It runs the generated SQL through the safety layer (SELECT-only, allowlisted tables from `subset.tables`, allowlisted joins from `subset.joins`).
6. It executes against Databricks via `app.databricks.client`.
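
Steps 3-4 as a sketch. `subset_for_tables` and `guidelines` come from the plan; treating `tables` and `joins` as printable subset attributes is an assumption:

```python
from app.sql_gen import dictionary_loader

# Scope the dictionary to the tables the metric touches (step 3).
dictionary = dictionary_loader.get_default()
subset = dictionary.subset_for_tables(["l3_asurion.ev_claim"])

# Splice the human-curated rules into the system prompt (step 4).
system_prompt = (
    "You translate DataIntents into Databricks SQL.\n\n"
    f"{subset.guidelines}\n\n"
    f"Allowed tables: {subset.tables}\n"
    f"Allowed joins: {subset.joins}\n"
)
```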

The dictionary is the only context the generator sees about the warehouse. There is no vector DB lookup, no schema introspection, no embeddings. ADR-PROTO-004 deliberately chose this: the catalog is small enough (~1.6k columns) that an in-prompt slice is the right shape.

## Sparse-schema policy

A schema present in table_catalog.csv but absent from column_dictionary.csv is recorded as sparse. The loader warns; the validator surfaces it. Today no schema is sparse. `backend/tests/test_dictionary_loader.py::test_no_sparse_schemas_today` asserts the empty list, so a future revision that drops a schema's columns surfaces here as a test failure rather than a silent runtime warning.
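
The regression test is essentially one assertion; a sketch, with the `sparse_schemas` attribute name assumed:

```python
from app.sql_gen import dictionary_loader

def test_no_sparse_schemas_today():
    # `sparse_schemas` as the attribute name is an assumption.
    assert dictionary_loader.get_default().sparse_schemas == []
```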

## Type translation reference

Full table in sql-generator.md § Type mapper. The highlights:

- Pick the FIRST half of each `'a; b'` pair: `'string; varchar'` maps to `STRING`. The second half documents the source-system alias.
- `decimal(38,0)` is parametric; precision and scale are preserved.
- An unknown type raises ValueError, so dictionary drift fails the seeder/validator immediately. (A toy sketch follows this list.)
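
A toy re-implementation of the first-half rule and the fail-fast behavior. The real mapper is `app.sql_gen.type_mapper.to_databricks_type`; the mapping table here is a tiny assumed subset:

```python
# Assumed subset; the real mapper's table is larger (see sql-generator.md).
KNOWN = {"string": "STRING", "bigint": "BIGINT", "timestamp": "TIMESTAMP"}

def to_databricks_type(declared: str) -> str:
    # First half of the 'a; b' pair is authoritative; the second documents
    # the source-system alias.
    primary = declared.split(";")[0].strip().lower()
    if primary.startswith("decimal("):
        return primary.upper()  # parametric: precision/scale preserved
    try:
        return KNOWN[primary]
    except KeyError:
        # Fail fast so dictionary drift stops the seeder/validator.
        raise ValueError(f"Unknown dictionary type: {declared!r}")
```

So `to_databricks_type("string; varchar")` returns `"STRING"`, and a misspelled type aborts the run instead of seeding a wrong column.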

## Cross-references