Data Dictionary¶
Authoritative for the data-dictionary/ folder, the seeders that consume it, and the validator that proves the dictionary still matches the live warehouse. Pair with sql-generator.md (the consumer) and ADR-PROTO-004 (in-prompt-context, not vector DB).
What's in data-dictionary/¶
Five files. Edit ONLY through pull requests reviewed by the data engineering owner.
| File | Schema | Used by |
|---|---|---|
| table_catalog.csv | One row per (schema, table): grain, row counts, primary/business/partition keys, generation order. | dictionary_loader._load_table_catalog |
| column_dictionary.csv | One row per (schema, table, column): type (the 'string; varchar' style, see sql-generator.md § Type mapper), role, FK reference, AI usage notes. | dictionary_loader._load_column_dictionary |
| join_map.csv | One row per approved join. The SQL generator MUST NOT invent joins outside this set (per ai_query_guidelines.md rule #1). | dictionary_loader._load_join_map |
| kpi_catalog.csv | Pre-curated KPI definitions with ready-to-template formula_sql. | dictionary_loader._load_kpi_catalog |
| ai_query_guidelines.md | Human-curated SQL generation rules (joins via join_map only, prefer claim-level KPIs from ev_claim, use COUNT(DISTINCT claim_id) for volume, etc.). | Surfaced as DataDictionary.guidelines for Prompt 3's Bedrock system prompt. |
Today the dictionary describes 4 schemas (l3_asurion, l3_att, l3_ubreakifix, l3_verizon), 52 tables, 1663 columns, 138 joins, 42 KPIs.
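A minimal sketch of loading and slicing the dictionary from Python. `get_default()`, `subset_for_tables()`, `.guidelines`, `.tables`, and `.joins` are named on this page; the printed output below is illustrative only.

```python
# Minimal sketch: load the dictionary and slice it down to one table.
# Names used here appear elsewhere on this page; output is illustrative.
from app.sql_gen import dictionary_loader

dictionary = dictionary_loader.get_default()   # reads the five files under DATA_DICTIONARY_ROOT

subset = dictionary.subset_for_tables(["l3_asurion.ev_claim"])
print(subset.tables)       # only the tables the caller asked for
print(subset.joins)        # only joins from join_map.csv touching those tables
print(subset.guidelines)   # verbatim ai_query_guidelines.md
```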
Source-of-truth discipline¶
The dictionary is the single source of truth for column shape, types, and approved joins. This is non-negotiable, and it is rule #6 in CLAUDE.md § Build discipline call-outs:
> Adding a metric or a table means adding a row in the CSVs, not hardcoding in Python.
Concretely:
- The l3_asurion seeder (backend/scripts/databricks_mock_data/l3_asurion_main.py) renders DDL on the fly from `column_dictionary.csv` via `app.sql_gen.type_mapper.to_databricks_type` (sketched below). There is no static `02_tables.sql` for l3_asurion. If a column in the dictionary disagrees with what the seeder would build, the seeder fails, so drift cannot survive.
- The validator (backend/scripts/validate_dictionary.py) checks the dictionary against itself (FK targets resolve, KPI base_tables exist) AND against the live warehouse (every dictionary column appears in `DESCRIBE TABLE` with an equivalent type).
- The SQL generator (Prompt 3, not yet in the code) will scope Bedrock prompts via `DataDictionary.subset_for_tables(...)` and reject any join outside `join_map.csv`.
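A sketch of the on-the-fly DDL rendering mentioned in the first bullet. `to_databricks_type` is named on this page; the row field names (`column_name`, `data_type`) and the helper itself are assumptions, not the seeder's actual code.

```python
# Illustrative only: render CREATE OR REPLACE TABLE from column_dictionary rows.
# The row field names and this helper are assumptions.
from app.sql_gen.type_mapper import to_databricks_type


def render_create_table(fqn: str, columns: list[dict]) -> str:
    """Build DDL for one table from its column_dictionary.csv rows."""
    col_defs = ",\n  ".join(
        f"{col['column_name']} {to_databricks_type(col['data_type'])}"
        for col in columns
    )
    return f"CREATE OR REPLACE TABLE {fqn} (\n  {col_defs}\n)"
```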
How the dictionary is mounted¶
docker-compose.yml (lines 79-87) mounts the folder into the api container read-only. DATA_DICTIONARY_ROOT=/app/data-dictionary is exported so backend/app/sql_gen/dictionary_loader.py can find it. The :ro flag means a routing typo or a CSV edit cannot be quietly patched inside the container instead of the repo; every change goes through git.
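A sketch of how the loader can resolve the mount point. Only the env var name comes from docker-compose.yml; the repo-relative fallback and the error message are assumptions.

```python
# Illustrative resolution of the dictionary root; fallback path and message are assumptions.
import os
from pathlib import Path

root = Path(os.environ.get("DATA_DICTIONARY_ROOT", "data-dictionary"))
if not (root / "table_catalog.csv").is_file():
    raise FileNotFoundError(f"data dictionary not found at {root}; check the compose mount")
```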
Workflow: seed l3_asurion into Databricks¶
make seed-databricks-l3 invokes backend/scripts/databricks_mock_data/l3_asurion_main.py.
flowchart TB
a["data-dictionary/table_catalog.csv<br/>+ column_dictionary.csv"]
b["dictionary_loader.get_default()"]
c["For each FQN in TABLES_TO_SEED:<br/>render CREATE OR REPLACE TABLE"]
d["app.sql_gen.type_mapper.to_databricks_type<br/>(per column)"]
e["l3_asurion_generators<br/>generate_product_catalog_rows (500)<br/>generate_claim_rows (5000)"]
f["uploader.insert_rows<br/>(reused from v1 seeder)"]
dbx[("workspace.l3_asurion.ev_product_catalog<br/>workspace.l3_asurion.ev_claim")]
a --> b --> c
c --> d
c --> e
d --> f
e --> f
f --> dbx
Currently seeded tables (in TABLES_TO_SEED):
- `l3_asurion.ev_product_catalog` (500 rows). The catalog is the parent; its rows are picked first so `ev_claim.product_catalog_id` can FK to a real value. Names are drawn from a fixed pool of ~20 device models so grouping queries return readable buckets.
- `l3_asurion.ev_claim` (5000 rows). Status mix is weighted (RESOLVED 55%, IN_PROGRESS 18%, PENDING_PARTS 7%, …) so the demo bar chart isn't pathologically uniform. Timestamps fall within the last 90 days so the `time_window_start` filter on the Databricks-routed metrics returns non-empty results.
--reset is idempotent: drops + recreates both tables. The whole run takes ~35s against a Serverless Starter warehouse.
Adding a third seeded table is one line in TABLES_TO_SEED plus a generator function. The DDL writes itself.
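A sketch of what that extension could look like. `TABLES_TO_SEED` and the two existing generators are named on this page; the `ev_repair_order` table, the dict shape, and the generator signature are hypothetical.

```python
# Hypothetical extension: ev_repair_order, this dict shape, and the generator
# signature are assumptions; only TABLES_TO_SEED itself is named on this page.
import random


def generate_repair_order_rows(count: int = 1000) -> list[dict]:
    """Rows for a hypothetical l3_asurion.ev_repair_order table."""
    return [
        {
            "repair_order_id": i,
            "claim_id": random.randint(1, 5000),          # FK into the seeded claims
            "status": random.choice(["OPEN", "CLOSED"]),
        }
        for i in range(1, count + 1)
    ]


# The one new line; DDL still comes from column_dictionary.csv, not from here.
TABLES_TO_SEED = {"l3_asurion.ev_repair_order": generate_repair_order_rows}
```

Per the source-of-truth rule above, the new table also needs its rows in table_catalog.csv and column_dictionary.csv before the seeder can render its DDL.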
Workflow: validate the dictionary¶
make validate-dictionary invokes backend/scripts/validate_dictionary.py.
Structural pass (always runs)¶
- Every `column_dictionary.foreign_key_reference` resolves to a real `schema.table.column` triple in the dictionary itself.
- Every `kpi_catalog.base_table` exists in `table_catalog`.
- Schemas listed in `table_catalog` but absent from `column_dictionary` print `WARN: schema sparse`.
Today: 0 errors, 0 warnings against the real dictionary.
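A sketch of the FK-resolution check; `foreign_key_reference` comes from this page, while the other field names and the function itself are assumptions.

```python
# Illustrative structural check; field names other than foreign_key_reference are assumptions.
def check_fk_references(columns: list[dict]) -> list[str]:
    """Return one error per FK reference that doesn't resolve inside the dictionary."""
    known = {f"{c['schema']}.{c['table']}.{c['column_name']}" for c in columns}
    errors = []
    for col in columns:
        ref = col.get("foreign_key_reference")
        if ref and ref not in known:
            errors.append(
                f"{col['schema']}.{col['table']}.{col['column_name']}: FK target {ref} not found"
            )
    return errors
```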
Live pass (runs when DATABRICKS_HOST is set)¶
For each FQN in DATABRICKS_VALIDATE_TABLES (default: l3_asurion.ev_claim,l3_asurion.ev_product_catalog):
- Run `DESCRIBE TABLE` via backend/app/databricks/client.py.
- Assert every dictionary-declared column is present.
- Assert types are equivalent per `app.sql_gen.type_mapper.equivalent`.
Tables described in the dictionary but not in the seeded subset print WARN: described but not seeded. Not a failure — full Asurion EDW parity is a Phase-2 follow-up. Override the seeded subset via DATABRICKS_VALIDATE_TABLES=schema.table,....
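A sketch of the live comparison. `app.sql_gen.type_mapper.equivalent` is named above; `client.describe_table()` and the result shape are assumptions.

```python
# Illustrative live check; describe_table() and its result shape are assumptions.
from app.databricks import client
from app.sql_gen.type_mapper import equivalent


def validate_live(fqn: str, declared: dict[str, str]) -> list[str]:
    """Compare dictionary-declared columns/types with DESCRIBE TABLE output."""
    live = {row["col_name"]: row["data_type"] for row in client.describe_table(fqn)}
    errors = []
    for name, dict_type in declared.items():
        if name not in live:
            errors.append(f"{fqn}.{name}: missing in warehouse")
        elif not equivalent(dict_type, live[name]):
            errors.append(f"{fqn}.{name}: dictionary says {dict_type}, warehouse says {live[name]}")
    return errors
```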
Exit semantics¶
Exit 0 if the structural pass is green and the live pass is green or warn-only. Non-zero on any structural failure or any live mismatch within the seeded subset.
If the live pass fails because the seeded subset isn't populated (run without make seed-databricks-l3), the validator reports each error with the hint "Did you run `make seed-databricks-l3` first?" rather than crashing; see the hardened error handler in backend/scripts/validate_dictionary.py.
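A sketch of the exit semantics described above; the function, the warning handling, and the missing-table marker are illustrative, not the script's actual code.

```python
# Illustrative exit handling; warnings never fail the run, errors do.
import sys


def finish(errors: list[str], warnings: list[str]) -> None:
    for warn in warnings:
        print(f"WARN: {warn}")
    for err in errors:
        print(f"ERROR: {err}")
        if "missing in warehouse" in err:                   # hypothetical marker
            print("  Did you run `make seed-databricks-l3` first?")
    sys.exit(1 if errors else 0)
```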
How the dictionary feeds the SQL generator¶
The plan in docs/plans/active/part-c-databricks-prototype.md (Prompt 3) wires this:
- The data resolver receives a DataIntent from the frontend.
- It loads `metrics_catalog.source_query` for the requested metric.
- It calls `DataDictionary.subset_for_tables(["l3_asurion.ev_claim", ...])` to scope the dictionary to only the tables the metric touches.
- It splices `subset.guidelines` (verbatim ai_query_guidelines.md) into the Bedrock system prompt so the generator gets the human-curated SQL rules (see the sketch after this list).
- It runs the generated SQL through the safety layer (SELECT-only, allowlisted tables from `subset.tables`, allowlisted joins from `subset.joins`).
- It executes against Databricks via `app.databricks.client`.
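A sketch of how the subset could be spliced into the system prompt once Prompt 3 lands. Nothing below exists yet; `subset_for_tables`, `.joins`, and `.guidelines` are named on this page, while the prompt layout and the helper are assumptions.

```python
# Planned wiring only; the prompt layout and this helper are assumptions.
from app.sql_gen import dictionary_loader


def build_system_prompt(base_tables: list[str]) -> str:
    subset = dictionary_loader.get_default().subset_for_tables(base_tables)
    return "\n\n".join([
        "You write Databricks SQL. SELECT statements only.",
        f"Allowed tables: {', '.join(base_tables)}",
        f"Approved joins: {subset.joins}",
        subset.guidelines,                      # verbatim ai_query_guidelines.md
    ])
```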
The dictionary is the only context the generator sees about the warehouse. There is no vector DB lookup, no schema introspection, no embeddings. ADR-PROTO-004 deliberately chose this: the catalog is small enough (~1.6k columns) that an in-prompt slice is the right shape.
Sparse-schema policy¶
A schema present in table_catalog.csv but absent from column_dictionary.csv is recorded as sparse. The loader warns; the validator surfaces it. Today no schema is sparse. backend/tests/test_dictionary_loader.py test_no_sparse_schemas_today asserts the empty list so a future revision dropping a schema's columns surfaces here as a test failure rather than a silent runtime warning.
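The test can be as small as the sketch below, assuming the loader exposes the sparse list as a `sparse_schemas` attribute (the attribute name is an assumption).

```python
# Illustrative shape of the test named above; sparse_schemas is an assumed attribute.
from app.sql_gen import dictionary_loader


def test_no_sparse_schemas_today():
    assert dictionary_loader.get_default().sparse_schemas == []
```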
Type translation reference¶
Full table in sql-generator.md § Type mapper. The highlights:

- Pick the FIRST half of `'a; b'`: `'string; varchar'` → `STRING`. The second half documents the source-system alias.
- `decimal(38,0)` is parametric; precision/scale are preserved.
- An unknown type raises `ValueError` so dictionary drift fails the seeder/validator immediately.
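An illustrative reading of those rules; this is not the real `to_databricks_type` body, and the mapping table here is truncated.

```python
# Illustrative only: not the real to_databricks_type, and the mapping is truncated.
def to_databricks_type(raw: str) -> str:
    first = raw.split(";")[0].strip().lower()        # 'string; varchar' -> 'string'
    if first.startswith("decimal("):
        return first.upper()                          # precision/scale preserved
    mapping = {"string": "STRING", "bigint": "BIGINT", "timestamp": "TIMESTAMP"}
    if first not in mapping:
        raise ValueError(f"Unknown dictionary type: {raw!r}")   # drift fails fast
    return mapping[first]
```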
Cross-references¶
- data-dictionary/ai_query_guidelines.md — the source for `DataDictionary.guidelines`.
- sql-generator.md — the consumer.
- ADR-PROTO-004 — why the dictionary is in prompt context, not a vector DB.
- docs/plans/active/prompt-2-dictionary-loader.md — execution log for the loader + seeder + validator.