# Data Dictionary

Authoritative reference for the `data-dictionary/` folder, the seeders that consume it, and the validator that proves the dictionary still matches the live warehouse. Pair with sql-generator.md (the consumer) and ADR-PROTO-004 (in-prompt context, not a vector DB).

## What's in data-dictionary/

Five files. Edit ONLY through pull requests reviewed by the data engineering owner.

| File | Schema | Used by |
| --- | --- | --- |
| `table_catalog.csv` | One row per (schema, table): grain, row counts, primary/business/partition keys, generation order. | `dictionary_loader._load_table_catalog` |
| `column_dictionary.csv` | One row per (schema, table, column): type (the `'string; varchar'` style; see sql-generator.md § Type mapper), role, FK reference, AI usage notes. | `dictionary_loader._load_column_dictionary` |
| `join_map.csv` | One row per approved join. The SQL generator MUST NOT invent joins outside this set (per ai_query_guidelines.md rule #1). | `dictionary_loader._load_join_map` |
| `kpi_catalog.csv` | Pre-curated KPI definitions with ready-to-template `formula_sql`. | `dictionary_loader._load_kpi_catalog` |
| `ai_query_guidelines.md` | Human-curated SQL generation rules: joins via join_map only, prefer claim-level KPIs from ev_claim, use `COUNT(DISTINCT claim_id)` for volume, etc. | Surfaced as `DataDictionary.guidelines` for Prompt 3's Bedrock system prompt. |

Today the dictionary describes 4 schemas (`l3_asurion`, `l3_att`, `l3_ubreakifix`, `l3_verizon`), 52 tables, 1,663 columns, 138 joins, and 42 KPIs.
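
A minimal consumption sketch. `get_default()`, `guidelines`, and `subset_for_tables` are named elsewhere in this doc; the `tables` and `joins` attributes are illustrative assumptions:

```python
from app.sql_gen import dictionary_loader

# Load all five files from DATA_DICTIONARY_ROOT.
dictionary = dictionary_loader.get_default()

# Attribute names below are assumptions for illustration.
print(len(dictionary.tables))       # 52 today
print(len(dictionary.joins))        # 138 today
print(dictionary.guidelines[:120])  # verbatim ai_query_guidelines.md
```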

## Source-of-truth discipline

The dictionary is the single source of truth for column shape, types, and approved joins. This is non-negotiable, and it is rule #6 in CLAUDE.md § Build discipline call-outs:

> Adding a metric or a table means adding a row in the CSVs, not hardcoding in Python.

Concretely:

- The l3_asurion seeder (`backend/scripts/databricks_mock_data/l3_asurion_main.py`) renders DDL on the fly from column_dictionary.csv via `app.sql_gen.type_mapper.to_databricks_type`; there is no static 02_tables.sql for l3_asurion. If a column in the dictionary disagrees with what the seeder would build, the seeder fails, so drift cannot survive. (A rendering sketch follows this list.)
- The validator (`backend/scripts/validate_dictionary.py`) checks the dictionary against itself (FK targets resolve, KPI base_tables exist) AND against the live warehouse (every dictionary column appears in DESCRIBE TABLE with an equivalent type).
- The SQL generator (Prompt 3, not yet in the code) will scope Bedrock prompts via `DataDictionary.subset_for_tables(...)` and reject any join outside join_map.csv.
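
A minimal sketch of the on-the-fly DDL rendering, assuming each dictionary row carries `column` and `type` keys (the row shape is an assumption; `to_databricks_type` is the real mapper named above):

```python
from app.sql_gen.type_mapper import to_databricks_type

def render_create_table(fqn: str, columns: list[dict]) -> str:
    """Render CREATE OR REPLACE TABLE DDL from column_dictionary rows (sketch)."""
    col_defs = ",\n  ".join(
        f"{row['column']} {to_databricks_type(row['type'])}" for row in columns
    )
    return f"CREATE OR REPLACE TABLE {fqn} (\n  {col_defs}\n)"
```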

## How the dictionary is mounted

docker-compose.yml lines 79-87 mount the folder read-only:

```yaml
volumes:
  - ./data-dictionary:/app/data-dictionary:ro
```

`DATA_DICTIONARY_ROOT=/app/data-dictionary` is exported to the api container so `backend/app/sql_gen/dictionary_loader.py` can find it. The `:ro` flag means a routing typo or a CSV edit cannot be quietly patched inside the container instead of the repo; every change goes through git.
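
How the loader might resolve that mount, a sketch assuming only the env var named above (the repo-relative fallback is an assumption for runs outside Docker):

```python
import os
from pathlib import Path

# DATA_DICTIONARY_ROOT is set in the api container; the default here is
# an illustrative fallback for local runs.
ROOT = Path(os.environ.get("DATA_DICTIONARY_ROOT", "data-dictionary"))
TABLE_CATALOG = ROOT / "table_catalog.csv"
```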

## Workflow: seed l3_asurion into Databricks

`make seed-databricks-l3` invokes `backend/scripts/databricks_mock_data/l3_asurion_main.py`.

```mermaid
flowchart TB
    a["data-dictionary/table_catalog.csv<br/>+ column_dictionary.csv"]
    b["dictionary_loader.get_default()"]
    c["For each FQN in TABLES_TO_SEED:<br/>render CREATE OR REPLACE TABLE"]
    d["app.sql_gen.type_mapper.to_databricks_type<br/>(per column)"]
    e["l3_asurion_generators<br/>generate_product_catalog_rows (500)<br/>generate_claim_rows (5000)"]
    f["uploader.insert_rows<br/>(reused from v1 seeder)"]
    dbx[("workspace.l3_asurion.ev_product_catalog<br/>workspace.l3_asurion.ev_claim")]

    a --> b --> c
    c --> d
    c --> e
    d --> f
    e --> f
    f --> dbx
```

Currently seeded tables (in TABLES_TO_SEED):

- `l3_asurion.ev_product_catalog` (500 rows). The catalog is the parent; its rows are generated first so `ev_claim.product_catalog_id` can FK to a real value. Names are drawn from a fixed pool of ~20 device models so grouping queries return readable buckets.
- `l3_asurion.ev_claim` (5000 rows). The status mix is weighted (RESOLVED 55%, IN_PROGRESS 18%, PENDING_PARTS 7%, …) so the demo bar chart isn't pathologically uniform, and timestamps fall within the last 90 days so the time_window_start filter on the Databricks-routed metrics returns non-empty results. (A generator sketch follows this list.)
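
A hedged sketch of the weighted claim generator. The three named weights come from this doc; the remainder bucket, field names, and function shape are assumptions:

```python
import random
from datetime import datetime, timedelta, timezone

# RESOLVED/IN_PROGRESS/PENDING_PARTS weights are from this doc; "OTHER"
# stands in for the elided remaining statuses.
STATUSES = ["RESOLVED", "IN_PROGRESS", "PENDING_PARTS", "OTHER"]
WEIGHTS = [0.55, 0.18, 0.07, 0.20]

def generate_claim_rows(n: int = 5000) -> list[dict]:
    now = datetime.now(timezone.utc)
    rows = []
    for i in range(n):
        rows.append({
            "claim_id": i + 1,  # field names are illustrative
            "status": random.choices(STATUSES, weights=WEIGHTS)[0],
            # Timestamps within the last 90 days, per the bullet above.
            "created_at": now - timedelta(seconds=random.uniform(0, 90 * 86400)),
        })
    return rows
```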

`--reset` is idempotent: it drops and recreates both tables. The whole run takes ~35s against a Serverless Starter warehouse.

Adding a third seeded table is one line in TABLES_TO_SEED plus a generator function. The DDL writes itself.
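
What "one line plus a generator" might look like. The pairing of an FQN with its generator is an assumed shape; the module and the first two function names come from the flowchart above, and the commented third entry is hypothetical:

```python
from l3_asurion_generators import generate_claim_rows, generate_product_catalog_rows

# Hypothetical shape of TABLES_TO_SEED; only the first two entries exist today.
TABLES_TO_SEED = [
    ("workspace.l3_asurion.ev_product_catalog", generate_product_catalog_rows),
    ("workspace.l3_asurion.ev_claim", generate_claim_rows),
    # ("workspace.l3_asurion.ev_new_table", generate_new_table_rows),  # the "one line"
]
```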

## Workflow: validate the dictionary

`make validate-dictionary` invokes `backend/scripts/validate_dictionary.py`.

### Structural pass (always runs)

1. Every `column_dictionary.foreign_key_reference` resolves to a real `schema.table.column` triple in the dictionary itself (sketched after this list).
2. Every `kpi_catalog.base_table` exists in table_catalog.
3. Schemas listed in table_catalog but absent from column_dictionary print `WARN: schema sparse`.
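
A minimal sketch of check #1, assuming each column row is a dict with the key names used in this doc (the row shape is an assumption):

```python
def check_fk_targets(columns: list[dict]) -> list[str]:
    """Report foreign_key_reference values that don't resolve (sketch)."""
    known = {(c["schema"], c["table"], c["column"]) for c in columns}
    errors = []
    for c in columns:
        ref = c.get("foreign_key_reference")
        if not ref:
            continue
        if tuple(ref.split(".")) not in known:
            errors.append(f"{c['schema']}.{c['table']}.{c['column']} -> {ref} does not resolve")
    return errors
```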

Today: 0 errors, 0 warnings against the real dictionary.

### Live pass (runs when DATABRICKS_HOST is set)

For each FQN in `DATABRICKS_VALIDATE_TABLES` (default: `l3_asurion.ev_claim,l3_asurion.ev_product_catalog`):

1. Run `DESCRIBE TABLE` via `backend/app/databricks/client.py`.
2. Assert every dictionary-declared column is present.
3. Assert types are equivalent per `app.sql_gen.type_mapper.equivalent` (a comparison sketch follows this list).
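
A sketch of steps 2-3. Only `equivalent` is a real name from this doc; the name-to-type mapping shape is an assumption:

```python
from app.sql_gen.type_mapper import equivalent

def check_live_columns(fqn: str, declared: dict[str, str], described: dict[str, str]) -> list[str]:
    """Compare dictionary columns against DESCRIBE TABLE output (sketch).

    Both args map column name -> type string.
    """
    errors = []
    for name, want in declared.items():
        if name not in described:
            errors.append(f"{fqn}.{name} missing from warehouse")
        elif not equivalent(want, described[name]):
            errors.append(f"{fqn}.{name}: dictionary says {want}, warehouse says {described[name]}")
    return errors
```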

Tables described in the dictionary but not in the seeded subset print `WARN: described but not seeded`. That is not a failure; full Asurion EDW parity is a Phase-2 follow-up. Override the seeded subset via `DATABRICKS_VALIDATE_TABLES=schema.table,...`.

### Exit semantics

Exit 0 if the structural pass is green and the live pass is green or warn-only. Non-zero on any structural failure or on a live mismatch within the seeded subset.
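
The semantics reduce to a few lines; a sketch, with all names assumed:

```python
import sys

def finish(structural_errors: list[str], live_errors: list[str], warnings: list[str]) -> None:
    """Warn-only output keeps exit 0; any error in either pass is fatal (sketch)."""
    for w in warnings:
        print(f"WARN: {w}")
    for e in structural_errors + live_errors:
        print(f"ERROR: {e}")
    sys.exit(1 if (structural_errors or live_errors) else 0)
```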

If the live pass fails because the seeded subset isn't populated (a run without `make seed-databricks-l3`), the validator reports each error with "Did you run `make seed-databricks-l3` first?" rather than crashing; see the hardened error handler in `backend/scripts/validate_dictionary.py`.

## How the dictionary feeds the SQL generator

The plan in `docs/plans/active/part-c-databricks-prototype.md` (Prompt 3) wires this:

1. The data resolver receives a DataIntent from the frontend.
2. It loads `metrics_catalog.source_query` for the requested metric.
3. It calls `DataDictionary.subset_for_tables(["l3_asurion.ev_claim", ...])` to scope the dictionary to only the tables the metric touches.
4. It splices `subset.guidelines` (verbatim ai_query_guidelines.md) into the Bedrock system prompt so the generator gets the human-curated SQL rules (see the sketch after this list).
5. It runs the generated SQL through the safety layer (SELECT-only, allowlisted tables from `subset.tables`, allowlisted joins from `subset.joins`).
6. It executes against Databricks via `app.databricks.client`.
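
Steps 3-4 as a sketch. `subset_for_tables` and `guidelines` come from the plan; treating `tables` and `joins` as printable subset attributes is an assumption:

```python
from app.sql_gen import dictionary_loader

# Scope the dictionary to the tables the metric touches (step 3).
dictionary = dictionary_loader.get_default()
subset = dictionary.subset_for_tables(["l3_asurion.ev_claim"])

# Splice the human-curated rules into the system prompt (step 4).
system_prompt = (
    "You translate DataIntents into Databricks SQL.\n\n"
    f"{subset.guidelines}\n\n"
    f"Allowed tables: {subset.tables}\n"
    f"Allowed joins: {subset.joins}\n"
)
```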

The dictionary is the only context the generator sees about the warehouse. There is no vector DB lookup, no schema introspection, no embeddings. ADR-PROTO-004 deliberately chose this: the catalog is small enough (~1.6k columns) that an in-prompt slice is the right shape.

## Sparse-schema policy

A schema present in table_catalog.csv but absent from column_dictionary.csv is recorded as sparse. The loader warns; the validator surfaces it. Today no schema is sparse. `backend/tests/test_dictionary_loader.py::test_no_sparse_schemas_today` asserts the empty list, so a future revision that drops a schema's columns surfaces here as a test failure rather than a silent runtime warning.
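
The regression test is essentially one assertion; a sketch, with the `sparse_schemas` attribute name assumed:

```python
from app.sql_gen import dictionary_loader

def test_no_sparse_schemas_today():
    # `sparse_schemas` as the attribute name is an assumption.
    assert dictionary_loader.get_default().sparse_schemas == []
```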

## Type translation reference

Full table in sql-generator.md § Type mapper. The highlights:

- Pick the FIRST half of each `'a; b'` pair: `'string; varchar'` maps to `STRING`. The second half documents the source-system alias.
- `decimal(38,0)` is parametric; precision and scale are preserved.
- An unknown type raises ValueError, so dictionary drift fails the seeder/validator immediately. (A toy sketch follows this list.)
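
A toy re-implementation of the first-half rule and the fail-fast behavior. The real mapper is `app.sql_gen.type_mapper.to_databricks_type`; the mapping table here is a tiny assumed subset:

```python
# Assumed subset; the real mapper's table is larger (see sql-generator.md).
KNOWN = {"string": "STRING", "bigint": "BIGINT", "timestamp": "TIMESTAMP"}

def to_databricks_type(declared: str) -> str:
    # First half of the 'a; b' pair is authoritative; the second documents
    # the source-system alias.
    primary = declared.split(";")[0].strip().lower()
    if primary.startswith("decimal("):
        return primary.upper()  # parametric: precision/scale preserved
    try:
        return KNOWN[primary]
    except KeyError:
        # Fail fast so dictionary drift stops the seeder/validator.
        raise ValueError(f"Unknown dictionary type: {declared!r}")
```

So `to_databricks_type("string; varchar")` returns `"STRING"`, and a misspelled type aborts the run instead of seeding a wrong column.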

## Cross-references