# Data Dictionary & Ingestion Architecture

Authoritative source: `data-dictionary/`, `backend/app/sql_gen/dictionary_loader.py`
Four CSV files define every table, column, join, and KPI in the Asurion EDW. Those same files create the Databricks warehouse tables, prove they match, and scope every LLM prompt at query time. The dictionary is the only view the SQL generator has of the warehouse — there is no vector DB, no schema introspection, no embeddings (ADR-PROTO-004).
## Why Command Center does this
Generic LLM-to-SQL tools let the model hallucinate joins and reference columns that don't exist. Command Center anchors every generated query to a curated dictionary: the LLM sees only approved joins from join_map.csv, only validated columns from column_dictionary.csv, and only pre-curated KPI formulas from kpi_catalog.csv. The full dictionary fits in Claude's context window, so there's no retrieval step to tune or debug — per ADR-PROTO-004: "Load the entire data dictionary into a Python module, cache in memory at boot, inject the relevant subset into the Bedrock prompt context per request."
## The full picture
```mermaid
flowchart LR
    subgraph dict["Data Dictionary (4 CSVs + guidelines)"]
        tc["table_catalog.csv\n52 tables"]
        cd["column_dictionary.csv\n1,664 columns"]
        jm["join_map.csv\n138 approved joins"]
        kc["kpi_catalog.csv\n42 KPIs"]
        guide["ai_query_guidelines.md"]
    end
    loader["dictionary_loader.py\nLRU-cached DataDictionary"]
    subgraph seed["Seeding (make seed-databricks-l3)"]
        tm["type_mapper.py\nstring;varchar → STRING"]
        ddl["render CREATE TABLE\n(on-the-fly from CSVs)"]
        gen["l3_asurion_generators.py\n500 product rows\n5,000 claim rows"]
    end
    dbx[("Databricks SQL Warehouse\nworkspace.l3_asurion")]
    subgraph val["Validation (make validate-dictionary)"]
        s1["Structural pass\nFK resolution\nKPI base tables"]
        s2["Live pass\nDESCRIBE TABLE\ntype equivalence"]
    end
    subgraph runtime["Runtime (POST /v1/widgets/sql/generate)"]
        sub["subset_for_tables()\nscoped slice"]
        prompt["Bedrock prompt\nmetric + subset + guidelines"]
        safety["safety.py\nSELECT-only, allowlist,\nLIMIT injection"]
    end
    dict --> loader
    loader --> tm --> ddl --> dbx
    loader --> gen --> dbx
    dbx --> s2
    dict --> s1
    loader --> sub --> prompt --> safety --> dbx
    style dict fill:#f0f4ff,stroke:#4a6fa5
    style seed fill:#fff4e6,stroke:#c47f17
    style val fill:#e8f5e9,stroke:#2e7d32
    style runtime fill:#fce4ec,stroke:#c62828
```
The dictionary participates in three distinct phases — each described below.
## The dictionary: 4 files, 1 truth
```mermaid
erDiagram
    table_catalog ||--|{ column_dictionary : "schema + table_name"
    table_catalog ||--o{ join_map : "from_table / to_table"
    table_catalog ||--o{ kpi_catalog : "base_table + schema_scope"
    column_dictionary ||--o{ join_map : "from_column / to_column"
    table_catalog {
        string schema "l3_asurion, l3_att, l3_verizon, l3_ubreakifix"
        string table_name "ev_claim, ev_product_catalog, ..."
        string grain "one row per claim"
        int row_count "5000"
        string primary_key_columns "claim_hkey"
        string client_isolation_rule "Do not join across schemas"
    }
    column_dictionary {
        string column_name "cost_avoided_usd"
        string data_type "decimal(15,2)"
        string column_role "metric"
        string foreign_key_reference "l3_asurion.ev_claim_key.claim_id"
        string ai_usage_notes "Good candidate for SUM aggregations"
    }
    join_map {
        string from_table "ev_claim"
        string to_table "ev_claim_key"
        string relationship_type "many_to_one"
        string confidence "high"
        string ai_join_rule "Join only within same schema"
    }
    kpi_catalog {
        string kpi_name "Claim Volume"
        string formula_sql "COUNT(DISTINCT claim_id)"
        string date_column "claim_started_dttm_utc"
        string default_grain "month"
    }
```
ai_query_guidelines.md is the fifth file — a prose document surfaced verbatim into the Bedrock system prompt as DataDictionary.guidelines. It enforces: joins via join_map.csv only, per-schema data isolation, COUNT(DISTINCT claim_id) for volume KPIs, and aggregated chart-ready output (no raw row dumps).
dictionary_loader.py reads all five files with stdlib csv.DictReader, builds a DataDictionary Pydantic model (data_dictionary.py), and LRU-caches the result on (root, mtime_signature). Edits to the CSVs are picked up on the next call without a container restart.
## From dictionary to warehouse
```mermaid
flowchart TB
    csv["table_catalog.csv + column_dictionary.csv\n52 tables, 1,664 columns"]
    loader["dictionary_loader.get_default()"]
    loop["For each FQN in TABLES_TO_SEED"]
    subgraph ddl_path["DDL Generation"]
        mapper["type_mapper.to_databricks_type()\nstring;varchar → STRING\ndecimal(15,2) → DECIMAL(15,2)\ntimestamp;timestamp(3) → TIMESTAMP"]
        create["render_create_table()\nCREATE OR REPLACE TABLE\nfrom dictionary columns"]
    end
    subgraph data_path["Data Generation"]
        gen_prod["generate_product_catalog_rows()\n~500 rows, 20 device models"]
        gen_claim["generate_claim_rows()\n~5,000 rows, FK to products\nweighted status mix:\nRESOLVED 55% | IN_PROGRESS 18%\nPENDING_PARTS 7% | ..."]
    end
    upload["uploader.insert_rows()"]
    dbx_tables[("workspace.l3_asurion\n.ev_product_catalog (500 rows)\n.ev_claim (5,000 rows)")]
    csv --> loader --> loop
    loop --> mapper --> create --> upload
    loop --> gen_prod --> upload
    loop --> gen_claim --> upload
    upload --> dbx_tables
    style ddl_path fill:#fff4e6,stroke:#c47f17
    style data_path fill:#e3f2fd,stroke:#1565c0
```
make seed-databricks-l3 runs l3_asurion_main.py. There is no static SQL file for table definitions — DDL is rendered on the fly from the dictionary via type_mapper.py. If a column type in the CSV doesn't map to a Databricks type, to_databricks_type() raises ValueError and the seed fails immediately. The --reset flag drops and recreates both tables; the whole run takes ~35s against a Serverless Starter warehouse.
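The fail-fast mapping and on-the-fly DDL rendering can be sketched like this. It is a hypothetical reconstruction of the `type_mapper.py` contract, not the actual implementation: the `_SIMPLE` table and the `';'`-separated-variant handling are assumptions inferred from the examples above.

```python
import re

# Assumed base-type table; the real mapper lives in type_mapper.py.
_SIMPLE = {
    "string": "STRING", "varchar": "STRING", "int": "INT",
    "bigint": "BIGINT", "double": "DOUBLE", "boolean": "BOOLEAN",
    "date": "DATE", "timestamp": "TIMESTAMP",
}


def to_databricks_type(dictionary_type: str) -> str:
    # Dictionary types may list variants separated by ';' ("string;varchar");
    # the first recognisable variant wins.
    for variant in dictionary_type.lower().split(";"):
        variant = variant.strip()
        m = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", variant)
        if m:
            return f"DECIMAL({m.group(1)},{m.group(2)})"
        base = re.sub(r"\(.*\)$", "", variant)  # timestamp(3) -> timestamp
        if base in _SIMPLE:
            return _SIMPLE[base]
    # Unmappable type: raise so the seed fails immediately.
    raise ValueError(f"no Databricks mapping for type {dictionary_type!r}")


def render_create_table(fqn: str, columns: list[dict]) -> str:
    """Render CREATE OR REPLACE TABLE directly from dictionary columns."""
    cols = ",\n  ".join(
        f"{c['column_name']} {to_databricks_type(c['data_type'])}" for c in columns
    )
    return f"CREATE OR REPLACE TABLE {fqn} (\n  {cols}\n)"
```

Raising instead of defaulting to STRING keeps the dictionary honest: a typo in a CSV type breaks the seed loudly rather than silently producing a mistyped warehouse column.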
## Proving the dictionary matches reality
```mermaid
flowchart TB
    subgraph pass1["Pass 1: Structural (always runs, offline)"]
        fk["FK resolution\nevery foreign_key_reference\nresolves to a real column"]
        kpi["KPI base tables\nevery kpi_catalog.base_table\nexists in table_catalog"]
        sparse["Sparse schema detection\nschemas in table_catalog\nmissing from column_dictionary"]
    end
    subgraph pass2["Pass 2: Live (when DATABRICKS_HOST is set)"]
        desc["DESCRIBE TABLE\nfor each seeded FQN"]
        col_check["Column presence\nevery dictionary column\nmust appear in DESCRIBE"]
        type_check["Type equivalence\ntype_mapper.equivalent()\ndictionary type == warehouse type"]
    end
    dict["data-dictionary/*.csv"] --> pass1
    dbx[("Databricks\nworkspace.l3_asurion")] --> desc
    desc --> col_check --> type_check
    result{{"Exit 0: all green\nExit non-zero: structural failure\nor live type mismatch"}}
    pass1 --> result
    type_check --> result
    style pass1 fill:#e8f5e9,stroke:#2e7d32
    style pass2 fill:#fff3e0,stroke:#e65100
```
make validate-dictionary runs validate_dictionary.py. The structural pass works entirely from the CSV files — no Databricks connection needed. The live pass runs DESCRIBE TABLE against each seeded table and asserts every dictionary-declared column is present with an equivalent type (via type_mapper.equivalent()). Tables in the dictionary but outside the seeded subset produce a WARN, not an error — full EDW parity is Phase 2.
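The FK-resolution part of the structural pass can be sketched in a few lines; this is an assumed shape, not the real `validate_dictionary.py` code, and `check_fk_resolution` is a hypothetical name.

```python
def check_fk_resolution(columns: list[dict]) -> list[str]:
    """Every non-empty foreign_key_reference must name a column the
    dictionary itself declares -- no warehouse connection required."""
    known = {
        f"{c['schema']}.{c['table_name']}.{c['column_name']}" for c in columns
    }
    errors = []
    for c in columns:
        ref = c.get("foreign_key_reference", "")
        if ref and ref not in known:
            errors.append(
                f"{c['table_name']}.{c['column_name']}: unresolved FK -> {ref}"
            )
    return errors
```

Because the check is a pure set lookup over the CSVs, it runs in CI without credentials; only the live pass needs a warehouse.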
## At query time: scoped context, not the whole catalog
```mermaid
sequenceDiagram
    participant FE as Frontend
    participant API as Data Resolver
    participant Route as metric_routing.yaml
    participant Dict as DataDictionary
    participant LLM as Bedrock (Sonnet 4.6)
    participant Safe as safety.py
    participant DBX as Databricks
    FE->>API: POST /v1/widgets/{id}/data
    API->>Route: lookup metric_name
    Route-->>API: backend: databricks, cache: 300s
    API->>Dict: subset_for_tables(["l3_asurion.ev_claim"])
    Dict-->>API: scoped tables + joins + KPIs + guidelines
    API->>LLM: system prompt (subset + guidelines)<br/>+ metric + few-shot examples
    LLM-->>API: {sql, tables_used, explanation}
    API->>Safe: validate(sql, allowlist)
    Note over Safe: SELECT-only check<br/>table allowlist check<br/>LIMIT injection/cap<br/>no DDL/DML
    Safe-->>API: SafetyResult(is_safe=true, modified_sql)
    API->>DBX: execute(modified_sql)
    DBX-->>API: rows + columns
    API-->>FE: {data, source: "bedrock", generated_sql, ...}
```
The 1,664-column dictionary never hits the LLM as a whole. DataDictionary.subset_for_tables() (data_dictionary.py) returns only the tables, joins, and KPIs relevant to the requested metric. prompts.py splices this subset plus verbatim ai_query_guidelines.md into the Bedrock system prompt. After generation, safety.py re-derives the table set from the SQL AST (via sqlglot, dialect databricks) — it does not trust the LLM's tables_used claim.
## Components
| Component | File | Role |
|---|---|---|
| Dictionary loader | `backend/app/sql_gen/dictionary_loader.py` | Parses CSVs into `DataDictionary`, LRU-cached |
| Data dictionary model | `backend/app/sql_gen/data_dictionary.py` | `DataDictionary`, `TableSchema`, `ColumnSchema`, `JoinKey`, `KpiDefinition` |
| Type mapper | `backend/app/sql_gen/type_mapper.py` | `to_databricks_type()`, `equivalent()` |
| Seeder | `backend/scripts/databricks_mock_data/l3_asurion_main.py` | Renders DDL from dictionary, generates rows, uploads |
| Generators | `backend/scripts/databricks_mock_data/l3_asurion_generators.py` | Deterministic synthetic data (product catalog, claims) |
| Validator | `backend/scripts/validate_dictionary.py` | Two-pass: structural + live warehouse check |
| SQL generator | `backend/app/sql_gen/generator.py` | Orchestrates metric resolution, prompt build, LLM call, safety, execution |
| Safety layer | `backend/app/sql_gen/safety.py` | 5-check validation via sqlglot (SELECT-only, table allowlist, LIMIT, no DDL/DML) |
| Routing config | `config/metric_routing.yaml` | Maps each metric to a postgres or databricks backend |
| Databricks client | `backend/app/databricks/client.py` | Pooled SQL Warehouse client with typed exceptions |
| Data resolver | `backend/app/widgets/data_resolver.py` | Dispatches to postgres or databricks based on routing |
## Interface contracts
| Boundary | Schema | File |
|---|---|---|
| CSV → Python | `DataDictionary` (Pydantic) | `data_dictionary.py` |
| Python → Databricks DDL | `to_databricks_type()` return | `type_mapper.py` |
| LLM → Safety | `{sql, tables_used, explanation}` tool output | `generator.py` |
| Safety → Execution | `SafetyResult` with `modified_sql` | `safety.py` |
| Resolver → Frontend | `DataResolverResponse` | `data_resolver.py` |
## Out of scope
- Metrics catalog and boot validation — covered in `data-dictionary.md` and `sql-generator.md`.
- Cache layer — `cache.py` handles Redis TTL caching; not part of the dictionary flow.
- Frontend rendering — `SourceBadge`, `MetricInfoBadge`, and the Generated SQL tab consume the response but don't interact with the dictionary.
- Cross-schema queries — the dictionary describes 4 schemas but only `l3_asurion` is seeded today; cross-schema UNION queries are Phase 2.