
# Prompt 2 — Data Dictionary Loader (local CSVs) + l3_asurion seeding · plan v2

Status: active. Refined v2 of the original Prompt 2 spec from prompts.md lines 132-179. Differences from the prompts.md version:

- Local CSVs, not S3 — ingest from data-dictionary/ mounted into the api container; boto3 is not added.
- l3_asurion.* is added alongside asurion_prototype.* — the health endpoint stays pointed at the v1 mirror; SQL gen routes to the new dictionary-shaped tables.
- Three new metric names — claim_volume_l3_asurion, claims_by_product_l3_asurion, claim_status_mix_l3_asurion — replacing cost_avoided_mtd / issue_volume_by_region / repair_replace_conversion, which don't map to the dictionary's ev_* schema.

## Why a v2 plan was needed

A first pass at this work missed several non-trivial gaps. They're documented here so the next executing agent doesn't re-discover them mid-implementation.

| Gap | Where it would have bitten | Fix in v2 |
| --- | --- | --- |
| DDL was specified as ALTER TABLE ... ADD COLUMN IF NOT EXISTS against db/init.sql only | Project convention is "edit CREATE TABLE; demo-reset to pick up" (see promote-metric-direction-to-catalog.md ddl_extend). Also app/metrics/catalog.py::_TABLE_DDL is a duplicate CREATE that runs via ensure_metrics_table in lifespan — touching only init.sql would leave dev DBs without the new cols | Edit BOTH the CREATE TABLE in db/init.sql AND _TABLE_DDL in catalog.py; document the dual source |
| MetricEntity = Literal[...] extension to dotted names like 'l3_asurion.ev_claim' | 14+ usages in app/widgets/llm.py and spec_synthesizer.py, plus frontend types.ts:57, MetricInfoBadge.tsx:91, SpecJsonView.tsx:69. Dotted strings render badly in the popover, and the frontend Literal mirror would need the same edit | Keep entity bare; add source_schema as a new field |
| data_type parsing — the dictionary uses 'string; varchar', 'decimal(38,0)', 'timestamp; timestamp(3)', 'bigint', 'int; integer' | Without a mapper, the seeder's hand-written DDL might use the wrong type names, and validate_dictionary.py's "type matches" check has no canonical answer | Add backend/app/sql_gen/type_mapper.py with to_databricks_type(raw) + equivalent(a, b); used by both seeder and validator |
| The existing 10 metrics' formula is "compact pseudo-SQL" per ADR-007, not executable | If we set source_query to the same pseudo-SQL, future SQL-gen templating would break | Set source_table on the v1 metrics (e.g. 'issue_sessions'); leave source_query NULL. Only the 3 Databricks-routed metrics get parameterized executable SQL |
| ai_query_guidelines.md ignored — the 4th file in data-dictionary/, containing SQL-generation rules ("Prefer claim-level KPIs from ev_claim", "use COUNT(DISTINCT claim_id)") | These are gold for Prompt 3's Bedrock system prompt; if the loader doesn't expose them, Prompt 3 silently ignores the only piece of human-curated guidance the dictionary provides | DataDictionary.guidelines: str, populated from this file at load time |
| verify-acceptance.sh line 73: assert len(d['kpi_strip']) == 5 | Adding 3 metrics doesn't change kpi_strip (it's a layout doc, not a catalog count); line 35's >= 10 check still holds when the count becomes 13 ✓ | No action — confirmed safe |
| l3_ubreakifix schema is sparse — resolved during execution: all 4 schemas (l3_asurion, l3_att, l3_ubreakifix, l3_verizon) are fully described in the current dictionary; sparse handling is dormant code that fires only if a future revision drops a schema's column rows | n/a today | Loader still records sparse_schemas; test_no_sparse_schemas_today asserts the empty list so a future regression surfaces |
| Type semicolon ambiguity ('string; varchar') — pick the first or the second half? | An inconsistent choice → drift between seeder DDL and the dictionary | Type mapper picks the FIRST half (string), which is Databricks-canonical; the second half documents the source-system alias |
| Two coexisting Databricks schemas (asurion_prototype keeps the v1 mirror; l3_asurion gets dictionary-shaped tables) | The health endpoint hits asurion_prototype.customers; SQL gen will hit l3_asurion.ev_claim. The validator must scope its DESCRIBE TABLE calls correctly | validate_dictionary.py reads DATABRICKS_CATALOG (not the schema env) and iterates the l3_* schemas in the dictionary — a separate code path from health |
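The first-half rule and the equivalence check described above can be sketched as follows. This is a minimal illustration: only the names to_databricks_type and equivalent come from the plan; the alias table and bodies are assumptions, not the shipped mapping.

```python
# Hypothetical sketch of backend/app/sql_gen/type_mapper.py.
# Assumed alias set: source-system names normalized to Databricks-canonical ones.
_ALIASES = {"integer": "int", "varchar": "string"}

def to_databricks_type(raw: str) -> str:
    """'string; varchar' -> 'string'; 'decimal(38,0)' passes through as-is."""
    first = raw.split(";")[0].strip().lower()  # FIRST half wins per the plan
    return _ALIASES.get(first, first)

def equivalent(a: str, b: str) -> bool:
    """Validator's type-match check: equal after normalization."""
    return to_databricks_type(a) == to_databricks_type(b)
```

Because both the seeder DDL and the validator call the same normalizer, 'int; integer' in the dictionary and an int column in DESCRIBE TABLE output compare equal.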

## Architecture (after Prompt 2)

```mermaid
flowchart LR
    subgraph repo[2026-hackathon repo]
        dict["data-dictionary/<br/>4 CSVs + guidelines.md"]
        configYaml["config/metric_routing.yaml"]
        initSql["db/init.sql<br/>(canonical metrics_catalog DDL)"]
        l3Ddl["backend/databricks_schema/l3_asurion/02_tables.sql"]
    end
    subgraph api[FastAPI api container]
        loader["app.sql_gen.dictionary_loader<br/>(LRU-cached)"]
        routing["app.sql_gen.routing<br/>(boot validator)"]
        catalog[("Postgres<br/>metrics_catalog<br/>13 rows")]
        lifespan["main.lifespan:<br/>seed → validate_routing → yield"]
    end
    subgraph dbx[Databricks workspace]
        asuProto["workspace.asurion_prototype.*<br/>(v1 mirror, /databricks/health)"]
        l3Asurion["workspace.l3_asurion.*<br/>(ev_claim 5000, ev_product_catalog 500)"]
    end
    seederL3["scripts.databricks_mock_data.l3_asurion_main"]
    validator["scripts.validate_dictionary"]

    dict -->|mount ro| loader
    initSql -->|seeds| catalog
    configYaml -->|loaded by| routing
    catalog -->|cross-checked| routing
    routing -->|passes| lifespan
    seederL3 -->|reads| dict
    seederL3 -->|writes| l3Asurion
    l3Ddl -->|applied by| seederL3
    validator -->|reads| dict
    validator -->|DESCRIBE TABLE| l3Asurion
```

## Files to create

| Path | Purpose |
| --- | --- |
| backend/app/sql_gen/__init__.py | Package marker |
| backend/app/sql_gen/data_dictionary.py | Pydantic models — ColumnSchema, TableSchema, JoinKey, KpiDefinition, DataDictionary. subset_for_tables(["l3_asurion.ev_claim", ...]) for per-request prompt slicing; guidelines: str carries ai_query_guidelines.md for Prompt 3 |
| backend/app/sql_gen/dictionary_loader.py | stdlib-csv loader rooted at DATA_DICTIONARY_ROOT, LRU-cached on (root, mtime). Reads all 4 CSVs + ai_query_guidelines.md. Public: load_dictionary(root) + get_default() |
| backend/app/sql_gen/type_mapper.py | Translates column_dictionary.data_type → Databricks DDL type; picks the first half of 'a; b'. equivalent(a, b) backs the validator's type-match check |
| backend/app/sql_gen/routing.py | Loads config/metric_routing.yaml. validate_routing_against_catalog(conn) -> None raises on unmapped metric_id; the inverse check is WARN-only |
| config/metric_routing.yaml | 13 entries, keyed by metric name (snake_case). Schema: routing.<name>.{backend, sql_template, cache_seconds} |
| backend/databricks_schema/l3_asurion/01_create_schema.sql | CREATE SCHEMA IF NOT EXISTS workspace.l3_asurion |
| backend/databricks_schema/l3_asurion/02_tables.sql | DDL for ev_claim (29 cols) + ev_product_catalog (23 cols), types via type_mapper; CREATE OR REPLACE TABLE |
| backend/scripts/databricks_mock_data/l3_asurion_generators.py | Hand-written generators. Column names sourced from column_dictionary.csv; Hkey values are bigint-shaped fakes; timestamps fall in the last 90 days; product_catalog_id is chosen from the seeded ev_product_catalog |
| backend/scripts/databricks_mock_data/l3_asurion_main.py | CLI mirror of main.py, but for schema='l3_asurion' |
| backend/scripts/validate_dictionary.py | Two-pass validator |
| backend/tests/test_dictionary_loader.py | Pure-Python: 4 schemas, 17 tables, 140 joins, 43 KPIs; subset filtering; orphan-FK detection; sparse-schema handling |
| backend/tests/test_routing_validator.py | Boot validator raises on unmapped, passes the baseline; reverse check WARNs |
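The (root, mtime) cache key described for dictionary_loader.py can be sketched like this. Only one CSV is parsed here, and the dataclass is a stand-in for the real Pydantic models; file names follow the plan.

```python
# Hypothetical sketch of the loader's caching strategy (dictionary_loader.py).
import csv
import os
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path

@dataclass(frozen=True)
class DataDictionary:
    columns: tuple        # parsed rows (only one of the 4 CSVs shown here)
    guidelines: str       # raw text of ai_query_guidelines.md, for Prompt 3

@lru_cache(maxsize=4)
def _load(root: str, mtime: float) -> DataDictionary:
    base = Path(root)
    with open(base / "column_dictionary.csv", newline="") as f:
        columns = tuple(dict(r) for r in csv.DictReader(f))
    return DataDictionary(columns, (base / "ai_query_guidelines.md").read_text())

def load_dictionary(root: str) -> DataDictionary:
    # mtime is part of the cache key, so an edited dictionary directory is
    # re-read without restarting the process.
    return _load(root, os.stat(root).st_mtime)
```

Repeated calls with an unchanged directory hit the cache and return the same object, which is what makes per-request prompt slicing cheap.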

## Files to modify

| Path | Change |
| --- | --- |
| db/init.sql | Edit CREATE TABLE metrics_catalog (lines 138-152) — add source_schema, source_table, source_query, last_validated_at, validation_status (default 'valid'), governance_status (default 'approved'), approved_by. NO ALTER TABLE |
| backend/app/metrics/catalog.py | Update the _TABLE_DDL constant (lines 18-34) to mirror db/init.sql exactly — dual source of truth. Update _row_to_catalog, list_metrics, get_metric, get_metric_by_name, create_metric, upsert_metric_from_definition, metric_definition_from_catalog |
| backend/app/metrics/schemas.py | Add 'ev_claim', 'ev_product_catalog' to the MetricEntity Literal (BARE names). Add optional lineage fields to MetricDefinition/CreateMetricBody/CatalogMetric |
| backend/app/metrics/seed.py | Append the 3 new metrics. Add source_table to the v1 metrics; leave source_query=None for them |
| backend/app/main.py | Import + call validate_routing_against_catalog(conn) AFTER seed_metrics_if_empty |
| backend/requirements.txt | + PyYAML |
| docker-compose.yml | + ./data-dictionary:/app/data-dictionary:ro mount; + DATA_DICTIONARY_ROOT env |
| backend/app/settings.py | + data_dictionary_root: str = "/app/data-dictionary" |
| Makefile | + seed-databricks-l3 target |
| frontend/src/widgets/types.ts | Mirror the new MetricEntity literals + optional lineage fields |
| docs/plans/active/part-c-databricks-prototype.md | Edit the prompt_2_dictionary_loader todo; add a build-discipline note |

## Files NOT touched

  • The 4 CSVs + ai_query_guidelines.md in data-dictionary/ — canonical, untouched.
  • backend/databricks_schema/{01,02,03}*.sql — v1 mirror DDL stays.
  • backend/scripts/databricks_mock_data/{generators,uploader,main}.py — v1 seeder stays.
  • v1 metrics seed semantics (formula / definition / label) — only adding lineage fields.
  • backend/app/widgets/llm.py, spec_synthesizer.py — entity literals stay bare.
  • frontend/src/widgets/MetricInfoBadge.tsx, SpecJsonView.tsx — Prompt 5 territory.
  • scripts/verify-acceptance.sh — assertions remain valid.

## Acceptance gates (in order)

  1. make demo-reset && make up — fresh Postgres pulls db/init.sql with new lineage cols. Lifespan seeds 13 metrics. Routing validator passes.
  2. docker exec 2026-hackathon-api-1 env | grep DATA_DICTIONARY_ROOT — returns /app/data-dictionary.
  3. make seed-databricks-l3 — workspace.l3_asurion.ev_claim 5000 rows; ev_product_catalog 500 rows. Idempotent with --reset.
  4. docker exec 2026-hackathon-api-1 python -m scripts.validate_dictionary — exit 0. Seeded subset green; unseeded tables WARN. Satisfies plan acceptance #3.
  5. docker exec 2026-hackathon-api-1 pytest backend/tests/test_dictionary_loader.py backend/tests/test_routing_validator.py -v — green.
  6. bash scripts/verify-acceptance.sh — still green (the >=10 catalog assertion holds at 13).
  7. Fail-loud manual check — psql in, INSERT a metric_id with no routing entry, restart api, confirm RuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric'] in the logs.
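The fail-loud contract in that last gate reduces to a set difference. A minimal sketch — the error string matches the plan, but the signature is simplified (the real validator takes a DB connection and the parsed routing YAML):

```python
# Sketch of validate_routing_against_catalog's two checks:
# unmapped catalog metric -> hard RuntimeError; the inverse is WARN-only.
import logging

log = logging.getLogger("routing")

def validate_routing(catalog_names: set, routing: dict) -> None:
    unmapped = sorted(catalog_names - routing.keys())
    if unmapped:
        raise RuntimeError(f"Unmapped metric_ids in metrics_catalog: {unmapped}")
    stale = sorted(routing.keys() - catalog_names)
    if stale:  # routing entries with no catalog row are tolerated
        log.warning("Routing entries without catalog rows: %s", stale)
```

The asymmetry is deliberate: a catalog metric with no route would fail at query time in front of a user, while a stale routing entry is merely dead config.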

## Time-box

90 minutes. Cut order if overrun:

1. load_dictionary_local.py CLI.
2. Frontend types.ts widening (defer to Prompt 5).
3. Live pass of validate_dictionary.py (keep the structural pass).

Non-negotiables: dictionary loader, routing YAML + validator wired into lifespan, the 3 seed metrics with lineage, l3_asurion seeder producing rows.

## Execution log (session 2026-05-06)

Status: implementation complete; last gate (make demo-reset && make seed-databricks-l3 && make validate-dictionary) is the user's to run on a fresh stack. All code, config, tests, and DDL are in place. Pytest is green inside the running api container (44/44 passing — 19 new + 25 pre-existing).

What landed (verified in-container):

  • backend/app/sql_gen/{__init__.py, data_dictionary.py, dictionary_loader.py, type_mapper.py, routing.py} — all green, structural-pass validator finds 0 errors / 0 warnings against the real dictionary (52 tables, 1663 columns, 138 joins, 42 KPIs).
  • db/init.sql + backend/app/metrics/catalog.py::_TABLE_DDL — dual-edit landed; lineage cols + CHECK constraints present in both.
  • backend/app/metrics/{schemas.py, seed.py, catalog.py} — MetricEntity widened with ev_claim + ev_product_catalog (bare names); 13 CreateMetricBody entries (10 v1 + 3 Databricks); CRUD reads/writes lineage cols.
  • config/metric_routing.yaml — 13 entries; loader resolves via METRIC_ROUTING_PATH env or repo walk-up.
  • backend/app/main.py — lifespan now calls validate_routing_against_catalog(conn) after seeders; container exits non-zero on unmapped metric_id.
  • backend/scripts/databricks_mock_data/{l3_asurion_generators.py, l3_asurion_main.py} — DDL rendered on the fly from the dictionary via to_databricks_type; no static 02_tables.sql file (eliminated drift surface).
  • backend/scripts/validate_dictionary.py + make validate-dictionary — structural pass green; live pass surfaces clean ERRORs (with hint to run make seed-databricks-l3) when seeded subset is absent.
  • docker-compose.yml — ./data-dictionary:/app/data-dictionary:ro and ./config:/app/config:ro mounts; DATA_DICTIONARY_ROOT + METRIC_ROUTING_PATH env passthrough.
  • backend/app/settings.py — data_dictionary_root field added.
  • frontend/src/widgets/types.ts — MetricEntity extended; lineage fields optional + backward-compatible.
  • Makefile — seed-databricks-l3 and validate-dictionary targets added; .PHONY updated.
  • backend/tests/{test_dictionary_loader.py, test_routing_validator.py} — 19 tests, all green.
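The on-the-fly DDL rendering noted above (replacing a static 02_tables.sql) can be sketched like this. The column-row shape is an assumption, and the inlined type helper is a stand-in for type_mapper.to_databricks_type:

```python
# Sketch: render CREATE OR REPLACE TABLE DDL straight from dictionary rows,
# so the seeder's DDL can never drift from column_dictionary.csv.
def to_databricks_type(raw: str) -> str:
    # stand-in for type_mapper.to_databricks_type: first half of 'a; b' wins
    return raw.split(";")[0].strip().lower()

def render_create_table(schema: str, table: str, cols: list) -> str:
    body = ",\n  ".join(
        f"{c['column']} {to_databricks_type(c['data_type'])}" for c in cols
    )
    return f"CREATE OR REPLACE TABLE workspace.{schema}.{table} (\n  {body}\n)"
```

Deleting the static SQL file and rendering at seed time is what the log calls "eliminated drift surface": the dictionary is the only source of column names and types.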

Acceptance gates remaining (user-facing):

  1. make demo-reset && make up — fresh Postgres pulls new init.sql cols; lifespan seeds 13 metrics; routing validator passes.
  2. make seed-databricks-l3 — populates workspace.l3_asurion.{ev_product_catalog, ev_claim}.
  3. make validate-dictionary — exits 0 once gate 2 has run (today exits 1 because seeded subset is absent — by design).
  4. bash scripts/verify-acceptance.sh — should remain green; the catalog-count assertion holds at 13 ≥ 10.
  5. Fail-loud manual check — psql in, INSERT INTO metrics_catalog (name, label, definition, formula, entity) VALUES ('orphan_metric', ...), restart api; expect RuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric'].

## Cross-references