Prompt 2 — Data Dictionary Loader (local CSVs) + l3_asurion seeding · plan v2¶
Status: active. Refined v2 of the original Prompt 2 spec from
prompts.mdlines 132-179. Differences from the prompts.md version: - Local CSVs, not S3 — ingest fromdata-dictionary/mounted into the api container.boto3not added. -l3_asurion.*is added alongsideasurion_prototype.*— health endpoint stays pointed at the v1 mirror; SQL gen routes to the new dictionary-shaped tables. - Three new metric names —claim_volume_l3_asurion,claims_by_product_l3_asurion,claim_status_mix_l3_asurion(replacingcost_avoided_mtd/issue_volume_by_region/repair_replace_conversionwhich don't map to the dictionary'sev_*schema).
Why a v2 plan was needed¶
A first pass at this work missed several non-trivial gaps. They're documented here so the next executing agent doesn't re-discover them mid-implementation.
| Gap | Where it would have bitten | Fix in v2 |
|---|---|---|
DDL was specified as ALTER TABLE ... ADD COLUMN IF NOT EXISTS against db/init.sql only |
Project convention is "edit CREATE TABLE; demo-reset to pick up" (see promote-metric-direction-to-catalog.md ddl_extend). Also app/metrics/catalog.py::_TABLE_DDL is a duplicate CREATE that runs via ensure_metrics_table in lifespan — touching only init.sql would leave dev DBs without the new cols. |
Edit BOTH the CREATE TABLE in db/init.sql AND _TABLE_DDL in catalog.py. Document the dual source. |
MetricEntity = Literal[...] extension to dotted names like 'l3_asurion.ev_claim' |
14+ usages in app/widgets/llm.py, spec_synthesizer.py, plus frontend types.ts:57, MetricInfoBadge.tsx:91, SpecJsonView.tsx:69. Dotted strings render badly in popover; frontend Literal mirror would need the same edit |
Keep entity bare. Add source_schema as new field. |
data_type parsing — dictionary uses 'string; varchar', 'decimal(38,0)', 'timestamp; timestamp(3)', 'bigint', 'int; integer' |
Without a mapper, the seeder's hand-written DDL might use the wrong type names; validate_dictionary.py's "type matches" check has no canonical answer |
Add backend/app/sql_gen/type_mapper.py — to_databricks_type(raw) + equivalent(a, b). Used by both seeder and validator. |
Existing 10 metrics' formula is "compact pseudo-SQL" per ADR-007, not executable |
If we set source_query to the same pseudo-SQL, the future SQL-gen templating would break |
Set source_table on the v1 metrics (e.g. 'issue_sessions'); leave source_query NULL. Only the 3 Databricks-routed metrics get parameterized executable SQL. |
ai_query_guidelines.md ignored — 4th file in data-dictionary/, contains SQL-generation rules ("Prefer claim-level KPIs from ev_claim", "use COUNT(DISTINCT claim_id)") |
These are gold for Prompt 3's Bedrock system prompt; if the loader doesn't expose them, Prompt 3 silently ignores the only piece of human-curated guidance the dictionary provides | DataDictionary.guidelines: str populated from this file at load time. |
verify-acceptance.sh line 73: assert len(d['kpi_strip']) == 5 |
Adding 3 metrics doesn't change kpi_strip (it's a layout doc, not catalog count); line 35's >= 10 becomes 13. ✓ |
No action — confirmed safe. |
l3_ubreakifix schema is sparse |
n/a today | Loader still records sparse_schemas; test_no_sparse_schemas_today asserts the empty list so a future regression surfaces. |
Type semicolon ambiguity ('string; varchar') — pick first or second? |
Inconsistent choice → drift between seeder DDL and dictionary | Type mapper picks the FIRST half (string), Databricks-canonical. The second half is documentation of the source-system alias. |
Two coexisting Databricks workspaces (asurion_prototype keeps the v1 mirror, l3_asurion gets dictionary-shaped tables) |
Health endpoint hits asurion_prototype.customers; SQL gen will hit l3_asurion.ev_claim. Validator must scope its DESCRIBE TABLE calls correctly. |
validate_dictionary.py reads DATABRICKS_CATALOG (not the schema env) and iterates l3_* schemas in the dictionary — separate code path from health. |
Architecture (after Prompt 2)¶
flowchart LR
subgraph repo[2026-hackathon repo]
dict["data-dictionary/<br/>4 CSVs + guidelines.md"]
configYaml["config/metric_routing.yaml"]
initSql["db/init.sql<br/>(canonical metrics_catalog DDL)"]
l3Ddl["backend/databricks_schema/l3_asurion/02_tables.sql"]
end
subgraph api[FastAPI api container]
loader["app.sql_gen.dictionary_loader<br/>(LRU-cached)"]
routing["app.sql_gen.routing<br/>(boot validator)"]
catalog[("Postgres<br/>metrics_catalog<br/>13 rows")]
lifespan["main.lifespan:<br/>seed → validate_routing → yield"]
end
subgraph dbx[Databricks workspace]
asuProto["workspace.asurion_prototype.*<br/>(v1 mirror, /databricks/health)"]
l3Asurion["workspace.l3_asurion.*<br/>(ev_claim 5000, ev_product_catalog 500)"]
end
seederL3["scripts.databricks_mock_data.l3_asurion_main"]
validator["scripts.validate_dictionary"]
dict -->|mount ro| loader
initSql -->|seeds| catalog
configYaml -->|loaded by| routing
catalog -->|cross-checked| routing
routing -->|passes| lifespan
seederL3 -->|reads| dict
seederL3 -->|writes| l3Asurion
l3Ddl -->|applied by| seederL3
validator -->|reads| dict
validator -->|DESCRIBE TABLE| l3Asurion
Files to create¶
| Path | Purpose |
|---|---|
backend/app/sql_gen/__init__.py |
Package marker. |
backend/app/sql_gen/data_dictionary.py |
Pydantic models — ColumnSchema, TableSchema, JoinKey, KpiDefinition, DataDictionary. subset_for_tables(["l3_asurion.ev_claim", ...]) for per-request prompt slicing. guidelines: str carries ai_query_guidelines.md for Prompt 3. |
backend/app/sql_gen/dictionary_loader.py |
stdlib-csv loader rooted at DATA_DICTIONARY_ROOT. LRU-cached on (root, mtime). Reads all 4 CSVs + ai_query_guidelines.md. Public: load_dictionary(root) + get_default(). |
backend/app/sql_gen/type_mapper.py |
Translates column_dictionary.data_type → Databricks DDL type. Picks first half of 'a; b'. equivalent(a, b) for validator's type-match check. |
backend/app/sql_gen/routing.py |
Loads config/metric_routing.yaml. validate_routing_against_catalog(conn) -> None raises on unmapped metric_id. Inverse check is WARN-only. |
config/metric_routing.yaml |
13 entries. Keyed by metric name (snake_case). Schema: routing.<name>.{backend, sql_template, cache_seconds}. |
backend/databricks_schema/l3_asurion/01_create_schema.sql |
CREATE SCHEMA IF NOT EXISTS workspace.l3_asurion. |
backend/databricks_schema/l3_asurion/02_tables.sql |
DDL for ev_claim (29 cols) + ev_product_catalog (23 cols), types via type_mapper. CREATE OR REPLACE TABLE. |
backend/scripts/databricks_mock_data/l3_asurion_generators.py |
Hand-written generators. Column names sourced from column_dictionary.csv. Hkey values are bigint-shaped fakes. Timestamps last 90d. product_catalog_id chosen from seeded ev_product_catalog. |
backend/scripts/databricks_mock_data/l3_asurion_main.py |
CLI mirror of main.py but for schema='l3_asurion'. |
backend/scripts/validate_dictionary.py |
Two-pass validator. |
backend/tests/test_dictionary_loader.py |
Pure-Python: 4 schemas, 17 tables, 140 joins, 43 KPIs; subset filtering; orphan FK detection; sparse schema handling. |
backend/tests/test_routing_validator.py |
Boot validator raises on unmapped, passes baseline, reverse-check WARNs. |
Files to modify¶
| Path | Change |
|---|---|
db/init.sql |
Edit CREATE TABLE metrics_catalog (lines 138-152) — add source_schema, source_table, source_query, last_validated_at, validation_status (default 'valid'), governance_status (default 'approved'), approved_by. NO ALTER TABLE. |
backend/app/metrics/catalog.py |
Update _TABLE_DDL constant (lines 18-34) to mirror db/init.sql exactly — dual source of truth. Update _row_to_catalog, list_metrics, get_metric, get_metric_by_name, create_metric, upsert_metric_from_definition, metric_definition_from_catalog. |
backend/app/metrics/schemas.py |
Add 'ev_claim', 'ev_product_catalog' to MetricEntity Literal (BARE names). Add optional lineage fields to MetricDefinition/CreateMetricBody/CatalogMetric. |
backend/app/metrics/seed.py |
Append the 3 new metrics. Add source_table to v1 metrics; leave source_query=None for them. |
backend/app/main.py |
Import + call validate_routing_against_catalog(conn) AFTER seed_metrics_if_empty. |
backend/requirements.txt |
+PyYAML. |
docker-compose.yml |
+ ./data-dictionary:/app/data-dictionary:ro mount; +DATA_DICTIONARY_ROOT env. |
backend/app/settings.py |
+ data_dictionary_root: str = "/app/data-dictionary". |
Makefile |
+ seed-databricks-l3 target. |
frontend/src/widgets/types.ts |
Mirror new MetricEntity literals + optional lineage fields. |
docs/plans/active/part-c-databricks-prototype.md |
Edit prompt_2_dictionary_loader todo. Add Build-discipline note. |
Files NOT touched¶
- The 4 CSVs +
ai_query_guidelines.mdindata-dictionary/— canonical, untouched. backend/databricks_schema/{01,02,03}*.sql— v1 mirror DDL stays.backend/scripts/databricks_mock_data/{generators,uploader,main}.py— v1 seeder stays.- v1 metrics seed semantics (formula / definition / label) — only adding lineage fields.
backend/app/widgets/llm.py,spec_synthesizer.py— entity literals stay bare.frontend/src/widgets/MetricInfoBadge.tsx,SpecJsonView.tsx— Prompt 5 territory.scripts/verify-acceptance.sh— assertions remain valid.
Acceptance gates (in order)¶
make demo-reset && make up— fresh Postgres pullsdb/init.sqlwith new lineage cols. Lifespan seeds 13 metrics. Routing validator passes.docker exec 2026-hackathon-api-1 env | grep DATA_DICTIONARY_ROOT— returns/app/data-dictionary.make seed-databricks-l3—workspace.l3_asurion.ev_claim5000 rows;ev_product_catalog500 rows. Idempotent with--reset.docker exec 2026-hackathon-api-1 python -m scripts.validate_dictionary— exit 0. Seeded subset green; unseeded tables WARN. Satisfies plan acceptance #3.docker exec 2026-hackathon-api-1 pytest backend/tests/test_dictionary_loader.py backend/tests/test_routing_validator.py -v— green.bash scripts/verify-acceptance.sh— still green (the>=10catalog assertion holds at 13).- Fail-loud manual —
psqlin, INSERT a metric_id with no routing entry, restart api, confirmRuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric']in logs.
Time-box¶
90 minutes. Cut order if overrun:
1. load_dictionary_local.py CLI.
2. Frontend types.ts widening (defer to Prompt 5).
3. Live pass of validate_dictionary.py (keep structural pass).
Non-negotiables: dictionary loader, routing YAML + validator wired into lifespan, the 3 seed metrics with lineage, l3_asurion seeder producing rows.
Execution log (session 2026-05-06)¶
Status: implementation complete; last gate (make demo-reset && make seed-databricks-l3 && make validate-dictionary) is the user's to run on a fresh stack. All code, config, tests, and DDL are in place. Pytest is green inside the running api container (44/44 passing — 19 new + 25 pre-existing).
What landed (verified in-container):
backend/app/sql_gen/{__init__.py, data_dictionary.py, dictionary_loader.py, type_mapper.py, routing.py}— all green, structural-pass validator finds 0 errors / 0 warnings against the real dictionary (52 tables, 1663 columns, 138 joins, 42 KPIs).db/init.sql+backend/app/metrics/catalog.py::_TABLE_DDL— dual-edit landed; lineage cols + CHECK constraints present in both.backend/app/metrics/{schemas.py, seed.py, catalog.py}—MetricEntitywidened withev_claim+ev_product_catalog(bare names); 13CreateMetricBodyentries (10 v1 + 3 Databricks); CRUD reads/writes lineage cols.config/metric_routing.yaml— 13 entries; loader resolves viaMETRIC_ROUTING_PATHenv or repo walk-up.backend/app/main.py— lifespan now callsvalidate_routing_against_catalog(conn)after seeders; container exits non-zero on unmapped metric_id.backend/scripts/databricks_mock_data/{l3_asurion_generators.py, l3_asurion_main.py}— DDL rendered on the fly from the dictionary viato_databricks_type; no static02_tables.sqlfile (eliminated drift surface).backend/scripts/validate_dictionary.py+make validate-dictionary— structural pass green; live pass surfaces clean ERRORs (with hint to runmake seed-databricks-l3) when seeded subset is absent.docker-compose.yml—./data-dictionary:/app/data-dictionary:roand./config:/app/config:romounts;DATA_DICTIONARY_ROOT+METRIC_ROUTING_PATHenv passthrough.backend/app/settings.py—data_dictionary_rootfield added.frontend/src/widgets/types.ts—MetricEntityextended; lineage fields optional + backward-compatible.Makefile—seed-databricks-l3andvalidate-dictionarytargets added;.PHONYupdated.backend/tests/{test_dictionary_loader.py, test_routing_validator.py}— 19 tests, all green.
Acceptance gates remaining (user-facing):
make demo-reset && make up— fresh Postgres pulls newinit.sqlcols; lifespan seeds 13 metrics; routing validator passes.make seed-databricks-l3— populatesworkspace.l3_asurion.{ev_product_catalog, ev_claim}.make validate-dictionary— exits 0 once gate 2 has run (today exits 1 because seeded subset is absent — by design).bash scripts/verify-acceptance.sh— should remain green; the catalog-count assertion holds at 13 ≥ 10.- Fail-loud manual —
psqlin,INSERT INTO metrics_catalog (name, label, definition, formula, entity) VALUES ('orphan_metric', ...), restart api; expectRuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric'].
Cross-references¶
prompts.mdPrompt 2 lines 132-179 — original spec.docs/plans/active/part-c-databricks-prototype.mdprompt_2_dictionary_loadertodo.docs/plans/active/promote-metric-direction-to-catalog.md— establishes dual-DDL-source convention.docs/plans/completed/databricks-mock-data-and-prompt-1.md— pattern for the new l3_asurion seeder.databricks-mock-data.md— seeder structural pattern.prd-v2.1.md§C.5.3 + ADR-PROTO-005 — boot-time routing validation; per-metric routing.backend/app/metrics/catalog.py_TABLE_DDL— dual source withdb/init.sql.