
# Prompt 2 — Data Dictionary Loader (local CSVs) + l3_asurion seeding · plan v2

Status: active. Refined v2 of the original Prompt 2 spec from prompts.md lines 132-179. Differences from the prompts.md version:

- Local CSVs, not S3 — ingest from data-dictionary/ mounted into the api container; boto3 is not added.
- l3_asurion.* is added alongside asurion_prototype.* — the health endpoint stays pointed at the v1 mirror; SQL gen routes to the new dictionary-shaped tables.
- Three new metric names — claim_volume_l3_asurion, claims_by_product_l3_asurion, claim_status_mix_l3_asurion — replacing cost_avoided_mtd / issue_volume_by_region / repair_replace_conversion, which don't map to the dictionary's ev_* schema.

## Why a v2 plan was needed

A first pass at this work missed several non-trivial gaps. They're documented here so the next executing agent doesn't re-discover them mid-implementation.

| Gap | Where it would have bitten | Fix in v2 |
| --- | --- | --- |
| DDL was specified as ALTER TABLE ... ADD COLUMN IF NOT EXISTS against db/init.sql only | Project convention is "edit CREATE TABLE; demo-reset to pick up" (see promote-metric-direction-to-catalog.md ddl_extend). Also app/metrics/catalog.py::_TABLE_DDL is a duplicate CREATE that runs via ensure_metrics_table in lifespan — touching only init.sql would leave dev DBs without the new cols | Edit BOTH the CREATE TABLE in db/init.sql AND _TABLE_DDL in catalog.py; document the dual source |
| MetricEntity = Literal[...] extension to dotted names like 'l3_asurion.ev_claim' | 14+ usages in app/widgets/llm.py and spec_synthesizer.py, plus frontend types.ts:57, MetricInfoBadge.tsx:91, SpecJsonView.tsx:69. Dotted strings render badly in the popover, and the frontend Literal mirror would need the same edit | Keep entity bare; add source_schema as a new field |
| data_type parsing — the dictionary uses 'string; varchar', 'decimal(38,0)', 'timestamp; timestamp(3)', 'bigint', 'int; integer' | Without a mapper, the seeder's hand-written DDL might use the wrong type names, and validate_dictionary.py's "type matches" check has no canonical answer | Add backend/app/sql_gen/type_mapper.py with to_databricks_type(raw) + equivalent(a, b); used by both seeder and validator |
| The existing 10 metrics' formula is "compact pseudo-SQL" per ADR-007, not executable | If we set source_query to the same pseudo-SQL, future SQL-gen templating would break | Set source_table on the v1 metrics (e.g. 'issue_sessions'); leave source_query NULL. Only the 3 Databricks-routed metrics get parameterized executable SQL |
| ai_query_guidelines.md ignored — the 4th file in data-dictionary/, containing SQL-generation rules ("Prefer claim-level KPIs from ev_claim", "use COUNT(DISTINCT claim_id)") | These are gold for Prompt 3's Bedrock system prompt; if the loader doesn't expose them, Prompt 3 silently ignores the only piece of human-curated guidance the dictionary provides | DataDictionary.guidelines: str, populated from this file at load time |
| verify-acceptance.sh line 73: assert len(d['kpi_strip']) == 5 | Adding 3 metrics doesn't change kpi_strip (it's a layout doc, not a catalog count); line 35's >= 10 check still holds when the count becomes 13 ✓ | No action — confirmed safe |
| l3_ubreakifix schema is sparse — resolved during execution: all 4 schemas (l3_asurion, l3_att, l3_ubreakifix, l3_verizon) are fully described in the current dictionary; sparse handling is dormant code that fires only if a future revision drops a schema's column rows | n/a today | Loader still records sparse_schemas; test_no_sparse_schemas_today asserts the empty list so a future regression surfaces |
| Type semicolon ambiguity ('string; varchar') — pick the first or the second half? | An inconsistent choice → drift between seeder DDL and the dictionary | Type mapper picks the FIRST half (string), which is Databricks-canonical; the second half documents the source-system alias |
| Two coexisting Databricks schemas (asurion_prototype keeps the v1 mirror; l3_asurion gets dictionary-shaped tables) | The health endpoint hits asurion_prototype.customers; SQL gen will hit l3_asurion.ev_claim. The validator must scope its DESCRIBE TABLE calls correctly | validate_dictionary.py reads DATABRICKS_CATALOG (not the schema env) and iterates the l3_* schemas in the dictionary — a separate code path from health |
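The first-half rule and the equivalence check described above can be sketched as follows. This is a minimal illustration: only the names to_databricks_type and equivalent come from the plan; the alias table and bodies are assumptions, not the shipped mapping.

```python
# Hypothetical sketch of backend/app/sql_gen/type_mapper.py.
# Assumed alias set: source-system names normalized to Databricks-canonical ones.
_ALIASES = {"integer": "int", "varchar": "string"}

def to_databricks_type(raw: str) -> str:
    """'string; varchar' -> 'string'; 'decimal(38,0)' passes through as-is."""
    first = raw.split(";")[0].strip().lower()  # FIRST half wins per the plan
    return _ALIASES.get(first, first)

def equivalent(a: str, b: str) -> bool:
    """Validator's type-match check: equal after normalization."""
    return to_databricks_type(a) == to_databricks_type(b)
```

Because both the seeder DDL and the validator call the same normalizer, 'int; integer' in the dictionary and an int column in DESCRIBE TABLE output compare equal.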

## Architecture (after Prompt 2)

```mermaid
flowchart LR
    subgraph repo[2026-hackathon repo]
        dict["data-dictionary/<br/>4 CSVs + guidelines.md"]
        configYaml["config/metric_routing.yaml"]
        initSql["db/init.sql<br/>(canonical metrics_catalog DDL)"]
        l3Ddl["backend/databricks_schema/l3_asurion/02_tables.sql"]
    end
    subgraph api[FastAPI api container]
        loader["app.sql_gen.dictionary_loader<br/>(LRU-cached)"]
        routing["app.sql_gen.routing<br/>(boot validator)"]
        catalog[("Postgres<br/>metrics_catalog<br/>13 rows")]
        lifespan["main.lifespan:<br/>seed → validate_routing → yield"]
    end
    subgraph dbx[Databricks workspace]
        asuProto["workspace.asurion_prototype.*<br/>(v1 mirror, /databricks/health)"]
        l3Asurion["workspace.l3_asurion.*<br/>(ev_claim 5000, ev_product_catalog 500)"]
    end
    seederL3["scripts.databricks_mock_data.l3_asurion_main"]
    validator["scripts.validate_dictionary"]

    dict -->|mount ro| loader
    initSql -->|seeds| catalog
    configYaml -->|loaded by| routing
    catalog -->|cross-checked| routing
    routing -->|passes| lifespan
    seederL3 -->|reads| dict
    seederL3 -->|writes| l3Asurion
    l3Ddl -->|applied by| seederL3
    validator -->|reads| dict
    validator -->|DESCRIBE TABLE| l3Asurion
```

## Files to create

| Path | Purpose |
| --- | --- |
| backend/app/sql_gen/__init__.py | Package marker |
| backend/app/sql_gen/data_dictionary.py | Pydantic models — ColumnSchema, TableSchema, JoinKey, KpiDefinition, DataDictionary. subset_for_tables(["l3_asurion.ev_claim", ...]) for per-request prompt slicing; guidelines: str carries ai_query_guidelines.md for Prompt 3 |
| backend/app/sql_gen/dictionary_loader.py | stdlib-csv loader rooted at DATA_DICTIONARY_ROOT, LRU-cached on (root, mtime). Reads all 4 CSVs + ai_query_guidelines.md. Public: load_dictionary(root) + get_default() |
| backend/app/sql_gen/type_mapper.py | Translates column_dictionary.data_type → Databricks DDL type; picks the first half of 'a; b'. equivalent(a, b) backs the validator's type-match check |
| backend/app/sql_gen/routing.py | Loads config/metric_routing.yaml. validate_routing_against_catalog(conn) -> None raises on unmapped metric_id; the inverse check is WARN-only |
| config/metric_routing.yaml | 13 entries, keyed by metric name (snake_case). Schema: routing.<name>.{backend, sql_template, cache_seconds} |
| backend/databricks_schema/l3_asurion/01_create_schema.sql | CREATE SCHEMA IF NOT EXISTS workspace.l3_asurion |
| backend/databricks_schema/l3_asurion/02_tables.sql | DDL for ev_claim (29 cols) + ev_product_catalog (23 cols), types via type_mapper; CREATE OR REPLACE TABLE |
| backend/scripts/databricks_mock_data/l3_asurion_generators.py | Hand-written generators. Column names sourced from column_dictionary.csv; Hkey values are bigint-shaped fakes; timestamps fall in the last 90 days; product_catalog_id is chosen from the seeded ev_product_catalog |
| backend/scripts/databricks_mock_data/l3_asurion_main.py | CLI mirror of main.py, but for schema='l3_asurion' |
| backend/scripts/validate_dictionary.py | Two-pass validator |
| backend/tests/test_dictionary_loader.py | Pure-Python: 4 schemas, 17 tables, 140 joins, 43 KPIs; subset filtering; orphan-FK detection; sparse-schema handling |
| backend/tests/test_routing_validator.py | Boot validator raises on unmapped, passes the baseline; reverse check WARNs |
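The (root, mtime) cache key described for dictionary_loader.py can be sketched like this. Only one CSV is parsed here, and the dataclass is a stand-in for the real Pydantic models; file names follow the plan.

```python
# Hypothetical sketch of the loader's caching strategy (dictionary_loader.py).
import csv
import os
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path

@dataclass(frozen=True)
class DataDictionary:
    columns: tuple        # parsed rows (only one of the 4 CSVs shown here)
    guidelines: str       # raw text of ai_query_guidelines.md, for Prompt 3

@lru_cache(maxsize=4)
def _load(root: str, mtime: float) -> DataDictionary:
    base = Path(root)
    with open(base / "column_dictionary.csv", newline="") as f:
        columns = tuple(dict(r) for r in csv.DictReader(f))
    return DataDictionary(columns, (base / "ai_query_guidelines.md").read_text())

def load_dictionary(root: str) -> DataDictionary:
    # mtime is part of the cache key, so an edited dictionary directory is
    # re-read without restarting the process.
    return _load(root, os.stat(root).st_mtime)
```

Repeated calls with an unchanged directory hit the cache and return the same object, which is what makes per-request prompt slicing cheap.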

## Files to modify

| Path | Change |
| --- | --- |
| db/init.sql | Edit CREATE TABLE metrics_catalog (lines 138-152) — add source_schema, source_table, source_query, last_validated_at, validation_status (default 'valid'), governance_status (default 'approved'), approved_by. NO ALTER TABLE |
| backend/app/metrics/catalog.py | Update the _TABLE_DDL constant (lines 18-34) to mirror db/init.sql exactly — dual source of truth. Update _row_to_catalog, list_metrics, get_metric, get_metric_by_name, create_metric, upsert_metric_from_definition, metric_definition_from_catalog |
| backend/app/metrics/schemas.py | Add 'ev_claim', 'ev_product_catalog' to the MetricEntity Literal (BARE names). Add optional lineage fields to MetricDefinition/CreateMetricBody/CatalogMetric |
| backend/app/metrics/seed.py | Append the 3 new metrics. Add source_table to the v1 metrics; leave source_query=None for them |
| backend/app/main.py | Import + call validate_routing_against_catalog(conn) AFTER seed_metrics_if_empty |
| backend/requirements.txt | + PyYAML |
| docker-compose.yml | + ./data-dictionary:/app/data-dictionary:ro mount; + DATA_DICTIONARY_ROOT env |
| backend/app/settings.py | + data_dictionary_root: str = "/app/data-dictionary" |
| Makefile | + seed-databricks-l3 target |
| frontend/src/widgets/types.ts | Mirror the new MetricEntity literals + optional lineage fields |
| docs/plans/active/part-c-databricks-prototype.md | Edit the prompt_2_dictionary_loader todo; add a build-discipline note |

## Files NOT touched

  • The 4 CSVs + ai_query_guidelines.md in data-dictionary/ — canonical, untouched.
  • backend/databricks_schema/{01,02,03}*.sql — v1 mirror DDL stays.
  • backend/scripts/databricks_mock_data/{generators,uploader,main}.py — v1 seeder stays.
  • v1 metrics seed semantics (formula / definition / label) — only adding lineage fields.
  • backend/app/widgets/llm.py, spec_synthesizer.py — entity literals stay bare.
  • frontend/src/widgets/MetricInfoBadge.tsx, SpecJsonView.tsx — Prompt 5 territory.
  • scripts/verify-acceptance.sh — assertions remain valid.

## Acceptance gates (in order)

  1. make demo-reset && make up — fresh Postgres pulls db/init.sql with new lineage cols. Lifespan seeds 13 metrics. Routing validator passes.
  2. docker exec 2026-hackathon-api-1 env | grep DATA_DICTIONARY_ROOT — returns /app/data-dictionary.
  3. make seed-databricks-l3 — workspace.l3_asurion.ev_claim 5000 rows; ev_product_catalog 500 rows. Idempotent with --reset.
  4. docker exec 2026-hackathon-api-1 python -m scripts.validate_dictionary — exit 0. Seeded subset green; unseeded tables WARN. Satisfies plan acceptance #3.
  5. docker exec 2026-hackathon-api-1 pytest backend/tests/test_dictionary_loader.py backend/tests/test_routing_validator.py -v — green.
  6. bash scripts/verify-acceptance.sh — still green (the >=10 catalog assertion holds at 13).
  7. Fail-loud manual check — psql in, INSERT a metric_id with no routing entry, restart api, confirm RuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric'] in the logs.
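The fail-loud contract in that last gate reduces to a set difference. A minimal sketch — the error string matches the plan, but the signature is simplified (the real validator takes a DB connection and the parsed routing YAML):

```python
# Sketch of validate_routing_against_catalog's two checks:
# unmapped catalog metric -> hard RuntimeError; the inverse is WARN-only.
import logging

log = logging.getLogger("routing")

def validate_routing(catalog_names: set, routing: dict) -> None:
    unmapped = sorted(catalog_names - routing.keys())
    if unmapped:
        raise RuntimeError(f"Unmapped metric_ids in metrics_catalog: {unmapped}")
    stale = sorted(routing.keys() - catalog_names)
    if stale:  # routing entries with no catalog row are tolerated
        log.warning("Routing entries without catalog rows: %s", stale)
```

The asymmetry is deliberate: a catalog metric with no route would fail at query time in front of a user, while a stale routing entry is merely dead config.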

## Time-box

90 minutes. Cut order if overrun:

1. load_dictionary_local.py CLI.
2. Frontend types.ts widening (defer to Prompt 5).
3. Live pass of validate_dictionary.py (keep the structural pass).

Non-negotiables: dictionary loader, routing YAML + validator wired into lifespan, the 3 seed metrics with lineage, l3_asurion seeder producing rows.

## Execution log (session 2026-05-06)

Status: implementation complete; last gate (make demo-reset && make seed-databricks-l3 && make validate-dictionary) is the user's to run on a fresh stack. All code, config, tests, and DDL are in place. Pytest is green inside the running api container (44/44 passing — 19 new + 25 pre-existing).

What landed (verified in-container):

  • backend/app/sql_gen/{__init__.py, data_dictionary.py, dictionary_loader.py, type_mapper.py, routing.py} — all green, structural-pass validator finds 0 errors / 0 warnings against the real dictionary (52 tables, 1663 columns, 138 joins, 42 KPIs).
  • db/init.sql + backend/app/metrics/catalog.py::_TABLE_DDL — dual-edit landed; lineage cols + CHECK constraints present in both.
  • backend/app/metrics/{schemas.py, seed.py, catalog.py} — MetricEntity widened with ev_claim + ev_product_catalog (bare names); 13 CreateMetricBody entries (10 v1 + 3 Databricks); CRUD reads/writes lineage cols.
  • config/metric_routing.yaml — 13 entries; loader resolves via METRIC_ROUTING_PATH env or repo walk-up.
  • backend/app/main.py — lifespan now calls validate_routing_against_catalog(conn) after seeders; container exits non-zero on unmapped metric_id.
  • backend/scripts/databricks_mock_data/{l3_asurion_generators.py, l3_asurion_main.py} — DDL rendered on the fly from the dictionary via to_databricks_type; no static 02_tables.sql file (eliminated drift surface).
  • backend/scripts/validate_dictionary.py + make validate-dictionary — structural pass green; live pass surfaces clean ERRORs (with hint to run make seed-databricks-l3) when seeded subset is absent.
  • docker-compose.yml — ./data-dictionary:/app/data-dictionary:ro and ./config:/app/config:ro mounts; DATA_DICTIONARY_ROOT + METRIC_ROUTING_PATH env passthrough.
  • backend/app/settings.py — data_dictionary_root field added.
  • frontend/src/widgets/types.ts — MetricEntity extended; lineage fields optional + backward-compatible.
  • Makefile — seed-databricks-l3 and validate-dictionary targets added; .PHONY updated.
  • backend/tests/{test_dictionary_loader.py, test_routing_validator.py} — 19 tests, all green.
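The on-the-fly DDL rendering noted above (replacing a static 02_tables.sql) can be sketched like this. The column-row shape is an assumption, and the inlined type helper is a stand-in for type_mapper.to_databricks_type:

```python
# Sketch: render CREATE OR REPLACE TABLE DDL straight from dictionary rows,
# so the seeder's DDL can never drift from column_dictionary.csv.
def to_databricks_type(raw: str) -> str:
    # stand-in for type_mapper.to_databricks_type: first half of 'a; b' wins
    return raw.split(";")[0].strip().lower()

def render_create_table(schema: str, table: str, cols: list) -> str:
    body = ",\n  ".join(
        f"{c['column']} {to_databricks_type(c['data_type'])}" for c in cols
    )
    return f"CREATE OR REPLACE TABLE workspace.{schema}.{table} (\n  {body}\n)"
```

Deleting the static SQL file and rendering at seed time is what the log calls "eliminated drift surface": the dictionary is the only source of column names and types.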

Acceptance gates remaining (user-facing):

  1. make demo-reset && make up — fresh Postgres pulls new init.sql cols; lifespan seeds 13 metrics; routing validator passes.
  2. make seed-databricks-l3 — populates workspace.l3_asurion.{ev_product_catalog, ev_claim}.
  3. make validate-dictionary — exits 0 once gate 2 has run (today exits 1 because seeded subset is absent — by design).
  4. bash scripts/verify-acceptance.sh — should remain green; the catalog-count assertion holds at 13 ≥ 10.
  5. Fail-loud manual check — psql in, INSERT INTO metrics_catalog (name, label, definition, formula, entity) VALUES ('orphan_metric', ...), restart api; expect RuntimeError: Unmapped metric_ids in metrics_catalog: ['orphan_metric'].

## Cross-references