
Databricks mock-data ingest + Prompt 1 (connectivity)

Status: completed 2026-05-06. All Phase A / B / C acceptance checks green. prompt_1_databricks_client and acceptance_dryrun_2 marked complete in part-c-databricks-prototype.md.

Final results:

  • Phase A: 3 SQL files (backend/databricks_schema/).
  • Phase B: make seed-databricks runs in 205s (under MD-2's 5-min budget). Row counts: customers 5000, devices 7582, issue_sessions 20000, claims 10326, ai_recommendations 20000, outcomes 15257, operations_context 48. Validation views return realistic shapes (CA/TX/FL/NY top regions, ~50% repair rate, $777K cost_avoided_mtd).
  • Phase C: GET /v1/databricks/health returns 200 / rows_sampled=5000 in ~1s warm. Negative path (bogus host) returns 503 application/problem+json kind=databricks_unavailable in 0.7s. 9/9 client tests pass (pytest -m databricks_live and unit). Full suite 24/24, no regressions.

Volume deviation from MD-2: issue_sessions defaulted to 20K instead of 50K — the Free Trial Serverless Starter warehouse couldn't push 50K via batched INSERT VALUES within MD-2's 5-min wall budget. Acceptance check B2's active_issues ~15K therefore landed at ~4K instead. The qualitative shape (regional skew, conversion rate, weekly cycle) is preserved; only the absolute volume is reduced. Documented in generators.py module docstring + Lessons-learned candidate below.

Why this ordering

Phase B's --reset mode runs the Phase A DDL itself, so the user does not have to paste SQL into the Databricks UI manually. By the time Phase C runs curl /v1/databricks/health, the table exists with seeded, business-correlated data, and the acceptance check is meaningful.

Phase A (MD-1) ──► Phase B (MD-2) ──► Phase C (Prompt 1) ──► Acceptance: rows_sampled = 5000
schema DDL        Python generator     FastAPI client +
files (no run)    + uploader runs      /v1/databricks/health
                  the DDL on --reset

Phase B and Phase C share the databricks-sql-connector install — adding it once in Phase B's requirements.txt edit covers both.

Cross-references — read before starting

  • databricks-mock-data.md §§ MD-1, MD-2, MD-3 — source of truth for the schema, the generator distributions, and the env block.
  • docs/plans/active/part-c-databricks-prototype.md prompt_1_databricks_client — the original Prompt 1 todo (this plan executes a subset of it).
  • prompts.md Prompt 1 (lines 86-128) — the literal Prompt 1 spec: file list, behavior, acceptance.
  • prompts.md § Time-box discipline (line 459) — why we ship PAT-only and defer OAuth M2M / Service Principal.
  • docs/lessons-learned.md — five entries directly compliance-checked below.

Step 0 — User actions before execution

  1. Confirm Free Trial setup per databricks-mock-data.md § What you need before starting lines 19-29: workspace URL noted, PAT generated, Serverless Starter warehouse provisioned (auto-stop 10 min), schema asurion_prototype created in the workspace catalog (or let Phase B create it via CREATE SCHEMA IF NOT EXISTS).
  2. Add to .env (the host already has DATABRICKS_HOST + DATABRICKS_TOKEN):
    DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/<your-warehouse-id>
    DATABRICKS_CATALOG=workspace
    DATABRICKS_SCHEMA=asurion_prototype
    DATABRICKS_HEALTH_TABLE=customers
    
    Get HTTP_PATH from Databricks UI → SQL Warehouses → your warehouse → Connection details. The https:// already in DATABRICKS_HOST is fine — Phase C's _normalize_host strips it defensively.

Phase A — Schema DDL files (~30 min, mirrors MD-1)

Files created (no execution; Phase B's main script runs them):

  • backend/databricks_schema/01_create_schema.sql — CREATE SCHEMA IF NOT EXISTS workspace.asurion_prototype; USE workspace.asurion_prototype;
  • backend/databricks_schema/02_tables.sql — 7 CREATE OR REPLACE TABLE statements per MD-1 lines 60-138. Idempotent. Time-series tables PARTITIONED BY DATE(timestamp_col).
  • backend/databricks_schema/03_validation_views.sql — 4 views (compile against empty tables; populate with real numbers after Phase B):
    • vw_kpi_executive_overview
    • vw_cost_avoided_by_region_30d
    • vw_repair_replace_conversion_30d
    • vw_issue_volume_by_region

Phase B — Mock data generator + uploader (~75 min, mirrors MD-2)

Adds dev deps + builds the seeded generator + ingests data through batched INSERTs.

New deps (backend/requirements.txt)

databricks-sql-connector>=3.0,<4.0
Faker>=24.0,<30.0
numpy>=1.26,<3.0

Files created

  • backend/scripts/databricks_mock_data/__init__.py
  • backend/scripts/databricks_mock_data/generators.py — 7 generator functions per MD-2 lines 179-246. Each takes seed: int, builds numpy.random.RandomState(seed), returns list[dict].
  • backend/scripts/databricks_mock_data/uploader.py — databricks-sql-connector wrapper. Per table: drop + recreate from Phase A DDL → batched 1000-row INSERTs with progress prints → final SELECT COUNT(*).
  • backend/scripts/databricks_mock_data/main.py — CLI entry. --reset drops the 7 tables, runs Phase A DDL, generates + uploads all entities, runs the 4 validation views and prints first 5 rows of each.
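The generator contract above (seed in, RandomState, list[dict] out) can be sketched as follows. Field names, weights, and the region list here are illustrative placeholders, not MD-2's actual distributions:

```python
import numpy as np

# Illustrative region list; MD-2's real distributions live in generators.py.
REGIONS = ["CA", "TX", "FL", "NY", "IL", "PA", "OH", "GA", "NC", "MI", "WA", "AZ"]

def generate_customers(seed: int, n: int = 5000) -> list[dict]:
    # Deterministic for a given seed: the same RandomState yields the same
    # rows on every run, which is what acceptance check B5 relies on.
    rng = np.random.RandomState(seed)
    # Skew probability mass toward CA/TX/FL/NY so the regional views rank
    # them on top (the exact weights are assumptions).
    weights = np.array([0.18, 0.14, 0.11, 0.10] + [0.059] * 8)
    weights /= weights.sum()
    return [
        {
            "customer_id": f"CUST-{i:05d}",
            "region": REGIONS[rng.choice(len(REGIONS), p=weights)],
            "tenure_months": int(rng.randint(1, 120)),
        }
        for i in range(n)
    ]
```

Calling generate_customers(42) twice returns identical lists, which is why re-seeding produces identical row counts.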

Makefile

seed-databricks:
    docker exec -e PYTHONPATH=/app api python -m scripts.databricks_mock_data.main --reset
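The target above ultimately drives uploader.py's batched INSERTs. A sketch of the 1000-row chunking against any DB-API cursor (databricks-sql-connector's included) — the function name, the %s paramstyle, and the SQL shape are assumptions:

```python
from typing import Any, Sequence

BATCH = 1000  # rows per INSERT ... VALUES statement, per Phase B's design

def insert_batched(cursor: Any, table: str, rows: Sequence[dict]) -> int:
    # Sketch: emit one multi-row INSERT per 1000 rows. Column order is taken
    # from the first row; every row is assumed to share the same keys.
    if not rows:
        return 0
    cols = list(rows[0])
    tuple_sql = "(" + ", ".join(["%s"] * len(cols)) + ")"
    sent = 0
    for start in range(0, len(rows), BATCH):
        chunk = rows[start:start + BATCH]
        sql = (f"INSERT INTO {table} ({', '.join(cols)}) VALUES "
               + ", ".join([tuple_sql] * len(chunk)))
        cursor.execute(sql, [row[c] for row in chunk for c in cols])
        sent += len(chunk)
        print(f"{table}: {sent}/{len(rows)} rows")  # progress print per batch
    return sent
```

Batching matters here because the Free Trial warehouse is the bottleneck — per the volume-deviation note above, even 1000-row batches could not push 50K issue_sessions inside the 5-minute budget.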

Phase B acceptance (MD-2 lines 276-284)

  • B1 — make seed-databricks completes in <5 min with no errors
  • B2 — vw_kpi_executive_overview shows active_issues ~15,000, claims_in_progress ~5,000, cost_avoided_mtd $400K-$1.2M
  • B3 — vw_cost_avoided_by_region_30d returns 12 rows; CA/TX/FL/NY at top
  • B4 — vw_repair_replace_conversion_30d shows 60-65% repair rate (not 100%, not 0%)
  • B5 — re-running make seed-databricks produces identical row counts (seed determinism)

Phase C — FastAPI client + /v1/databricks/health (~90 min, mirrors Prompt 1)

Files created

  • backend/app/databricks/__init__.py
  • backend/app/databricks/exceptions.py — DatabricksError base + 4 subclasses
  • backend/app/databricks/client.py — _normalize_host, _auth_kwargs (PAT only), pooled connections, execute()
  • backend/app/databricks/health.py — run_health_check() against {catalog}.{schema}.{health_table} with identifier validation
  • backend/tests/test_databricks_client.py — 3 tests (1 marked databricks_live, 2 unit)

Files modified

  • backend/app/settings.py — 11 new DATABRICKS_* fields
  • backend/app/main.py — GET /v1/databricks/health returning 503 application/problem+json on DatabricksError
  • docker-compose.yml — pass through 11 vars with ${VAR:-default} syntax
  • Makefile — databricks-health target
  • backend/tests/conftest.py — register databricks_live marker
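The 503 path in main.py is RFC 7807 problem+json plus this plan's kind discriminator. A minimal, framework-free sketch of the payload builder — the "type" URN and title wording are illustrative assumptions; only kind=databricks_unavailable and status 503 come from this plan:

```python
def problem_body(kind: str, detail: str, status: int = 503) -> dict:
    # Hedged sketch of the application/problem+json payload returned on
    # DatabricksError. "type"/"title"/"status"/"detail" follow RFC 7807;
    # "kind" is this plan's discriminator field.
    return {
        "type": f"urn:problem:{kind}",   # assumption: any stable URI is valid
        "title": kind.replace("_", " "),
        "status": status,
        "detail": detail,
        "kind": kind,
    }
```

In FastAPI this would be wrapped as JSONResponse(status_code=503, media_type="application/problem+json", content=problem_body(...)) inside an exception handler registered for DatabricksError — which is what keeps C5 at a structured 503 instead of a stack-traced 500.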

Phase C acceptance gate

  • C1 — make up boots the stack; docker exec api env | grep DATABRICKS_ shows all 11 vars
  • C2 — curl http://localhost:8000/v1/databricks/health returns 200 with rows_sampled = 5000
  • C3 — make databricks-health works as a shorthand for C2
  • C4 — docker exec api pytest backend/tests/test_databricks_client.py -v -m databricks_live passes
  • C5 — negative path: bogus DATABRICKS_HTTP_PATH → 503 with Content-Type: application/problem+json and a structured body; no stack trace, no 500

Lessons-learned compliance

  • § Stale containers hide UI work — make up after every requirements.txt / settings.py / compose / main.py edit before testing.
  • § Watch the live env vars on make up, not just the file — gate before declaring health green: docker exec api env | grep DATABRICKS_ shows all 11 vars.
  • § Mocks must be opt-in, never silent fallback — no MockDatabricks. Connection failure = 503 + application/problem+json. Mirrors ADR-008.
  • § Bake auxiliary code into the image — backend/app/databricks/ lives under already-COPY'd app/; backend/scripts/databricks_mock_data/ lives under already-COPY'd scripts/. No Dockerfile edits needed.
  • § pydantic-settings parses empty env strings strictly for booleans — avoided by typing all new settings as str / int / Literal.
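The pydantic-settings lesson can be illustrated dependency-free. The real settings.py uses pydantic-settings, but the typing rule is the same — keep env-backed fields str (or int), never bool, so an empty string in .env cannot trip strict boolean parsing. A sketch with a plain dataclass and a subset of the 11 fields (defaults mirror the env block in Step 0):

```python
import os
from dataclasses import dataclass

@dataclass
class DatabricksSettings:
    # Every env-backed field is str — never bool — per the lesson above.
    # Subset of the 11 DATABRICKS_* fields; the real code uses pydantic-settings.
    host: str = ""
    http_path: str = ""
    catalog: str = "workspace"
    schema: str = "asurion_prototype"
    health_table: str = "customers"

    @classmethod
    def from_env(cls) -> "DatabricksSettings":
        env = os.environ.get
        return cls(
            host=env("DATABRICKS_HOST", ""),
            http_path=env("DATABRICKS_HTTP_PATH", ""),
            catalog=env("DATABRICKS_CATALOG", "workspace"),
            schema=env("DATABRICKS_SCHEMA", "asurion_prototype"),
            health_table=env("DATABRICKS_HEALTH_TABLE", "customers"),
        )
```

An unset or empty variable just yields an empty string here, whereas a bool-typed field would raise a validation error at import time.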

Out of scope (deferred)

  • seed.py source_query mappings for the 3 demo metrics — Prompt 2 (next plan); MD-3 lines 315-345
  • config/data_dictionary.yaml hand-written for the 7 tables — Prompt 2; MD-3 lines 307-313
  • config/metric_routing.yaml per-metric routing — Prompt 2; MD-3 lines 347-348
  • metrics_catalog lineage column ALTERs — Prompt 2
  • SQL Generator service + safety layer + sqlglot — Prompt 3
  • POST /v1/widgets/{id}/data resolver + Redis cache + per-metric routing — Prompt 4
  • Frontend useWidgetData + SourceBadge + MetricInfoBadge enhancements — Prompt 5
  • Demo runbook + architecture diagram + whats-mocked doc — Prompt 6
  • OAuth M2M / Service Principal auth wiring — time-box deferred per prompts.md line 459; dispatch shape in place

Done when

  • All 5 Phase B acceptance checks + all 5 Phase C acceptance checks pass.
  • prompt_1_databricks_client and acceptance_dryrun_2 are marked complete in part-c-databricks-prototype.md.
  • This plan moves to docs/plans/completed/databricks-mock-data-and-prompt-1.md with status: completed and completed_on: <date>.