# Databricks mock-data ingest + Prompt 1 (connectivity)
Status: completed 2026-05-06. All Phase A / B / C acceptance checks green.
`prompt_1_databricks_client` and `acceptance_dryrun_2` marked complete in `part-c-databricks-prototype.md`.

Final results:

- Phase A: 3 SQL files (`backend/databricks_schema/`).
- Phase B: `make seed-databricks` runs in 205s (under MD-2's 5-min budget). Row counts: customers 5000, devices 7582, issue_sessions 20000, claims 10326, ai_recommendations 20000, outcomes 15257, operations_context 48. Validation views return realistic shapes (CA/TX/FL/NY top regions, ~50% repair rate, $777K cost_avoided_mtd).
- Phase C: `GET /v1/databricks/health` returns 200 with `rows_sampled=5000` in ~1s warm. Negative path (bogus host) returns 503 `application/problem+json` with `kind=databricks_unavailable` in 0.7s. 9/9 client tests pass (`pytest -m databricks_live` and unit). Full suite 24/24, no regressions.

Volume deviation from MD-2: `issue_sessions` defaulted to 20K instead of 50K — the Free Trial Serverless Starter warehouse couldn't push 50K via batched `INSERT VALUES` within MD-2's 5-min wall budget. Acceptance check B2's `active_issues ~15K` therefore landed at ~4K instead. The qualitative shape (regional skew, conversion rate, weekly cycle) is preserved; only the absolute volume is reduced. Documented in the `generators.py` module docstring + the lessons-learned candidate below.
## Why this ordering

Phase B's `--reset` mode runs the Phase A DDL itself, so the user does not have to paste SQL into the Databricks UI manually. By the time Phase C runs `curl /v1/databricks/health`, the tables exist with seeded, business-correlated data, and the acceptance check is meaningful.

Phase B and Phase C share the `databricks-sql-connector` install — adding it once in Phase B's `requirements.txt` edit covers both.
## Cross-references — read before starting

| Reference | Why |
|---|---|
| `databricks-mock-data.md` §§ MD-1, MD-2, MD-3 | Source of truth for the schema, the generator distributions, and the env block. |
| `docs/plans/active/part-c-databricks-prototype.md` `prompt_1_databricks_client` | The original Prompt 1 todo that this plan executes (a subset of it). |
| `prompts.md` Prompt 1 (lines 86-128) | The literal Prompt 1 spec — file list, behavior, acceptance. |
| `prompts.md` § Time-box discipline (line 459) | Why we ship PAT-only and defer OAuth M2M / Service Principal. |
| `docs/lessons-learned.md` | Five entries compliance-checked directly below. |
## Step 0 — User actions before execution

- Confirm Free Trial setup per `databricks-mock-data.md` § What you need before starting (lines 19-29): workspace URL noted, PAT generated, Serverless Starter warehouse provisioned (auto-stop 10 min), schema `asurion_prototype` created in the `workspace` catalog (or let Phase B create it via `CREATE SCHEMA IF NOT EXISTS`).
- Add `DATABRICKS_HTTP_PATH` to `.env` (the host already has `DATABRICKS_HOST` + `DATABRICKS_TOKEN`). Get the HTTP path from Databricks UI → SQL Warehouses → your warehouse → Connection details. The `https://` already in `DATABRICKS_HOST` is fine — Phase C's `_normalize_host` strips it defensively.
## Phase A — Schema DDL files (~30 min, mirrors MD-1)
Files created (no execution; Phase B's main script runs them):
- `backend/databricks_schema/01_create_schema.sql` — `CREATE SCHEMA IF NOT EXISTS workspace.asurion_prototype; USE workspace.asurion_prototype;`
- `backend/databricks_schema/02_tables.sql` — 7 `CREATE OR REPLACE TABLE` statements per MD-1 lines 60-138. Idempotent. Time-series tables `PARTITIONED BY DATE(timestamp_col)`.
- `backend/databricks_schema/03_validation_views.sql` — 4 views (they compile against empty tables; populate with real numbers after Phase B): `vw_kpi_executive_overview`, `vw_cost_avoided_by_region_30d`, `vw_repair_replace_conversion_30d`, `vw_issue_volume_by_region`.
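Since the SQL connector executes one statement per call, the Phase B runner has to split each DDL file into individual statements first. A splitter along these lines suffices for these files (a hypothetical helper, not the actual script; it assumes no semicolons inside string literals):

```python
def split_sql_statements(script: str) -> list[str]:
    """Split a DDL script on ';', dropping blanks and '--' comment lines."""
    statements = []
    for chunk in script.split(";"):
        stmt = "\n".join(
            line for line in chunk.splitlines()
            if line.strip() and not line.strip().startswith("--")
        ).strip()
        if stmt:
            statements.append(stmt)
    return statements
```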
## Phase B — Mock data generator + uploader (~75 min, mirrors MD-2)

Adds dev deps, builds the seeded generator, and ingests data through batched INSERTs.

### New deps (`backend/requirements.txt`)

### Files created
- `backend/scripts/databricks_mock_data/__init__.py`
- `backend/scripts/databricks_mock_data/generators.py` — 7 generator functions per MD-2 lines 179-246. Each takes `seed: int`, builds `numpy.random.RandomState(seed)`, returns `list[dict]`.
- `backend/scripts/databricks_mock_data/uploader.py` — `databricks-sql-connector` wrapper. Per table: drop + recreate from Phase A DDL → batched 1000-row INSERTs with progress prints → final `SELECT COUNT(*)`.
- `backend/scripts/databricks_mock_data/main.py` — CLI entry. `--reset` drops the 7 tables, runs Phase A DDL, generates + uploads all entities, runs the 4 validation views and prints the first 5 rows of each.
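The uploader's batching can be sketched as a generator of `INSERT INTO … VALUES` statements. This is an illustration only — the function name and the naive value rendering are assumptions; the real uploader may bind parameters through the connector instead:

```python
def batched_insert_sql(table: str, columns: list[str], rows: list[dict],
                       batch_size: int = 1000):
    """Yield INSERT statements covering `rows` in batches of `batch_size`."""
    def render(v):
        # Naive SQL-literal rendering: NULL, bare numbers, quoted strings.
        if v is None:
            return "NULL"
        if isinstance(v, (int, float)) and not isinstance(v, bool):
            return str(v)
        return "'" + str(v).replace("'", "''") + "'"

    col_list = ", ".join(columns)
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        values = ", ".join(
            "(" + ", ".join(render(r[c]) for c in columns) + ")" for r in batch
        )
        yield f"INSERT INTO {table} ({col_list}) VALUES {values}"
```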
### Makefile

### Phase B acceptance (MD-2 lines 276-284)
| # | Check |
|---|---|
| B1 | `make seed-databricks` completes in <5 min with no errors |
| B2 | `vw_kpi_executive_overview` shows active_issues ~15,000, claims_in_progress ~5,000, cost_avoided_mtd $400K-$1.2M |
| B3 | `vw_cost_avoided_by_region_30d` returns 12 rows; CA/TX/FL/NY at top |
| B4 | `vw_repair_replace_conversion_30d` shows 60-65% repair rate (not 100%, not 0%) |
| B5 | Re-running `make seed-databricks` produces identical row counts (seed determinism) |
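Check B5 rests on every generator deriving all of its randomness from a single `RandomState(seed)`, so a rerun reproduces the rows byte-for-byte. A toy illustration of the pattern (not the real `generators.py` — names and distribution are made up):

```python
import numpy as np

def generate_customers(seed: int, n: int = 200) -> list[dict]:
    """Toy seeded generator: all randomness flows from RandomState(seed)."""
    rng = np.random.RandomState(seed)
    regions = rng.choice(["CA", "TX", "FL", "NY", "OTHER"], size=n,
                         p=[0.30, 0.20, 0.15, 0.15, 0.20])
    return [{"customer_id": i, "region": str(r)} for i, r in enumerate(regions)]

# Same seed -> identical rows, so re-seeding yields identical row counts.
assert generate_customers(42) == generate_customers(42)
```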
## Phase C — FastAPI client + `/v1/databricks/health` (~90 min, mirrors Prompt 1)

### Files created
- `backend/app/databricks/__init__.py`
- `backend/app/databricks/exceptions.py` — `DatabricksError` base + 4 subclasses
- `backend/app/databricks/client.py` — `_normalize_host`, `_auth_kwargs` (PAT only), pooled connections, `execute()`
- `backend/app/databricks/health.py` — `run_health_check()` against `{catalog}.{schema}.{health_table}` with identifier validation
- `backend/tests/test_databricks_client.py` — 3 tests (1 marked `databricks_live`, 2 unit)
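The exception module might be shaped like the following. Only the `DatabricksError` base, the count of four subclasses, and the `databricks_unavailable` kind come from this plan; the subclass names and the `kind` attribute are guesses for illustration:

```python
class DatabricksError(Exception):
    """Base for all Databricks client failures; maps to a 503 problem body."""
    kind = "databricks_unavailable"

class DatabricksAuthError(DatabricksError):
    kind = "databricks_auth_failed"

class DatabricksTimeoutError(DatabricksError):
    kind = "databricks_timeout"

class DatabricksQueryError(DatabricksError):
    kind = "databricks_query_failed"

class DatabricksConfigError(DatabricksError):
    kind = "databricks_misconfigured"
```

Catching the base class in one place keeps the 503 handler in `main.py` ignorant of which leg of the client failed.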
### Files modified
- `backend/app/settings.py` — 11 new `DATABRICKS_*` fields
- `backend/app/main.py` — `GET /v1/databricks/health` returning 503 `application/problem+json` on `DatabricksError`
- `docker-compose.yml` — pass through 11 vars with `${VAR:-default}` syntax
- `Makefile` — `databricks-health` target
- `backend/tests/conftest.py` — register `databricks_live` marker
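The 503 body can be sketched as an RFC 7807 payload builder that the endpoint serves with the `application/problem+json` media type. The `kind` field and the 503 status come from this plan; every other field name here is an assumption:

```python
class DatabricksError(Exception):
    """Stand-in for the base class in backend/app/databricks/exceptions.py."""
    kind = "databricks_unavailable"

def problem_payload(exc: DatabricksError, status: int = 503) -> dict:
    """Build a structured problem body: no stack trace, no bare 500."""
    return {
        "type": "about:blank",
        "title": "Databricks unavailable",
        "status": status,
        "kind": exc.kind,
        "detail": str(exc) or exc.__class__.__name__,
    }
```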
### Phase C acceptance gate
| # | Check |
|---|---|
| C1 | `make up` boots the stack; `docker exec api env \| grep DATABRICKS_` shows all 11 vars |
| C2 | `curl http://localhost:8000/v1/databricks/health` returns 200 with `rows_sampled = 5000` |
| C3 | `make databricks-health` works as a shorthand for C2 |
| C4 | `docker exec api pytest backend/tests/test_databricks_client.py -v -m databricks_live` passes |
| C5 | Negative path: bogus `DATABRICKS_HTTP_PATH` → 503 with `Content-Type: application/problem+json` and a structured body — no stack trace, no 500 |
## Lessons-learned compliance

- § Stale containers hide UI work — run `make up` after every `requirements.txt` / `settings.py` / compose / `main.py` edit before testing.
- § Watch the live env vars on `make up`, not just the file — gate before declaring health green: `docker exec api env | grep DATABRICKS_` shows all 11 vars.
- § Mocks must be opt-in, never silent fallback — no `MockDatabricks`. Connection failure = 503 + `application/problem+json`. Mirrors ADR-008.
- § Bake auxiliary code into the image — `backend/app/databricks/` lives under the already-COPY'd `app/`; `backend/scripts/databricks_mock_data/` lives under the already-COPY'd `scripts/`. No Dockerfile edits needed.
- § pydantic-settings parses empty env strings strictly for booleans — avoided by typing all new settings as `str`/`int`/`Literal`.
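The last point is the reason none of the 11 new settings is typed `bool`: pydantic-settings rejects `DATABRICKS_SOMETHING=""` at parse time for a bool field, whereas a `str` field accepts it and lets the app coerce lazily. A pydantic-free sketch of that coercion (the helper name and the `Literal` alias are illustrative):

```python
from typing import Literal

DatabricksAuthType = Literal["pat"]  # OAuth M2M deferred; dispatch shape stays open

def parse_flag(raw: str, default: bool = True) -> bool:
    """Lenient coercion for a bool-like env string: empty means 'use the
    default' rather than the parse error a strict bool field would raise."""
    value = raw.strip().lower()
    if value == "":
        return default
    return value in {"1", "true", "yes", "on"}
```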
## Out of scope (deferred)

| Item | Where it belongs |
|---|---|
| `seed.py` `source_query` mappings for the 3 demo metrics | Prompt 2 (next plan) — MD-3 lines 315-345 |
| `config/data_dictionary.yaml` hand-written for the 7 tables | Prompt 2 — MD-3 lines 307-313 |
| `config/metric_routing.yaml` per-metric routing | Prompt 2 — MD-3 lines 347-348 |
| `metrics_catalog` lineage column ALTERs | Prompt 2 |
| SQL Generator service + safety layer + sqlglot | Prompt 3 |
| `POST /v1/widgets/{id}/data` resolver + Redis cache + per-metric routing | Prompt 4 |
| Frontend `useWidgetData` + `SourceBadge` + `MetricInfoBadge` enhancements | Prompt 5 |
| Demo runbook + architecture diagram + whats-mocked doc | Prompt 6 |
| OAuth M2M / Service Principal auth wiring | Time-box deferred per prompts.md line 459; dispatch shape in place |
## Done when

- All 5 Phase B acceptance checks + all 5 Phase C acceptance checks pass.
- `prompt_1_databricks_client` and `acceptance_dryrun_2` are marked complete in `part-c-databricks-prototype.md`.
- This plan moves to `docs/plans/completed/databricks-mock-data-and-prompt-1.md` with `status: completed` and `completed_on: <date>`.