
Databricks mock-data ingest + Prompt 1 (connectivity)

Status: completed 2026-05-06. All Phase A / B / C acceptance checks green. prompt_1_databricks_client and acceptance_dryrun_2 marked complete in part-c-databricks-prototype.md.

Final results:

  • Phase A: 3 SQL files (backend/databricks_schema/).
  • Phase B: make seed-databricks runs in 205s (under MD-2's 5-min budget). Row counts: customers 5000, devices 7582, issue_sessions 20000, claims 10326, ai_recommendations 20000, outcomes 15257, operations_context 48. Validation views return realistic shapes (CA/TX/FL/NY top regions, ~50% repair rate, $777K cost_avoided_mtd).
  • Phase C: GET /v1/databricks/health returns 200 / rows_sampled=5000 in ~1s warm. Negative path (bogus host) returns 503 application/problem+json kind=databricks_unavailable in 0.7s. 9/9 client tests pass (pytest -m databricks_live and unit). Full suite 24/24, no regressions.

Volume deviation from MD-2: issue_sessions defaulted to 20K instead of 50K — the Free Trial Serverless Starter warehouse couldn't push 50K via batched INSERT VALUES within MD-2's 5-min wall budget. Acceptance check B2's active_issues ~15K therefore landed at ~4K instead. The qualitative shape (regional skew, conversion rate, weekly cycle) is preserved; only the absolute volume is reduced. Documented in generators.py module docstring + Lessons-learned candidate below.

Why this ordering

Phase B's --reset mode runs the Phase A DDL itself, so the user does not have to paste SQL into the Databricks UI manually. By the time Phase C runs curl /v1/databricks/health, the table exists with seeded, business-correlated data, and the acceptance check is meaningful.

Phase A (MD-1) ──► Phase B (MD-2) ──► Phase C (Prompt 1) ──► Acceptance: rows_sampled = 5000
schema DDL        Python generator     FastAPI client +
files (no run)    + uploader runs      /v1/databricks/health
                  the DDL on --reset

Phase B and Phase C share the databricks-sql-connector install — adding it once in Phase B's requirements.txt edit covers both.

Cross-references — read before starting

  • databricks-mock-data.md §§ MD-1, MD-2, MD-3 — source of truth for the schema, the generator distributions, and the env block.
  • docs/plans/active/part-c-databricks-prototype.md prompt_1_databricks_client — the original Prompt 1 todo (this plan executes a subset of it).
  • prompts.md Prompt 1 (lines 86-128) — the literal Prompt 1 spec: file list, behavior, acceptance.
  • prompts.md § Time-box discipline (line 459) — why we ship PAT-only and defer OAuth M2M / Service Principal.
  • docs/lessons-learned.md — five entries directly compliance-checked below.

Step 0 — User actions before execution

  1. Confirm Free Trial setup per databricks-mock-data.md § What you need before starting lines 19-29: workspace URL noted, PAT generated, Serverless Starter warehouse provisioned (auto-stop 10 min), schema asurion_prototype created in the workspace catalog (or let Phase B create it via CREATE SCHEMA IF NOT EXISTS).
  2. Add to .env (the host already has DATABRICKS_HOST + DATABRICKS_TOKEN):
    DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/<your-warehouse-id>
    DATABRICKS_CATALOG=workspace
    DATABRICKS_SCHEMA=asurion_prototype
    DATABRICKS_HEALTH_TABLE=customers
    
    Get HTTP_PATH from Databricks UI → SQL Warehouses → your warehouse → Connection details. The https:// already in DATABRICKS_HOST is fine — Phase C's _normalize_host strips it defensively.

Phase A — Schema DDL files (~30 min, mirrors MD-1)

Files created (no execution; Phase B's main script runs them):

  • backend/databricks_schema/01_create_schema.sql — CREATE SCHEMA IF NOT EXISTS workspace.asurion_prototype; USE workspace.asurion_prototype;
  • backend/databricks_schema/02_tables.sql — 7 CREATE OR REPLACE TABLE statements per MD-1 lines 60-138. Idempotent. Time-series tables PARTITIONED BY DATE(timestamp_col).
  • backend/databricks_schema/03_validation_views.sql — 4 views (compile against empty tables; populate with real numbers after Phase B):
    • vw_kpi_executive_overview
    • vw_cost_avoided_by_region_30d
    • vw_repair_replace_conversion_30d
    • vw_issue_volume_by_region

Phase B — Mock data generator + uploader (~75 min, mirrors MD-2)

Adds dev deps + builds the seeded generator + ingests data through batched INSERTs.

New deps (backend/requirements.txt)

databricks-sql-connector>=3.0,<4.0
Faker>=24.0,<30.0
numpy>=1.26,<3.0

Files created

  • backend/scripts/databricks_mock_data/__init__.py
  • backend/scripts/databricks_mock_data/generators.py — 7 generator functions per MD-2 lines 179-246. Each takes seed: int, builds numpy.random.RandomState(seed), returns list[dict].
  • backend/scripts/databricks_mock_data/uploader.py — databricks-sql-connector wrapper. Per table: drop + recreate from Phase A DDL → batched 1000-row INSERTs with progress prints → final SELECT COUNT(*).
  • backend/scripts/databricks_mock_data/main.py — CLI entry. --reset drops the 7 tables, runs Phase A DDL, generates + uploads all entities, runs the 4 validation views and prints first 5 rows of each.
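The generator contract above (seed in, RandomState, list[dict] out) can be sketched as follows. Field names, weights, and the region list here are illustrative placeholders, not MD-2's actual distributions:

```python
import numpy as np

# Illustrative region list; MD-2's real distributions live in generators.py.
REGIONS = ["CA", "TX", "FL", "NY", "IL", "PA", "OH", "GA", "NC", "MI", "WA", "AZ"]

def generate_customers(seed: int, n: int = 5000) -> list[dict]:
    # Deterministic for a given seed: the same RandomState yields the same
    # rows on every run, which is what acceptance check B5 relies on.
    rng = np.random.RandomState(seed)
    # Skew probability mass toward CA/TX/FL/NY so the regional views rank
    # them on top (the exact weights are assumptions).
    weights = np.array([0.18, 0.14, 0.11, 0.10] + [0.059] * 8)
    weights /= weights.sum()
    return [
        {
            "customer_id": f"CUST-{i:05d}",
            "region": REGIONS[rng.choice(len(REGIONS), p=weights)],
            "tenure_months": int(rng.randint(1, 120)),
        }
        for i in range(n)
    ]
```

Calling generate_customers(42) twice returns identical lists, which is why re-seeding produces identical row counts.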

Makefile

seed-databricks:
    docker exec -e PYTHONPATH=/app api python -m scripts.databricks_mock_data.main --reset
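The target above ultimately drives uploader.py's batched INSERTs. A sketch of the 1000-row chunking against any DB-API cursor (databricks-sql-connector's included) — the function name, the %s paramstyle, and the SQL shape are assumptions:

```python
from typing import Any, Sequence

BATCH = 1000  # rows per INSERT ... VALUES statement, per Phase B's design

def insert_batched(cursor: Any, table: str, rows: Sequence[dict]) -> int:
    # Sketch: emit one multi-row INSERT per 1000 rows. Column order is taken
    # from the first row; every row is assumed to share the same keys.
    if not rows:
        return 0
    cols = list(rows[0])
    tuple_sql = "(" + ", ".join(["%s"] * len(cols)) + ")"
    sent = 0
    for start in range(0, len(rows), BATCH):
        chunk = rows[start:start + BATCH]
        sql = (f"INSERT INTO {table} ({', '.join(cols)}) VALUES "
               + ", ".join([tuple_sql] * len(chunk)))
        cursor.execute(sql, [row[c] for row in chunk for c in cols])
        sent += len(chunk)
        print(f"{table}: {sent}/{len(rows)} rows")  # progress print per batch
    return sent
```

Batching matters here because the Free Trial warehouse is the bottleneck — per the volume-deviation note above, even 1000-row batches could not push 50K issue_sessions inside the 5-minute budget.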

Phase B acceptance (MD-2 lines 276-284)

  • B1 — make seed-databricks completes in <5 min with no errors
  • B2 — vw_kpi_executive_overview shows active_issues ~15,000, claims_in_progress ~5,000, cost_avoided_mtd $400K-$1.2M
  • B3 — vw_cost_avoided_by_region_30d returns 12 rows; CA/TX/FL/NY at top
  • B4 — vw_repair_replace_conversion_30d shows 60-65% repair rate (not 100%, not 0%)
  • B5 — re-running make seed-databricks produces identical row counts (seed determinism)

Phase C — FastAPI client + /v1/databricks/health (~90 min, mirrors Prompt 1)

Files created

  • backend/app/databricks/__init__.py
  • backend/app/databricks/exceptions.py — DatabricksError base + 4 subclasses
  • backend/app/databricks/client.py — _normalize_host, _auth_kwargs (PAT only), pooled connections, execute()
  • backend/app/databricks/health.py — run_health_check() against {catalog}.{schema}.{health_table} with identifier validation
  • backend/tests/test_databricks_client.py — 3 tests (1 marked databricks_live, 2 unit)

Files modified

  • backend/app/settings.py — 11 new DATABRICKS_* fields
  • backend/app/main.py — GET /v1/databricks/health returning 503 application/problem+json on DatabricksError
  • docker-compose.yml — pass through 11 vars with ${VAR:-default} syntax
  • Makefile — databricks-health target
  • backend/tests/conftest.py — register databricks_live marker
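The 503 path in main.py is RFC 7807 problem+json plus this plan's kind discriminator. A minimal, framework-free sketch of the payload builder — the "type" URN and title wording are illustrative assumptions; only kind=databricks_unavailable and status 503 come from this plan:

```python
def problem_body(kind: str, detail: str, status: int = 503) -> dict:
    # Hedged sketch of the application/problem+json payload returned on
    # DatabricksError. "type"/"title"/"status"/"detail" follow RFC 7807;
    # "kind" is this plan's discriminator field.
    return {
        "type": f"urn:problem:{kind}",   # assumption: any stable URI is valid
        "title": kind.replace("_", " "),
        "status": status,
        "detail": detail,
        "kind": kind,
    }
```

In FastAPI this would be wrapped as JSONResponse(status_code=503, media_type="application/problem+json", content=problem_body(...)) inside an exception handler registered for DatabricksError — which is what keeps C5 at a structured 503 instead of a stack-traced 500.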

Phase C acceptance gate

  • C1 — make up boots the stack; docker exec api env | grep DATABRICKS_ shows all 11 vars
  • C2 — curl http://localhost:8000/v1/databricks/health returns 200 with rows_sampled = 5000
  • C3 — make databricks-health works as a shorthand for C2
  • C4 — docker exec api pytest backend/tests/test_databricks_client.py -v -m databricks_live passes
  • C5 — negative path: bogus DATABRICKS_HTTP_PATH → 503 with Content-Type: application/problem+json and a structured body; no stack trace, no 500

Lessons-learned compliance

  • § Stale containers hide UI work — make up after every requirements.txt / settings.py / compose / main.py edit before testing.
  • § Watch the live env vars on make up, not just the file — gate before declaring health green: docker exec api env | grep DATABRICKS_ shows all 11 vars.
  • § Mocks must be opt-in, never silent fallback — no MockDatabricks. Connection failure = 503 + application/problem+json. Mirrors ADR-008.
  • § Bake auxiliary code into the image — backend/app/databricks/ lives under already-COPY'd app/; backend/scripts/databricks_mock_data/ lives under already-COPY'd scripts/. No Dockerfile edits needed.
  • § pydantic-settings parses empty env strings strictly for booleans — avoided by typing all new settings as str / int / Literal.
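The pydantic-settings lesson can be illustrated dependency-free. The real settings.py uses pydantic-settings, but the typing rule is the same — keep env-backed fields str (or int), never bool, so an empty string in .env cannot trip strict boolean parsing. A sketch with a plain dataclass and a subset of the 11 fields (defaults mirror the env block in Step 0):

```python
import os
from dataclasses import dataclass

@dataclass
class DatabricksSettings:
    # Every env-backed field is str — never bool — per the lesson above.
    # Subset of the 11 DATABRICKS_* fields; the real code uses pydantic-settings.
    host: str = ""
    http_path: str = ""
    catalog: str = "workspace"
    schema: str = "asurion_prototype"
    health_table: str = "customers"

    @classmethod
    def from_env(cls) -> "DatabricksSettings":
        env = os.environ.get
        return cls(
            host=env("DATABRICKS_HOST", ""),
            http_path=env("DATABRICKS_HTTP_PATH", ""),
            catalog=env("DATABRICKS_CATALOG", "workspace"),
            schema=env("DATABRICKS_SCHEMA", "asurion_prototype"),
            health_table=env("DATABRICKS_HEALTH_TABLE", "customers"),
        )
```

An unset or empty variable just yields an empty string here, whereas a bool-typed field would raise a validation error at import time.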

Out of scope (deferred)

  • seed.py source_query mappings for the 3 demo metrics — Prompt 2 (next plan); MD-3 lines 315-345
  • config/data_dictionary.yaml hand-written for the 7 tables — Prompt 2; MD-3 lines 307-313
  • config/metric_routing.yaml per-metric routing — Prompt 2; MD-3 lines 347-348
  • metrics_catalog lineage column ALTERs — Prompt 2
  • SQL Generator service + safety layer + sqlglot — Prompt 3
  • POST /v1/widgets/{id}/data resolver + Redis cache + per-metric routing — Prompt 4
  • Frontend useWidgetData + SourceBadge + MetricInfoBadge enhancements — Prompt 5
  • Demo runbook + architecture diagram + whats-mocked doc — Prompt 6
  • OAuth M2M / Service Principal auth wiring — time-box deferred per prompts.md line 459; dispatch shape in place

Done when

  • All 5 Phase B acceptance checks + all 5 Phase C acceptance checks pass.
  • prompt_1_databricks_client and acceptance_dryrun_2 are marked complete in part-c-databricks-prototype.md.
  • This plan moves to docs/plans/completed/databricks-mock-data-and-prompt-1.md with status: completed and completed_on: <date>.