Lessons Learned¶
Append-only. Each entry: Context / Rule / Why / How to apply.
Read this file at session start before changing infra, the Clarifier graph, or Bedrock wiring. Every entry below cost real time on a 2-day clock.
Table of contents¶
- Stale containers hide UI work
- Bedrock model access is per-feature, not per-account
- pydantic-settings parses empty env strings strictly for booleans
- LangGraph state inheritance — subclass, don't redefine channels
- Bake auxiliary code into the image, don't docker cp
- Mount host ~/.aws into the api container
- Stale shell-side AWS creds poison make up — scrub in the recipe
- Static-check regexes — anchor to actual literal shape
- The Clarifier topology is reusable; the schema is not
- Prompt engineering still matters with structured output
- Promote eval schemas, don't import from scripts/
- Bedrock tool-use rejects top-level oneOf schemas
- Haiku 4.5 silently drops deeply-nested fields — split into stages
- Don't ask users for type signatures; ask for examples
- HITL answers must be lifted into intent before synthesis
- Naive \{[^}]*\} regex breaks on nested TS types
- Watch the live env vars on make up, not just the file
- Mocks must be opt-in, never silent fallback
- Edit-mode gates destructive controls; view mode hides them
- Don't double-stack h-72 on parent and child cards
- Vitest fake timers break RTL waitFor; use real timers + sleep
- Keyboard-sensor reorder needs real layout; assert activation, not movement
- KPI delta color must derive from per-metric direction, not delta sign
- Empty-state copy must not contradict an adjacent populated tile
- Decorative <button type="button"> clothing makes accessibility audits fail
- Naive ; split breaks any SQL with ; in a string literal
- Databricks SQL connector defaults retry for 15 minutes — override for interactive endpoints
- Bake DDL into the image alongside scripts that consume it
- Free Trial Serverless Starter caps batched-INSERT throughput; right-size mock volumes
- Dual-DDL source of truth: db/init.sql AND _TABLE_DDL keep metrics_catalog honest
- MetricEntity Literal entrenchment — keep entity bare, schema lives on source_schema
- Old custom-widget custom_widget_placeholder orphan caught by routing fail-loud
- make demo-reset chicken-and-egg when boot validator blocks startup
- make docs-validate — pick image + ruleset that match the prototype's reality
- Resolver-side imports of app.sql_gen.generator MUST be lazy
- Pydantic-typed test fields don't dict-compare; normalize with model_dump()
- Reasoning-tier latency busts the SQL-gen budget — probe before committing
- Reroute moments need a pre-existing column on the destination side
- Dual source of truth: metrics_catalog.source_query AND the Postgres allowlist
- seed_if_empty is INSERT-only — TRUNCATE before re-seeding catalog edits
- Clarifier and SQL generator need a shared column naming contract
Stale containers hide UI work¶
Context: frontend/src/App.tsx, frontend/src/components/Header.tsx, docker-compose.yml
Rule: After any source change — backend OR frontend — re-run make up (which does docker compose up --build -d) before testing the UI. The --build flag is incremental, so warm rebuilds take ~5s. Never trust the running stack after editing source without rebuilding.
Why: A user reported "Add Widget does nothing." The App.tsx and Header.tsx already wired WidgetBuilderModal correctly, but the running web container had been built from an earlier source tree. docker compose up -d (without --build) is a no-op when images exist, so the stale image kept serving old JS. We wasted time inspecting the frontend code that was already correct.
How to apply:
- Use make up exclusively. Do not docker compose up -d by hand.
- If a UI change isn't visible after make up, hard-refresh the browser (Cmd-Shift-R). If still missing, make down && make up.
- For backend, the same applies — adding a new Python file is invisible until make up rebuilds.
Bedrock model access is per-feature, not per-account¶
Context: backend/app/widgets/llm.py, backend/scripts/clarifier_eval/bedrock_client.py
Rule: Having bedrock:InvokeModel access to an inference profile does NOT imply tool-use (tool_use content blocks) access. Probe the specific feature path before assuming. Tool-use access also drifts over time as models age into LEGACY or move behind a marketplace gate — re-probe whenever a model id changes.
Why: PRD §10.5 originally specified us.anthropic.claude-sonnet-4-20250514-v1:0 (Sonnet 4). Plain text invocations succeeded against that ID in the hackathon AWS account. The first Clarifier run, which uses Anthropic's tool-use protocol for structured output, returned AccessDeniedException. We had to probe per-model to find one that supported tool-use; us.anthropic.claude-haiku-4-5-20251001-v1:0 (Haiku 4.5) worked.
A re-probe on 2026-05-06 against hackathon-async showed the access surface had moved again:
| Model | Tool-use status (2026-05-06) |
|---|---|
| us.anthropic.claude-sonnet-4-20250514-v1:0 | LEGACY — AccessDeniedException |
| us.anthropic.claude-sonnet-4-5-20250929-v1:0 | AWS Marketplace subscription not enabled |
| us.anthropic.claude-sonnet-4-6 | Works ✓ (now the fast tier default per ADR-011) |
| us.anthropic.claude-opus-4-5-20251101-v1:0 | Works ✓ |
| us.anthropic.claude-opus-4-7 | Works ✓ (now the reasoning tier default per ADR-011) |
| us.anthropic.claude-haiku-4-5-20251001-v1:0 | Works (drops nested fields — see entry below) |
How to apply:
- Treat tool-use access as feature-scoped and time-scoped. After every model upgrade or account migration, re-probe — don't assume yesterday's allow-list still applies.
- Per ADR-011, expose two env vars (BEDROCK_MODEL_ID for the fast tier and BEDROCK_REASONING_MODEL_ID for reasoning) at every layer (docker-compose.yml, Makefile, settings.py, .env). Pin both in .env so a fresh-shell make up doesn't silently revert to a default that's gone LEGACY.
- For features that require tool-use, document the working tier defaults in README.md and CLAUDE.md and call out which call site uses which tier.
- Keep the MockLlm fallback intact (ADR-002 / ADR-008) — when Bedrock errors for any reason, the demo must still run via BUILDER_MODE=offline.
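A minimal probe sketch, assuming boto3's bedrock-runtime Converse API; the model id is the Haiku example from this entry and the throwaway echo tool exists only to exercise the tool-use path.

```python
# Hedged probe sketch. The "echo" tool and its empty schema are placeholders;
# only the tool-use call path matters, not the response content.
import boto3
from botocore.exceptions import ClientError

def probe_tool_use(model_id: str, region: str = "us-east-1") -> bool:
    """Return True when the model accepts a Converse call carrying toolConfig."""
    client = boto3.client("bedrock-runtime", region_name=region)
    try:
        client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": "ping"}]}],
            toolConfig={"tools": [{"toolSpec": {
                "name": "echo",
                "description": "throwaway probe tool",
                "inputSchema": {"json": {"type": "object", "properties": {}}},
            }}]},
        )
        return True
    except ClientError as exc:
        # AccessDeniedException / marketplace errors show up here even when a
        # plain-text invocation against the same model id succeeds.
        print(f"{model_id}: {exc.response['Error']['Code']}")
        return False

if __name__ == "__main__":
    probe_tool_use("us.anthropic.claude-haiku-4-5-20251001-v1:0")
```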
pydantic-settings parses empty env strings strictly for booleans¶
Context: backend/app/settings.py, docker-compose.yml
Rule: Don't pass USE_BEDROCK: ${USE_BEDROCK} (which becomes empty string when unset) into a Pydantic BaseSettings field typed bool. Provide a default: USE_BEDROCK: ${USE_BEDROCK:-false}.
Why: First boot crashed with pydantic_core._pydantic_core.ValidationError: ... use_bedrock: Input should be a valid boolean, unable to interpret input [type=bool_parsing, input_value='']. Pydantic-settings does not coerce "" to False for bool fields — it raises. The shell happily expands an unset variable to empty string in compose files. The fix is shell-level defaulting, not Pydantic-level coercion.
How to apply:
- Every env var passed through docker-compose.yml to a typed Pydantic field needs ${VAR:-default} syntax.
- For optional credentials (AWS_*), empty-string default is fine because boto3 ignores empty AWS_* and the field type is str | None. The bool case is the trap.
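A minimal sketch of the trap, assuming a Settings class shaped like backend/app/settings.py; only the two field kinds from this entry are shown.

```python
# Sketch only: the real Settings class has more fields and config.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    use_bedrock: bool = False        # typed bool: pydantic-settings raises on "", it never coerces
    aws_profile: str | None = None   # optional str: an empty string is harmless, boto3 ignores it

settings = Settings()

# docker-compose.yml must default at the shell level:
#   USE_BEDROCK: ${USE_BEDROCK:-false}   # OK: unset expands to "false"
#   USE_BEDROCK: ${USE_BEDROCK}          # crash: unset expands to "" -> bool_parsing ValidationError
```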
LangGraph state inheritance — subclass, don't redefine channels¶
Context: backend/scripts/clarifier_eval/graph.py, backend/app/widgets/state.py
Rule: When extending a production TypedDict LangGraph state for a new graph, subclass it (class EvalState(WidgetClarifierState, total=False): ... and add only NEW fields). Do NOT redefine inherited Annotated channels with your own reducer.
Why: First eval graph compile crashed with ValueError: Channel 'human_responses' already exists with a different type. The eval state had redeclared human_responses: Annotated[list, append] to "be safe," but WidgetClarifierState already declares it. LangGraph requires every node in the compiled graph to agree on a channel's annotated type — including the reducer — so two declarations with different Annotated instances are a compile-time conflict, even if the reducers are semantically identical.
How to apply:
- Subclass the production state, add only fields that don't exist upstream.
- If you genuinely need a different reducer for an inherited channel, you need a different channel name, not the same name with a different reducer.
- This applies any time you reuse production nodes in a new graph — the state shape they read/write must match exactly.
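A minimal sketch of the subclassing rule with illustrative fields; the real WidgetClarifierState in backend/app/widgets/state.py declares more channels.

```python
# Sketch: only the inheritance shape matters here, not the exact field list.
import operator
from typing import Annotated
from typing_extensions import TypedDict

class WidgetClarifierState(TypedDict, total=False):
    # The channel (including its reducer) is declared exactly once, upstream.
    human_responses: Annotated[list, operator.add]
    intent: dict

class EvalState(WidgetClarifierState, total=False):
    # Correct: add only NEW fields. Redeclaring human_responses here, even with
    # a semantically identical reducer, makes graph compilation fail with
    # "Channel 'human_responses' already exists with a different type".
    component_spec: dict
```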
Bake auxiliary code into the image, don't docker cp¶
Context: backend/Dockerfile, Makefile
Rule: If new code under backend/ needs to run inside the api container, add it to the Dockerfile (COPY scripts ./scripts). Do not rely on docker cp after-the-fact.
Why: The first eval-harness run depended on docker cp .../scripts api:/app/scripts because the Dockerfile only copied app/. After the next make up (which rebuilt the image), the scripts vanished and the run failed with ModuleNotFoundError: No module named 'scripts'. Anything not baked in is invisible after the next rebuild.
How to apply:
- Update backend/Dockerfile whenever you add a top-level dir under backend/ that the container needs.
- Keep import-time concerns separated: the eval harness imports production app.widgets modules but is never imported by app/main.py, so baking it in adds zero runtime cost.
Mount host ~/.aws into the api container¶
Context: docker-compose.yml
Rule: Mount ${HOME}/.aws:/root/.aws:ro into any container that needs AWS access. Pass AWS_PROFILE and the AWS_* credential vars through env. Do not bake credentials into the image.
Why: SSO-based and short-lived session credentials are the norm. Without the mount, AWS_PROFILE=foo resolves on the host but blank inside the container, forcing a per-credential-rotation docker exec dance. With the mount, the same aws sso login on the host works inside the container immediately.
How to apply:
- The compose file already does this — keep it. The mount is a no-op when ~/.aws doesn't exist (Docker creates an empty dir).
- For env-var-only credentials (CI), the same compose block works because the env vars are passed through and boto3 prefers env over profile.
Stale shell-side AWS creds poison make up — scrub in the recipe¶
Context: docker-compose.yml, Makefile
Rule: Every Makefile target that calls docker compose up must scrub shell-side AWS vars (unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_PROFILE) and source .env before invoking compose. This makes .env the single source of truth, immune to whatever the caller's shell exports.
Why: Docker Compose env precedence is shell env > .env. If your shell has stale static creds (from a prior aws-export-creds, aws sts assume-role, or an old SSO export) or an unrelated AWS_PROFILE (e.g. asurion-mobility-ac-nonprod.dev instead of hackathon-async), compose interpolates those values into the api container — not the values pinned in .env. boto3 then sees explicit AWS_SESSION_TOKEN and skips the ~/.aws/sso/cache refresh path entirely. Symptom: every Bedrock call inside the widget builder returns Bedrock invoke failed: ExpiredTokenException even though aws sts get-caller-identity works fine on the host.
Diagnose: docker exec 2026-hackathon-api-1 env | grep ^AWS_. If AWS_PROFILE is anything other than the value in .env, or AWS_SESSION_TOKEN is non-empty, that's the bug.
How to apply:
- make up and make up-offline already scrub shell AWS vars in the recipe. No manual unset needed.
- If the SSO token itself has expired (>1h since last login), run make refresh-aws — it does aws sso login + container recreate in one step.
- Do not paste static credentials into .env to "fix" this — short-lived role creds expire in ~1h and ~/.aws/sso/cache is the only path that refreshes itself.
Static-check regexes — anchor to actual literal shape¶
Context: backend/scripts/clarifier_eval/validators/static_checks.py
Rule: When writing deterministic checks against generated code, run them against a known-good baseline first. Regex like \b(critical|high|medium|low)\b will false-positive on object keys ({ critical: 'rose' }); a Tailwind color-family check naïvely matched against \b\w+-\d+\b flags text-sm and divide-y as unknown color families.
Why: First eval run reported failures on review-quality TSX. Both checks were wrong, not the generated code. We almost iterated on the prompt to "fix" output that was already correct.
How to apply:
- Write the validator. Then hand-craft a passing fixture and run it. If it fails, fix the validator.
- Then hand-craft a known-bad fixture. If it passes, fix the validator.
- Only THEN trust the validator's verdict on real LLM output.
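A sketch of the fixture-first discipline; the rule and both fixtures below are illustrative stand-ins, not the real checks in validators/static_checks.py.

```python
import re

# First cut of a stand-in rule: "no bare severity words in generated TSX".
severity_rule = re.compile(r"\b(critical|high|medium|low)\b")

known_good = "const SEVERITY_COLOR = { critical: 'rose', high: 'amber' };"  # review-quality TSX
known_bad = '<span className="text-rose-500">critical outage</span>'        # should be flagged

def violations(tsx: str) -> list[str]:
    return [m.group(0) for m in severity_rule.finditer(tsx)]

# Step 1: the known-good fixture must come back clean. It doesn't (the regex
# matches the object key), so the validator gets fixed before judging LLM output.
print("good fixture:", violations(known_good))
# Step 2: the known-bad fixture must be flagged. Only when both hold does the
# verdict on real generated code mean anything.
print("bad fixture:", violations(known_bad))
```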
The Clarifier topology is reusable; the schema is not¶
Context: backend/app/widgets/graph.py, backend/scripts/clarifier_eval/graph.py
Rule: When prototyping a new LLM-driven feature that has the shape "load context → extract intent → detect gaps → ask human → synthesize → critic," reuse the front half of the Clarifier (contextLoader, intentExtractor, gapDetector, questionPrioritizer) by importing the nodes. Replace only specSynthesizer and critic with feature-specific equivalents.
Why: The codegen-path eval was a swap of two nodes (and a state-extension) on top of the production graph. The whole HITL machinery, prompt loading, and gap detection just worked. The discriminated-union output was the only piece coupled to widgets — and we replaced it cleanly with ComponentSpec.
How to apply:
- Always import production nodes; never copy-paste their bodies into a new file.
- Keep eval/experimental code under backend/scripts/clarifier_eval/ (or a sibling) to avoid contaminating the production import graph.
- If you find yourself needing to fork a production node, that's a signal to refactor it to take an injected dependency, not to copy it.
Prompt engineering still matters with structured output¶
Context: backend/scripts/clarifier_eval/prompts/component_synthesizer.md
Rule: Tool-use / structured output guarantees the shape of the response, not the content. Encode hard rules (import paths, accessibility constraints, allowed Tailwind families, forbidden patterns) explicitly in the prompt — the schema alone won't enforce them.
Why: First run produced TSX that imported Alert from './types' (a path that doesn't exist) and used <div onClick> instead of <button>. Both were caught by the static checks, but the schema was satisfied. Adding two explicit clauses to the prompt — import { Alert } from '../api'; and use <button type="button"> for non-navigational clickable elements — fixed both on the next run.
How to apply:
- Write rules in the prompt as imperative bullets, not as descriptions.
- For each static check that fails, add a corresponding "do this" / "don't do this" line to the prompt. The pair of (static check + explicit prompt rule) is more reliable than either alone.
- Re-run the harness after every prompt change. Artifacts under artifacts/clarifier-eval/<run-id>/ make diffs cheap.
Promote eval schemas, don't import from scripts/¶
Context: backend/app/widgets/schemas.py, backend/scripts/clarifier_eval/schemas.py
Rule: Production code (backend/app/) MUST NOT import from backend/scripts/. When an eval-harness type graduates to production behavior, move the type into app/ and have the eval harness re-import it from there.
Why: ComponentSpec started life under scripts/clarifier_eval/ because the eval harness was the only consumer. When ADR-006 added the custom widget variant, two paths existed: (a) re-export from scripts/ into app/, or (b) lift the type into app/ and let scripts/ import from app/. Option (a) inverts the dependency direction in a way the Dockerfile does not promise to honor — scripts/ is baked into the image specifically for scripts, not as a peer of app/. Option (b) makes the production graph self-contained; the harness becomes a strict consumer.
How to apply:
- When a schema, prompt, or validator under scripts/clarifier_eval/ becomes part of production behavior, move the file into app/widgets/ (or wherever it belongs in app/) and replace the original with a one-line re-export from the new location.
- After the move, run rg "from scripts" backend/app/ — it must return zero hits. (The reverse is fine: from app.widgets... import inside scripts/ is the supported direction.)
- The static checks ported into app/widgets/validators.py are a port (not an import) of the eval-harness validators/static_checks.py because the production contract differs (no imports allowed, vs the harness's allowlist). Keep both copies — they serve different gates.
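A sketch of the shim left behind after promotion, assuming ComponentSpec is the only name the harness still needs from its old location.

```python
# backend/scripts/clarifier_eval/schemas.py after the move (sketch)
from app.widgets.schemas import ComponentSpec  # scripts/ importing from app/ is the supported direction

__all__ = ["ComponentSpec"]
```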
Bedrock tool-use rejects top-level oneOf schemas¶
Context: backend/app/widgets/nodes/spec_synthesizer.py, backend/app/widgets/schemas.py
Rule: Anthropic's tool-use API (bedrock-runtime Converse and InvokeModel for Claude) requires the tools[].input_schema to be a flat JSON object with type: "object" at the root. Pydantic's discriminated-union schema ({"oneOf": [...], "discriminator": {...}}) lacks type at the root and is silently rejected with ValidationException: tools.0.custom.input_schema.type: Field required. Don't pass a discriminated WidgetSpec schema to Bedrock — dispatch on intent.type and pass the per-variant schema (KpiSpec, ChartSpec, TableSpec, CustomSpec).
Why: Phase 0 of the metric-aware Clarifier work ran a real-Bedrock baseline and found every data-path call was silently falling back to MockLlm because the synthesizer used WidgetSpec.model_json_schema() directly. The error was logged at debug level and masked by the existing MockLlm fallback (ADR-002), so unit tests passed and CI passed. We only caught it by reading the raw fallback trace. The WidgetSpec discriminated union is still the source of truth for the database and frontend; we flatten only at the LLM boundary.
How to apply:
- For any new LLM call that needs a discriminated output, dispatch on the discriminator before the call and pass the variant's flat schema.
- Bedrock fallbacks must log loudly (warning level minimum) — if the demo silently always-falls-back, the entire Bedrock code path is dead code.
- Add a real-credential smoke test (e.g., /tmp/e2e-persist.py or an entry in scripts/verify-acceptance.sh) for any Bedrock-routed feature; mock-only tests can't catch this class of regression.
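A minimal dispatch sketch. KpiSpec and ChartSpec here are trimmed stand-ins for the real variants in app/widgets/schemas.py; the point is only that the schema handed to Bedrock has type: "object" at its root.

```python
from pydantic import BaseModel

class KpiSpec(BaseModel):      # trimmed stand-in
    metric_id: str
    value_format: str = "percent"

class ChartSpec(BaseModel):    # trimmed stand-in
    metric_id: str
    chart_kind: str = "line"

_VARIANTS: dict[str, type[BaseModel]] = {"kpi": KpiSpec, "chart": ChartSpec}

def tool_input_schema(intent: dict) -> dict:
    """Pick the per-variant flat schema; never hand Bedrock a top-level oneOf."""
    schema = _VARIANTS[intent["type"]].model_json_schema()
    # A discriminated-union schema would carry oneOf/discriminator instead of a
    # root "type", and Bedrock rejects it with input_schema.type: Field required.
    assert schema.get("type") == "object"
    return schema
```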
Haiku 4.5 silently drops deeply-nested fields — split into stages¶
Context: backend/app/widgets/nodes/spec_synthesizer.py (_custom_synth), backend/app/widgets/prompts/component_synthesizer.md
Rule: When asking Bedrock Haiku 4.5 to emit a payload with deeply-nested required fields (e.g., CustomSpec → component → tsx_source + imports_used + tailwind_classes_used + sibling mock_data + data_intent), expect the model to drop one or more nested branches even when each is required in the JSON schema. The fix is not more prompt engineering — it's splitting the call into stages where each stage emits a flat object, then assembling the envelope deterministically in Python. For custom widgets: stage 1 asks the LLM only for a flat ComponentSpec (TSX + metadata); stage 2 in Python wraps it with the resolved metric block, derived data_intent, and derived mock_data.
Why: The first three iterations of _custom_synth against real Bedrock returned valid ComponentSpecs but missing or partial mock_data/data_intent/metric, failing Pydantic validation downstream. Re-prompting with stronger language ("you MUST include all of these fields") didn't help — Haiku 4.5 just truncated different fields. Splitting into a flat-schema stage + Python assembly succeeded on the first attempt and is now deterministic. Code-generation also needs max_tokens=4096 (default 1024 truncates React components mid-function).
Update 2026-05-06 (per ADR-011): the reasoning-heavy call sites — _data_synth, _custom_synth, and app.sql_gen.generator.generate_sql — now route to the reasoning tier (get_llm("reasoning"), default claude-opus-4-7). Opus 4.7 does not exhibit the dropped-nested-field failure mode in our probes, so the rule mostly converts to a defensive guarantee: keep the staged shape and the Python-side composition because they are cheap and they survive any future tier downgrade (e.g. quota throttle on Opus). Do NOT delete the staging just because the current tier seems robust — the lesson is "design the call surface so it doesn't depend on the model behaving," not "this specific model is reliable."
How to apply:
- For LLM payloads with > ~2 levels of nesting, split into stages. Let the LLM produce flat objects; let Python compose the envelope.
- Anything you can derive deterministically (data_intent from a MetricDefinition, mock_data from intent.custom_examples) — derive it. Don't make the LLM regenerate it.
- For code generation specifically, set max_tokens=4096 (or higher) on the Bedrock call. Default token budgets are tuned for chat, not for source files.
- When choosing a tier for a new call site, re-read this entry. If the schema has nested required fields, default to the reasoning tier and only step down to fast after a probe confirms field-completeness across ≥10 representative inputs.
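A sketch of the staged shape under simplifying assumptions: the stage-1 caller and the derivation helpers are reduced to inline stand-ins for what _custom_synth actually does.

```python
from typing import Any, Callable

def synthesize_custom_widget(
    call_llm_flat: Callable[[dict], dict],  # stage 1: wraps the Bedrock call with the flat
    intent: dict,                           # ComponentSpec schema and max_tokens=4096
    metric: dict,
) -> dict[str, Any]:
    # Stage 1: the LLM emits only the flat object (TSX + metadata).
    component = call_llm_flat(intent)
    # Stage 2: Python composes the envelope deterministically; nothing nested
    # is left for the model to remember to include.
    return {
        "type": "custom",
        "component": component,
        "metric": metric,                                   # resolved catalog row, not LLM output
        "data_intent": {"metric_id": metric.get("metric_id"),
                        "entity": metric.get("entity")},    # derived, never regenerated
        "mock_data": intent.get("custom_examples") or [],   # from the user's example rows
    }
```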
Don't ask users for type signatures; ask for examples¶
Context: backend/app/widgets/prompts/intent_extractor.md, backend/app/widgets/nodes/question_prioritizer.py
Rule: When a Clarifier needs to learn the shape of user data, ask for example rows (custom_examples: list[dict]), not for a TypeScript interface. Infer the type from the examples server-side. The end user is not necessarily a developer; even when they are, examples carry richer signal (realistic units, value ranges, plausible nullability) than a hand-written interface.
Why: ADR-006's first cut of the custom widget path asked the user to paste a TypeScript interface for Props. Research into Holistics, ThoughtSpot, and Lightdash showed every modern AI BI tool converges on examples-first / catalog-first; nobody asks for a type signature. The TS-paste UX leaked the implementation language and made the modal feel like a developer console rather than a product surface. The replacement (a free-text JSON examples field, plus a single-select catalog of known metrics with definition tooltips, plus a "Define a new metric" plain-English escape hatch) tested cleanly against Bedrock without any prompt changes — the LLM happily infers Props from examples.
How to apply:
- For any "describe your data" question, ask for 1-3 example rows in plain JSON. Reject "paste a type / interface / schema" as a UX shortcut.
- For any "pick a metric / dimension / entity" question, render the live catalog as a single-select with hint populated from each row's definition. Always include a "Define new …" escape hatch that gathers the catalog row's fields as plain-English sub-questions.
- Surface the resolved definition in the UI immediately. A widget that doesn't say what it measures is a widget the user can't trust.
HITL answers must be lifted into intent before synthesis¶
Context: backend/app/widgets/nodes/spec_synthesizer.py (_lift_answers_into_intent), backend/app/widgets/nodes/gap_detector.py
Rule: The gap detector treats a question as satisfied as soon as the user answers it, but the answer lives only in state.human_responses. It never gets back into state.intent. Any synthesizer code that reads intent.<field> (the LLM prompt, deterministic fallbacks like _derive_mock_data, _derive_data_intent) will see the field as missing unless you explicitly lift the answer first. Run _lift_answers_into_intent(state, intent) at the start of every synth path.
Why: A real-Bedrock end-to-end run of the custom-widget path looked correct in the logs (the LLM was called, returned a valid ComponentSpec with alerts: Array<{...}> props), but the persisted widget rendered its empty state. The user pasted three example alerts as a JSON string in the custom_examples answer, but _derive_mock_data read intent.get("custom_examples") and got None — so it filled the array prop with []. The answer was sitting in state.human_responses the whole time, unparsed. Same trap waits for time_window, accent, layout, value_format, chart_kind, columns, dimensions — anything we ask via HITL but don't guarantee the LLM intent extractor populated.
How to apply:
- At the top of _data_synth and _custom_synth, do intent = _lift_answers_into_intent(state, state.get("intent") or {}) before reading any field.
- For JSON-typed answers (custom_examples), parse with json.loads; warn-and-skip on JSONDecodeError. For comma-separated text (columns, dimensions), split on commas. For single_select / single_text fields, copy through.
- Don't mutate state["intent"] in place — return a new dict. The synthesizer relies on functional state for replay.
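A sketch of the lift step, assuming each human_responses entry carries a field name and raw value; the real helper covers more field kinds.

```python
import json
import logging

log = logging.getLogger(__name__)

def lift_answers_into_intent(state: dict, intent: dict) -> dict:
    lifted = dict(intent)  # return a new dict; never mutate state["intent"] in place
    for answer in state.get("human_responses", []):
        field, value = answer.get("field"), answer.get("value")
        if not field or lifted.get(field):
            continue  # already populated by the intent extractor
        if field == "custom_examples":
            try:
                lifted[field] = json.loads(value)       # JSON-typed answer
            except json.JSONDecodeError:
                log.warning("unparseable custom_examples answer; skipping")
        elif field in ("columns", "dimensions"):
            lifted[field] = [c.strip() for c in value.split(",") if c.strip()]
        else:
            lifted[field] = value                       # single_select / single_text copy-through
    return lifted
```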
Naive non-greedy brace regex breaks on nested TS types¶
Context: backend/app/widgets/nodes/spec_synthesizer.py (_extract_props_body, _props_from_interface)
Rule: When parsing a generated TypeScript interface Props { ... }, never use \{[^}]*\} to grab the body. The [^}]* class stops at the first closing brace, which on any nested-object/array prop (items: Array<{ id: string; ... }>) closes on the inner brace and silently includes only a fragment. Walk braces with a depth counter (or use a real TS parser) and treat <...> the same way for generic args.
Why: The custom-widget path derives mock_data keys from the LLM-emitted props_interface so the generated component's destructured props line up with the runtime data. With the naive regex, an interface like interface Props { items: Array<{ id: string; title: string; ... }> } parsed as if id, title, severity, description, minutes_ago were top-level props — leaking nested fields and stamping the wrong shape onto mock_data ({items: [], title: "", severity: "", description: "", minutes_ago: 0} instead of {items: [...]}). The CustomWidgetRenderer then destructured items as [] and rendered the empty state. The fix: a tiny brace-balanced scan for the outer {...}, plus depth-aware tokenization for ; separators so unions like 'a' | 'b' | 'c' and inline object types stay grouped.
How to apply:
- For any "extract the body of { … }" task in unconstrained text, balance braces explicitly. Never assume the body is brace-free.
- When tokenizing field separators (;, ,) inside a balanced body, also track < > and inner { } depth so nested generics / objects don't split mid-type.
- Keep the parser best-effort: when parsing fails, fall back to the historical {"items": [...]} mock_data shape. Don't crash widget synthesis on a malformed interface; the LLM will repair on the next iteration.
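A minimal brace-balanced scan; the real _extract_props_body also tracks <...> depth when splitting fields on ;.

```python
def extract_braced_body(text: str, start: int = 0) -> str | None:
    """Return the contents of the first balanced {...} at or after `start`."""
    open_idx = text.find("{", start)
    if open_idx == -1:
        return None
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1 : i]
    return None  # unbalanced: caller falls back to the historical mock_data shape

src = "interface Props { items: Array<{ id: string; title: string }> }"
assert extract_braced_body(src).strip() == "items: Array<{ id: string; title: string }>"
```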
Watch the live env vars on make up, not just the file¶
Context: docker-compose.yml, Makefile
Rule: docker compose up -d --build interpolates ${VAR} values from the environment of the shell that invokes it, not from the previous run. A subsequent docker compose up -d --build api (without re-exporting USE_BEDROCK, AWS_PROFILE, BEDROCK_MODEL_ID) silently reverts to the compose file's ${VAR:-default} fallbacks. The container starts, your BEDROCK_MODEL_ID quietly resets to claude-sonnet-4, and every Bedrock call falls back to MockLlm — there is no error.
Why: While debugging the custom-widget path I rebuilt the api container with docker compose up -d --build api to pick up a code change. The first invocation that day had USE_BEDROCK=true AWS_PROFILE=alphabuilders BEDROCK_MODEL_ID=us.anthropic.claude-haiku-4-5-20251001-v1:0 exported. The follow-up rebuild dropped them, and the running container came back up with USE_BEDROCK=false. The UI still produced widgets (because MockLlm always succeeds), and the only signal was the persisted spec carrying the MockLlm placeholder display name and assumption strings. Ten+ minutes lost chasing a phantom regression.
How to apply:
- After every make up / docker compose up, run docker exec <api> env | grep -E 'USE_BEDROCK|AWS_PROFILE|BEDROCK_MODEL_ID' and confirm the values match what you intend.
- Prefer make up (which sources from a known environment) over ad-hoc docker compose up -d --build api mid-debugging.
- When a generated widget says display_name: GeneratedMockCustomCard or its assumptions mention "Offline MockLlm", that's MockLlm — never assume Bedrock ran successfully without checking.
Mocks must be opt-in, never silent fallback¶
Context: backend/app/settings.py (builder_mode, resolved_builder_mode), backend/app/widgets/llm.py (get_llm, BuilderModeError), backend/app/widgets/nodes/spec_synthesizer.py, backend/app/widgets/nodes/intent_extractor.py, backend/app/widgets/runner.py (_classify_exc), frontend/src/widgets/useWidgetClarifier.ts, frontend/src/widgets/WidgetBuilderModal.tsx, frontend/src/components/Header.tsx, ADR-008.
Rule: A mock LLM exists only when BUILDER_MODE=offline is explicitly set. In live mode, every Clarifier path MUST reach Bedrock; if Bedrock init or invoke fails, raise BuilderModeError and let the runner surface a structured error SSE event (kind: "builder_unavailable") so the modal renders a config-style banner. Never try: bedrock except: mock. Never default USE_BEDROCK to false. The header always carries an OFFLINE MODE pill when resolved_builder_mode() == "offline" so the operator can never confuse the two paths.
Why: During the metric-aware Clarifier work, MockLlm silently masked six distinct real failures back-to-back: tool-use schema rejection, deeply-nested-field truncation, AWS SSO read-only cache, wrong AWS profile name, env-var drift across docker compose up calls, and a model id that lacked tool-use access. Each time the persisted spec carried display_name: GeneratedMockCustomCard and assumption strings tagged "Offline MockLlm — deterministic placeholder", but the user (and sometimes the agent) saw only "the widget was built" and moved on. Hours were lost diagnosing phantom regressions when the actual cause was config drift. The lesson: the mock is a demo prop, not a fallback strategy. PRD ADR-002 wanted "offline demo support"; it never asked for "silently substitute placeholders for real model output", and that distinction is the difference between a debuggable system and a Potemkin village.
How to apply:
- Wire any new LLM call site through get_llm() and let exceptions propagate. Don't catch LlmError and call MockLlm() to recover.
- For "demo without AWS" paths, expose a single env var (BUILDER_MODE=offline) and gate the mock backend on that var alone. make up-offline is the user-facing entry point.
- Surface the resolved mode on every API response that the UI uses to render builder UX (here: /v1/dashboard/state.builder_mode). The frontend MUST show an OFFLINE pill when offline so it's impossible to confuse with a real run.
- For SSE Clarifier errors, classify the exception (BuilderModeError → kind: "builder_unavailable", LlmError → kind: "llm_error", else unknown) so the modal can pick the right copy and CTA.
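A sketch of the gate with stub backends; resolved_builder_mode, MockLlm, and the Bedrock constructor are simplified stand-ins for the real wiring in app/widgets/llm.py.

```python
import os

class BuilderModeError(RuntimeError):
    """Live mode could not reach Bedrock; the runner turns this into a builder_unavailable SSE event."""

class MockLlm:
    """Deterministic offline backend (stand-in)."""

def make_bedrock_client(tier: str):
    raise NotImplementedError("real Bedrock wiring lives in app/widgets/llm.py")

def resolved_builder_mode() -> str:
    return "offline" if os.getenv("BUILDER_MODE") == "offline" else "live"

def get_llm(tier: str = "fast"):
    if resolved_builder_mode() == "offline":
        return MockLlm()  # explicit opt-in; the header shows the OFFLINE MODE pill
    try:
        return make_bedrock_client(tier)
    except Exception as exc:
        # Never swap in MockLlm here; fail loudly so config drift stays visible.
        raise BuilderModeError(f"Bedrock init failed in live mode: {exc}") from exc
```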
Edit-mode gates destructive controls; view mode hides them¶
Context: frontend/src/components/Header.tsx, frontend/src/App.tsx (editMode), frontend/src/widgets/MyWidgetsRail.tsx.
Rule: Destructive widget controls (dismiss/remove) must be invisible in view mode and only appear when the user explicitly enters an "Edit dashboard" mode. The header carries a single toggle that swaps between "Edit dashboard" and "Done"; entering edit mode also gives every card a dashed brand-blue outline so the operator never has to wonder which mode they're in. Place the destructive control well away from the metric-info badge — they live on opposite corners of the card (Remove top-left, MetricInfoBadge top-right) so a misclick on one cannot trigger the other.
Why: A user reported the original card layout had × (dismiss) and (i) (metric definition) sitting 4 px apart in the top-right of every card. In view mode this guarantees periodic accidental deletions; in any context it conflates "tell me what this measures" with "remove this from my dashboard". The Tableau / Notion / Looker pattern — view-mode shows nothing destructive, edit-mode reveals controls + a visual border state — solves both. Auto-exit edit mode when the rail empties to avoid an empty-but-dashed grid that looks broken.
How to apply:
- Put irreversible actions behind a mode toggle, not a hover or proximity to a non-destructive icon. Hover-only reveals are accessibility-hostile and easy to mistrigger on touch.
- Use a clear visual indicator for edit mode (dashed outline + brand ring + footer copy that says "click × to remove · press Done when finished"). Don't rely on the toggle button alone.
- When the underlying collection that edit mode operates on becomes empty, force-exit edit mode in a useEffect. An empty edit-mode UI is a broken UI.
Don't double-stack h-72 on parent and child cards¶
Context: frontend/src/widgets/MyWidgetsRail.tsx (card), frontend/src/widgets/WidgetPreview.tsx (inner).
Rule: When a card uses flex flex-col h-72 overflow-hidden with flex-1 overflow-hidden around the preview and a footer below it, the preview MUST size to h-full, never to a fixed pixel height. Stacking h-72 on both parent and child gives the inner content the full 288 px while the parent can only show 288 - footer_height, so the bottom of the chart (axis labels, legend rows) gets clipped by the parent's overflow-hidden.
Why: The chart widget's TX/CA legend was disappearing from every persisted chart preview on the dashboard. Inspection traced the bug to WidgetPreview wrapping each variant in <div className="relative h-72"> while the parent card in MyWidgetsRail was already h-72 and had a ~30 px footer below the preview. The two h-72s collided and the card's overflow-hidden swallowed the bottom row of the chart. Switching the inner wrappers to h-full (and adding h-full to the KPI grid wrapper which had no height at all) let the flex parent correctly distribute height between preview and footer. Generalization: any time a child sets a numeric height inside a flex parent that also constrains height, you've created a fight that the overflow-hidden will silently win.
How to apply:
- Inside a flex flex-col parent that constrains height, use h-full on children — let the flex layout decide how much room each child gets.
- If you need a minimum height on a child, use min-h-… rather than a fixed h-…, so the flex layout can still shrink the child when the parent is short.
- When debugging "content is being cut off", first check whether two ancestors are setting the same fixed height with overflow-hidden between them.
Vitest fake timers break RTL waitFor; use real timers + sleep¶
Context: frontend/src/dashboard/__tests__/useDashboardLayout.test.ts, frontend/src/dashboard/useDashboardLayout.ts (250 ms debounce in schedulePersist).
Rule: When a hook under test uses setTimeout (debounce, retry, throttle), do NOT call vi.useFakeTimers() and then await waitFor(...) from @testing-library/react. RTL's waitFor polls on setInterval internally — fake timers freeze that interval, the poll never fires, the assertion never resolves, and the test times out at 5 s. Use real timers and await new Promise(r => setTimeout(r, debounceMs + buffer)) to advance past the debounce window. Tests stay <2 s each.
Why: First pass of the drag-reorder hook tests used vi.useFakeTimers() + await vi.advanceTimersByTimeAsync(300) to flush the debounced PUT. All four hook tests timed out at 5000 ms with no other diagnostic. The error message ("Test timed out in 5000ms") points at the assertion, not at the cause; only re-reading waitFor's implementation makes it obvious that polling is on setInterval. Switching to real timers + an explicit 400 ms sleep made the same suite pass deterministically in ~450 ms per debounce-dependent test.
How to apply:
- Default to real timers in any test that mounts a component or runs a hook through renderHook + waitFor.
- If you must fake timers (e.g. to test a long retry without burning real seconds), pass toFake to scope the fake to specific timers: vi.useFakeTimers({ toFake: ['setTimeout', 'clearTimeout'] }). Leave setInterval real so RTL's poll still fires.
- For debounce assertions specifically, sleep for debounceMs + 150 ms of buffer — enough to cover the timer + the awaited fetch() PUT promise — then waitFor the post-PUT state.
Keyboard-sensor reorder needs real layout; assert activation, not movement¶
Context: frontend/src/widgets/__tests__/MyWidgetsRail.test.tsx, @dnd-kit/sortable's sortableKeyboardCoordinates.
Rule: In jsdom, getBoundingClientRect() returns zeros for every element — there is no layout engine. @dnd-kit's keyboard sensor uses bounding rects to compute "what's the next sortable item below the active one?" when the user hits ArrowDown. Without real layout it can't resolve a meaningful over target, so onDragEnd either fires with over.id === active.id (no movement detected) or doesn't fire at all. Don't write unit tests that assert "ArrowDown moves item A past item B" — that is an integration concern and requires a real browser. Instead assert the activation contract: pressing Space on the drag handle starts a drag for the right active.id, pressing Escape cancels cleanly. That proves the keyboard sensor is wired without needing layout.
Why: First pass of the MyWidgetsRail test pressed Space → ArrowDown → Space and asserted evt.over?.id was 'b' or 'c'. It always came back 'a' (or undefined). Time was lost suspecting userEvent, the activator ref, the sensor activation distance — all correct. The actual cause was that all three cards have a getBoundingClientRect() of {x: 0, y: 0, width: 0, height: 0}, so the keyboard coordinate getter has nothing to traverse. Browser-driven tests (Playwright / Cypress) would catch this trivially; jsdom cannot.
How to apply:
- Unit-test the contract: handle is focusable, Space starts a drag for the expected active.id, Escape cancels. Assert via onDragStart / onDragCancel callbacks attached to the test's DndContext.
- For full reorder verification, push that to an E2E test (Playwright / Cypress) where layout is real. None exists yet for this repo; if/when it does, the existing widget-drag-handle-<id> test ids are the right hook.
- Same trap applies to PointerSensor reorder behavior — the sensor itself works in jsdom (you can fire pointerdown/move/up), but collision detection that depends on rects will always return the same target. Don't write "drag from A to B" pointer tests in jsdom.
KPI delta color must derive from per-metric direction, not delta sign¶
Context: frontend/src/components/KpiTile.tsx, frontend/src/dashboard/sections.tsx (KPI_META), backend/app/metrics/schemas.py (future MetricDefinition.direction per ADR-007 follow-up).
Rule: A KPI tile's delta color must be a function of (sign(deltaPct) XOR directionGood), not sign(deltaPct) alone. Every KPI carries an explicit directionGood: "up_good" | "down_good" | "neutral" (active_issues / claims_in_progress are down_good; csat / nba_taken_pct / cost_avoided_mtd are up_good). The arrow glyph still tracks the literal sign so direction-of-change is preserved; only the color flips on the XOR. Same shape: valueKind: "percent" | "score" | "currency" | "count" decides whether to suffix % on the delta — never sniff delta_label.includes("month").
Why: During the dashboard widget review (docs/plans/completed/dashboard-widget-fixes-from-review.md), the most-glanced widgets on the page were painting +12% Active Issues and +8% Claims in Progress green because the tile coupled "positive delta = green" with no sense of which direction was good for the business. For an ops command center this is the worst-case framing: the dashboard tells the executive "things are great" while the underlying signal is the opposite. Same class of bug shows up in any UI that color-codes deltas without per-metric direction — every new bad-up metric is a future regression. The string-sniff suffix logic (delta_label.includes("month")) was the same pathology in the suffix layer: a copy edit from "vs last month" to "month-over-month" would silently break the suffix on a percent KPI.
How to apply:
- For any new KPI tile, require directionGood and valueKind at the call site. Default to "up_good" only when truly intended; default valueKind to "percent" is fine for the legacy 15m KPI pattern but be explicit for currency / score / count.
- Long-term, promote directionGood into MetricDefinition.direction in backend/app/metrics/schemas.py (ADR-007 already defines unit; direction is the natural sibling). Custom widgets pull both via MetricInfoBadge and don't have to re-encode the lookup. See docs/plans/active/promote-metric-direction-to-catalog.md.
- Pin both behaviors in unit tests at frontend/src/components/__tests__/KpiTile.test.tsx: assert down_good + positive delta → bad (red) (the regression case) and assert the suffix logic against a delta_label that contains "month" but is a percent KPI.
Empty-state copy must not contradict an adjacent populated tile¶
Context: frontend/src/components/TopIssuesTable.tsx, backend/app/dashboard_state.py (top_issues aggregator), backend/app/kpis.py (refresh_kpis → kpi:active_issues).
Rule: When two widgets render the same underlying signal (counts, totals, percentages) at different aggregations or windows, their query/window definitions must agree. Empty-state copy on the more-restrictive widget must NEVER read as a flat negation of the broader widget's value. Either (a) align the windows so the two can never disagree, or (b) make the empty-state copy explicit about the window difference ("0 issues in this 15m window — last issue X minutes ago" not "No active issues").
Why: kpi:active_issues is COUNT(*) FROM issue_sessions WHERE status IN ('open','in_progress') — no time bound. The Top Issues aggregator was filtering on opened_at > now() - interval '15 minutes'. Within 15 minutes of make demo-reset both agree, but as soon as the demo idles past the window the KPI tile renders 1,286 while the table directly beneath it renders "No active issues in the last 15 minutes". The contradiction is a credibility-killing UX defect even though each widget is locally correct. The hackathon dashboard had four widgets (Top Issues, heatmap, recommendations, alerts) any of which could in principle drift from the headline KPI — windowing alignment is a class concern, not a one-off bug.
How to apply:
- For any widget that breaks down a headline KPI, share the WHERE clause with the KPI's source query. If you must use a different window (genuine 15m / hour / day rollup), put the window in the API response (window_seconds: 900) and label the panel from that field — don't hardcode the window in the panel title.
- Empty states for windowed widgets should show the freshness signal ("last issue X minutes ago"), never a flat "No data" that contradicts an adjacent tile.
- Validate in scripts/verify-acceptance.sh — e.g. assert top_issues.length >= 1 when active_issues.value > 100. The contradiction is a property test, not a unit test.
Decorative <button type="button"> clothing makes accessibility audits fail¶
Context: frontend/src/components/TopIssuesTable.tsx, frontend/src/components/RecommendationsPanel.tsx, frontend/src/components/AlertsFeed.tsx, frontend/src/components/TimelinePanel.tsx.
Rule: Don't render decorative chrome ("View all", per-row "View") as <button type="button"> or <div> with hover:underline if the affordance has no destination. Either wire it to a real handler/route or render it as inert text (text-ink-400 cursor-default, no underline-on-hover). Same rule for chevrons next to disabled rows — drop them; <button disabled> quartet announces "unavailable button" four times to a screen reader.
Why: Six "View all" / "View" affordances across four panels were <button type="button" className="hover:underline"> with no onClick. Sighted users got hover styling and a pointer cursor, but clicking did nothing. Keyboard users could tab to them and press Enter to no effect. Screen-reader users heard "View all, button" four times across the dashboard with nothing behind it. The Recommendations panel separately rendered four mock rows as <button disabled> with chevrons, announcing "Send proactive battery offer, dimmed button" four times in a row even though the rows could never be opened. Both anti-patterns trip standard a11y audits (axe-core, Lighthouse) and look amateurish in the keyboard/AT walkthrough.
How to apply:
- For each interactive-looking element ask "what happens when this is clicked?" If the answer is nothing, downgrade to <span className="text-ink-400"> (no hover, no cursor pointer, no focus stop).
- For disabled rows in a list, render as <div role="presentation"> not <button disabled>. Drop the chevron / arrow glyphs that imply navigation. Reduce title opacity instead of relying on disabled styling.
- When in doubt, run the dashboard through the keyboard tab order and have someone close their eyes through a VoiceOver pass — the pretenders surface within ten seconds.
Naive ; split breaks any SQL with ; in a string literal¶
Context: backend/scripts/databricks_mock_data/uploader.py execute_sql_file, called from make seed-databricks against backend/databricks_schema/02_tables.sql.
Rule: When a script reads a multi-statement .sql file and dispatches each statement separately, do NOT split with body.split(";"). Walk the text char-by-char tracking whether you're inside a '...' (and "...") literal, and treat ; as a terminator only when you're outside any quote. Handle the doubled '' escape inside single-quoted strings.
Why: The first make seed-databricks run failed mid-DDL with [PARSE_SYNTAX_ERROR] Syntax error at or near '''. The DDL had COMMENT 'Customer master. Tier 70/30 standard/premium; region weighted by US state population.'. The naive ; split sliced that statement in half — leaving the warehouse with an unterminated string literal and the next "statement" starting at region weighted.... Cost ~10 minutes (rebuild image, re-run seed against the warm warehouse, debug, fix, re-run).
How to apply:
- For any script that splits SQL files, use a quote-aware tokenizer (see _split_sql_statements in backend/scripts/databricks_mock_data/uploader.py). Track in_single, in_double, in_line_comment. Look for the doubled '' escape pattern.
- This isn't a Databricks quirk — it bites Postgres, MySQL, SQLite. The same shape is in psql -f for a reason: the official client implements its own tokenizer.
- Before adding any COMMENT / SELECT '...' to a file consumed by your splitter, lint by re-running the seed locally; the failure is fast and obvious.
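A minimal quote-aware splitter; the real _split_sql_statements also tracks double quotes and line comments.

```python
def split_sql_statements(body: str) -> list[str]:
    statements, buf, in_single = [], [], False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_single:
            buf.append(ch)
            if ch == "'":
                if i + 1 < len(body) and body[i + 1] == "'":  # doubled '' escape stays inside the literal
                    buf.append("'")
                    i += 1
                else:
                    in_single = False
        elif ch == "'":
            in_single = True
            buf.append(ch)
        elif ch == ";":   # terminator only when outside any quote
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
        else:
            buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements

ddl = "CREATE TABLE t (c STRING COMMENT 'a; b'); SELECT 1;"
assert split_sql_statements(ddl) == ["CREATE TABLE t (c STRING COMMENT 'a; b')", "SELECT 1"]
```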
Databricks SQL connector defaults retry for 15 minutes — override for interactive endpoints¶
Context: backend/app/databricks/client.py _open_connection, exercised by GET /v1/databricks/health.
Rule: When opening a databricks.sql.connect(...) connection for an interactive endpoint (health check, request handler, anything <1s budget), explicitly pass _socket_timeout=<seconds>, _retry_stop_after_attempts_count=2, and _retry_stop_after_attempts_duration=<seconds>. The connector's defaults — 24 attempts over 900 seconds (15 minutes) — are tuned for unattended ETL, not for an operator hitting a curl.
Why: The first negative-path test (point DATABRICKS_HOST at a bogus hostname, expect 503 in seconds) hung for >2 minutes before the user aborted. The connector dutifully retried the DNS-failed request 24 times with exponential backoff. In a demo, that translates to "the dashboard's 'Databricks unhealthy' badge takes 15 minutes to appear after the warehouse stops" — exactly the silent-failure mode ADR-008 forbids.
How to apply:
- For any sql.connect(...) call powering a route handler, set the three _retry_* / _socket_timeout kwargs to a tight budget. 30s socket timeout + 2 retries + 30s total duration is generous for Serverless Starter cold starts but still fails fast.
- The kwargs are technically private (underscore-prefixed) but documented in databricks.sql.client.Connection.__init__ source. If the connector renames them in v4, the override should still surface as a clear connector error rather than a silent revert to defaults.
- For long-running batch scripts (e.g. seed-databricks), the defaults are fine — accept a slow seed over a flaky one.
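A sketch of the override, assuming the underscore-prefixed kwargs named above; hostname, path, and token are placeholders.

```python
from databricks import sql

def open_interactive_connection(host: str, http_path: str, token: str):
    # Tight budget for request-handler use: fail in well under a minute,
    # not the connector's default 24 attempts over 15 minutes.
    return sql.connect(
        server_hostname=host,
        http_path=http_path,
        access_token=token,
        _socket_timeout=30,
        _retry_stop_after_attempts_count=2,
        _retry_stop_after_attempts_duration=30,
    )
```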
Bake DDL into the image alongside scripts that consume it¶
Context: backend/Dockerfile COPY databricks_schema ./databricks_schema, backend/scripts/databricks_mock_data/main.py SCHEMA_DIR.
Rule: Any .sql file that a Python script in backend/scripts/ reads at runtime MUST be COPY'd into the image. Don't assume "it's in the repo so it's available" — make runs the script via docker exec, which only sees what the image baked.
Why: First make seed-databricks run failed at Step 2/4 — Apply schema DDL with FileNotFoundError: '/app/databricks_schema/01_create_schema.sql'. The directory existed in the host repo but the Dockerfile hadn't been updated. Same root cause as the existing entry "Bake auxiliary code into the image, don't docker cp" — extending it here for the SQL-files-as-data variant.
How to apply:
- Every new directory under backend/ that a runtime script reads needs a COPY line in backend/Dockerfile AND a comment explaining the bake rationale.
- Runtime test: after make up, docker exec 2026-hackathon-api-1 ls /app/<dir> should show the files. If it doesn't, the COPY is missing.
- This rule generalizes: configs (config/*.yaml), DDL (databricks_schema/), fixtures (tests/fixtures/), seed data (scripts/seeds/*.json).
Free Trial Serverless Starter caps batched-INSERT throughput; right-size mock volumes¶
Context: backend/scripts/databricks_mock_data/generators.py generate_issue_sessions (default n=20000, was 50,000).
Rule: When seeding a Databricks Free Trial Serverless Starter warehouse via databricks-sql-connector batched INSERT VALUES, target ≤25K rows per table. The warehouse's per-statement parse + plan + execute cycle is ~1-2 seconds even for 1000-row batches, so 50K rows = ~25 round-trips = ~50 seconds just for that table — multiplied across 7 tables, you blow MD-2's 5-minute budget.
Why: First make seed-databricks run with MD-2's spec'd 50K issue_sessions hit ~8.5 minutes and the user aborted. After cutting to 20K and bumping batch size from 1000 → 2000, the full seed completed in 205s (3.4 min — under budget). Acceptance check B2's active_issues ~15K landed at ~4K instead, but the qualitative shape (regional skew, conversion rate, weekly cycle) is preserved — that's what the demo shows.
How to apply:
- Default mock-data generators to volumes that fit a 5-min seed budget on Serverless Starter (~75K total rows across all tables).
- If a downstream test needs a specific count (Phase C's health check asserts rows_sampled == 5000), pin THAT table at the spec'd volume; right-size the others.
- For larger demos, switch to COPY INTO from S3 (MD-2 line 255) or use a paid warehouse — but that's Prompt 3+ territory, not Prompt 1.
- Document the deviation in the generator module's docstring AND in the plan's "Volume deviation" callout. Future-you reading the validation views (active_issues = 4138 instead of ~15000) will otherwise assume a regression.
Dual-DDL source of truth: db/init.sql AND _TABLE_DDL keep metrics_catalog honest¶
Context: db/init.sql (CREATE TABLE metrics_catalog ...), backend/app/metrics/catalog.py _TABLE_DDL, _LINEAGE_MIGRATIONS, called from app.main.lifespan via ensure_metrics_table.
Rule: Any column added to metrics_catalog must land in BOTH db/init.sql (the fresh-DB path, used by docker-entrypoint-initdb.d and by make demo-reset against a fresh volume) AND _TABLE_DDL in backend/app/metrics/catalog.py (the warm-DB path, used by ensure_metrics_table at every api boot). Idempotent ALTER TABLE ... ADD COLUMN IF NOT EXISTS migrations in _LINEAGE_MIGRATIONS cover dev DBs that predate the column. Treat the two DDL sources as dual source of truth — no wiring promotes one over the other.
Why: When Prompt 2 added the seven lineage columns (source_schema, source_table, source_query, last_validated_at, validation_status, governance_status, approved_by) to metrics_catalog, an early draft only edited db/init.sql. Fresh DBs (CI, make demo-reset on a fresh volume) would have the columns; warm dev DBs and the api's in-process ensure_metrics_table would not — and the api's CRUD layer would fail on writes the moment it tried to insert into a column the warm DB didn't have. The duplication is intentional: it ensures every code path that touches the table sees the same schema. The IF NOT EXISTS migrations are the safety net for devs whose volumes survived from before the column landed.
How to apply:
- For any change to metrics_catalog schema: edit db/init.sql CREATE TABLE block AND _TABLE_DDL in backend/app/metrics/catalog.py. Add an entry to _LINEAGE_MIGRATIONS for the same change so existing dev DBs auto-upgrade on next boot.
- The same dual-source applies to any other table where the lifespan migration path runs — widgets, dashboard_layouts. Today's pattern is consistent across all three; preserve it.
- Test: make demo-reset on both a fresh volume AND a warm volume. Both must produce a working SELECT * FROM metrics_catalog returning all columns.
- This is a class concern — when the lifespan grows a new ensure_*_table, the dual-source rule extends to it too.
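A sketch of the warm-DB half, with an abridged column list and a generic async connection handle standing in for the real ensure_metrics_table wiring; db/init.sql must carry the same shape for fresh volumes.

```python
# Abridged: the real table carries all seven lineage columns and more.
_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS metrics_catalog (
    metric_id TEXT PRIMARY KEY,
    definition TEXT NOT NULL,
    source_schema TEXT,
    source_table TEXT,
    source_query TEXT
)
"""

# Idempotent upgrades for dev volumes that predate a column.
_LINEAGE_MIGRATIONS = [
    "ALTER TABLE metrics_catalog ADD COLUMN IF NOT EXISTS source_schema TEXT",
    "ALTER TABLE metrics_catalog ADD COLUMN IF NOT EXISTS source_table TEXT",
    "ALTER TABLE metrics_catalog ADD COLUMN IF NOT EXISTS source_query TEXT",
]

async def ensure_metrics_table(conn) -> None:
    await conn.execute(_TABLE_DDL)
    for migration in _LINEAGE_MIGRATIONS:
        await conn.execute(migration)
```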
MetricEntity Literal entrenchment — keep entity bare, schema lives on source_schema¶
Context: backend/app/metrics/schemas.py MetricEntity = Literal['issue_session', 'claim', 'device', 'recommendation', 'alert', 'kpi', 'ev_claim', 'ev_product_catalog'], mirrored in frontend/src/widgets/types.ts. Widget specs, catalog rows, Bedrock tool schemas, and the Clarifier graph all reference this Literal — 14+ usages across backend + frontend at last count.
Rule: When adding a Databricks-routed metric whose entity name conflicts with an existing schema-qualified path, KEEP the MetricEntity Literal as the bare table name ('ev_claim', not 'l3_asurion.ev_claim') and put the schema on a new field (source_schema='l3_asurion'). Promoting schema-qualified names into the Literal would force renames in every consumer — Bedrock tool schemas, widget specs in flight, the Clarifier intent.entity field, frontend dropdowns, JSON Schema validation in tests. The blast radius is far larger than it looks.
Why: During Prompt 2 design we considered MetricEntity = Literal[..., 'l3_asurion.ev_claim', 'l3_asurion.ev_product_catalog'] so the schema could live in one place. A grep showed 14 distinct files would need updates, several inside JSON-schema string-matchers (Bedrock tool definitions). The entrenchment cost outweighed the elegance gain — the schema lives on source_schema instead, the entity stays bare, and the lineage column carries the qualifier wherever it needs to go. The catalog row is the join point.
How to apply:
- For a new Databricks metric: add the bare table name to MetricEntity if not already present (ev_claim was new; future tables follow the same shape).
- Set source_schema='<warehouse_schema>' and source_table='<bare_table>' on the catalog row. Never invent a new schema-qualified entity.
- The frontend mirror in frontend/src/widgets/types.ts MUST stay in lockstep — same Literal, same order. Tests in backend/tests/test_metric_routes.py assert this.
- If a future surface really needs schema in the entity (e.g. cross-schema metric collisions), promote source_schema into a structured MetricSource model rather than touching MetricEntity.
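A sketch of the shape, with the Literal copied from this entry and a trimmed MetricDefinition standing in for the real model.

```python
from typing import Literal, Optional
from pydantic import BaseModel

MetricEntity = Literal[
    "issue_session", "claim", "device", "recommendation",
    "alert", "kpi", "ev_claim", "ev_product_catalog",
]

class MetricDefinition(BaseModel):
    metric_id: str
    entity: MetricEntity                  # bare table name, e.g. "ev_claim"
    source_schema: Optional[str] = None   # e.g. "l3_asurion" (the join point for lineage)
    source_table: Optional[str] = None    # e.g. "ev_claim"
```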
Old custom-widget custom_widget_placeholder orphan caught by routing fail-loud¶
Context: config/metric_routing.yaml defensive entry for custom_widget_placeholder, backend/app/widgets/nodes/spec_synthesizer.py (synthesises the placeholder name when a custom widget lacks a resolved metric), backend/app/sql_gen/routing.py validate_routing_against_catalog.
Rule: When introducing a fail-loud boot validator that compares metrics_catalog against a config file, run the validator against the CURRENT state of every dev DB before merging — not just a fresh DB. Dev DBs accumulate orphan rows from prior sessions (custom widgets that got persisted with synthesized placeholder names, abandoned experiments, partial seeds). The validator will catch them at boot, and "boot fails on every dev's machine until they manually TRUNCATE" is a hostile rollout.
Why: First make up after wiring validate_routing_against_catalog failed with RuntimeError: Unmapped metric_ids in metrics_catalog: ['custom_widget_placeholder']. The custom_widget_placeholder row was synthesised by spec_synthesizer.py weeks earlier when a custom widget got persisted without a resolved metric. The fail-loud validator was working as designed — but the orphan made the api un-bootable, which then made make demo-reset un-runnable (chicken-and-egg, see next entry). Fix: defensive routing entry for custom_widget_placeholder (backend: postgres) so legitimate-but-stale orphans don't block boot, plus a one-time TRUNCATE for the immediate unblock.
How to apply:
- Whenever a new boot-validator-style gate is added, do a SELECT DISTINCT <key> FROM <table> on every reachable dev DB before merge and confirm the gate's reference set covers every value seen.
- Keep a defensive routing entry for any synthesizer-generated placeholder names (custom_widget_placeholder etc.). The validator's job is catching real drift, not policing transient placeholder names from abandoned widgets.
- For future fail-loud gates: add a "first-run drift survey" step to the rollout plan — produce the union of dev-DB values, diff against the config, ship a defensive config that covers the union.
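A hedged sketch of that drift survey, assuming a DB-API/psycopg-style connection; the key column name (metric_id) comes from the error message above, and the helper itself is a hypothetical one-off, not repo code:

```python
# Pre-merge survey: which catalog values would the fail-loud gate reject?
def survey_unmapped_metric_ids(conn, mapped_ids: set[str]) -> set[str]:
    """Diff the catalog's distinct keys against the routing config's reference set."""
    with conn.cursor() as cur:
        cur.execute("SELECT DISTINCT metric_id FROM metrics_catalog;")
        seen = {row[0] for row in cur.fetchall()}
    return seen - mapped_ids

# Ship the gate only when this is empty on every reachable dev DB, or when the
# defensive routing entries cover whatever it returns.
```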
make demo-reset chicken-and-egg when boot validator blocks startup¶
Context: Makefile demo-reset target (curl -X POST http://localhost:8000/v1/admin/reset-demo), app.main.lifespan's routing validator, db/init.sql.
Rule: make demo-reset requires the api to be running, because it's a POST to a route handler. When the boot validator fails (e.g. orphan custom_widget_placeholder per the entry above), the api can't start, so make demo-reset can't run, so the orphan can't be cleared via the normal path. Document the fallback: docker exec <db-container> psql -U cmdcenter -d cmdcenter -c "TRUNCATE metrics_catalog CASCADE;".
Why: The first time the routing validator caught the custom_widget_placeholder orphan, the natural thing to reach for was make demo-reset — except demo-reset is implemented as curl POST /v1/admin/reset-demo, /v1/admin/reset-demo is a route handler, and the api process couldn't get past lifespan long enough to serve it. Five minutes were wasted before realising the recovery path is direct SQL via docker exec. The same shape will bite any future boot-time fail-loud gate.
How to apply:
- Document the direct-SQL escape hatch in Makefile (a demo-reset-hard target that does the TRUNCATE directly) and in CLAUDE.md runbook section.
- For boot-blocking fail-loud paths, keep the validator AS-IS but make sure the recovery path doesn't depend on the api itself. Direct SQL via docker exec is the canonical fallback.
- When designing future admin endpoints (reset-demo, governance approval, schema-validation-rerun), prefer designs that survive a partial boot. FastAPI serves nothing at all if lifespan actually raises, so surviving means catching the validator failure, marking the app degraded, and keeping admin routes usable while everything else refuses (see the sketch after this list). Not free; it requires explicit handling, and today's stack accepts the trade-off and relies on the docker-exec fallback instead.
- Username gotcha: the Postgres user is cmdcenter, NOT appuser (the psql default $USER). docker exec 2026-hackathon-db-1 psql -U cmdcenter -d cmdcenter ... is the working invocation.
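One possible shape of that degrade-instead-of-die boot, assuming FastAPI/Starlette. The validate_routing_against_catalog call signature, app.state.boot_error, and the /v1/admin/ prefix check are assumptions for illustration, not how app.main is wired today:

```python
# Sketch: record the boot failure so admin recovery routes stay reachable.
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from starlette.responses import JSONResponse

from app.sql_gen.routing import validate_routing_against_catalog  # signature assumed

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.boot_error = None
    try:
        validate_routing_against_catalog()   # the existing fail-loud gate
    except RuntimeError as exc:
        app.state.boot_error = str(exc)      # degrade instead of raising
    yield

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def refuse_when_degraded(request: Request, call_next):
    # Non-admin traffic fails loudly; /v1/admin/* stays usable for recovery.
    if request.app.state.boot_error and not request.url.path.startswith("/v1/admin/"):
        return JSONResponse({"detail": request.app.state.boot_error}, status_code=503)
    return await call_next(request)
```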
make docs-validate — pick image + ruleset that match the prototype's reality¶
Context: Makefile docs-validate target, mkdocs.yml plugins block, redocly.yaml extends: minimal.
Rule: When a documentation gate (mkdocs build, redocly lint) fails on first run, fix it by lowering the rules to match what the codebase actually is, not by ratcheting the codebase up to enterprise hygiene. Three levers, in order of preference:
1. Container choice. squidfunk/mkdocs-material:latest does NOT bundle mkdocs-mermaid2-plugin. Use mkdocs-material's built-in mermaid support instead: a pymdownx.superfences custom fence with format pymdownx.superfences.fence_code_format renders mermaid blocks as <pre class="mermaid">, and the theme's bundled JS handles the rest. No plugin install needed. (When the docs land in real Backstage, the spotify/techdocs container DOES include mermaid2; switch then.)
2. mkdocs config. docs_dir: '.' is forbidden — the directory containing the config file cannot be the docs dir. Use docs_dir: docs/ (the convention) and link top-level files (prd.md, CLAUDE.md) from inside docs via repo_url. Also: don't run --strict on a repo where archived completed plans deliberately link to source files; set validation: { unrecognized_links: ignore } and drop --strict.
3. Redocly preset. The default recommended preset enforces servers, security, and 4xx responses on every operation — none of which a hackathon prototype has. Pin a redocly.yaml with extends: minimal so the gate validates the spec's correctness, not its production hygiene. When the spec graduates, swap to recommended and address rules item by item.
Why: The first make docs-validate after wiring up the rollout failed three times in succession: missing mermaid2 Python module, forbidden docs_dir: '.', then 21 redocly errors from the recommended preset. Each was a tooling-default mismatch with prototype reality, not a real defect in the docs. Fighting the defaults wastes more time than configuring them. The total fix was ~30 minutes; ratcheting the spec to satisfy recommended would have been hours and yielded a worse fit (fake server URLs, fake security schemes).
How to apply:
- Before adding a doc-validation gate to CI, run it locally first and decide for each rule class: real defect, or tooling default that doesn't match this codebase?
- Prefer config-file overrides (mkdocs.yml validation:, redocly.yaml extends:) to per-line ignores. They document the project's stance once, not noise on every line.
- When a doc tool insists on a Python plugin (mermaid2, mkdocs-monorepo-plugin, etc.), check the rendering theme first — modern mkdocs-material handles many cases without plugins.
- The existing Makefile docs-validate target is the canonical recipe: mkdocs build (no --strict) + redocly lint --config redocly.yaml. Both run in containers, no host deps.
Resolver-side imports of app.sql_gen.generator MUST be lazy¶
Context: backend/app/widgets/data_resolver.py:69-75 (cycle-breaking comment) + :387 (deferred from app.sql_gen.generator import generate_sql), backend/app/sql_gen/generator.py, backend/app/widgets/llm.py, backend/app/widgets/__init__.py, backend/app/widgets/routes.py, backend/app/main.py.
Rule: Any module under backend/app/widgets/ that needs app.sql_gen.generate_sql MUST defer the import to inside the function that calls it — never at module top. Symmetrically, no code under backend/app/sql_gen/ may add a top-level import from backend/app/widgets/ other than the existing app.widgets.llm.get_llm (which is the documented bridge). Add an inline comment at the lazy-import site explaining the cycle so future "clean up" passes don't hoist it.
Why: First import attempt of the new resolver crashed every test with ImportError: cannot import name 'generate_sql' from partially initialized module 'app.sql_gen.generator' (most likely due to a circular import). The chain: app.main → mounts SQL-gen routes → app.sql_gen.routes → app.sql_gen.generator → app.widgets.llm (for get_llm()) → app.widgets/__init__.py (package import side-effect) → app.widgets.routes (re-export so app.main can mount it) → app.widgets.data_resolver → app.sql_gen.generator (cycle). app.widgets.llm is the structural bridge — generator.py legitimately needs get_llm(), and app.widgets/__init__.py legitimately re-exports routes so main can mount them. Neither side can drop the dependency without rewiring boot. Lazy import inside _run_databricks (which is the only call site) breaks the cycle at zero ergonomic cost.
How to apply:
- For any new caller of app.sql_gen.generate_sql (or any other top-level export of app.sql_gen.generator) from inside app/widgets/: defer with def my_func(...): from app.sql_gen.generator import generate_sql; … (a sketch of the pattern follows this list).
- Diagnostic phrase to recognize: partially initialized module in the ImportError message. That's always a circular dep, not a missing module.
- Generalization: any time you add a new top-level dir under backend/app/ that calls into another top-level dir already used by app.main lifespan, sketch the import graph FIRST. The hackathon's eight-or-so packages are small enough that import cycles only appear when you cross between two routers that both need app.widgets.llm.
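A minimal sketch of the deferred import; the real _run_databricks differs in signature and body, and the call shape is illustrative — only the import placement and the cycle-breaking comment are the point:

```python
# backend/app/widgets/data_resolver.py (shape only)
def _run_databricks(metric, widget_spec):
    # Deliberately deferred: a module-level import would re-enter
    # app.sql_gen.generator while it is still initializing (circular import
    # via app.widgets.llm -> app.widgets.__init__ -> data_resolver).
    from app.sql_gen.generator import generate_sql

    sql = generate_sql(metric)  # call shape illustrative
    ...
```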
Pydantic-typed test fields don't dict-compare; normalize with model_dump()¶
Context: backend/tests/test_data_resolver.py:276-278 (test_postgres_happy_path_returns_real_rows), backend/app/widgets/data_resolver.py (DataSchemaColumn, DataResolverResponse).
Rule: When asserting equality between a Pydantic-model field and a dict literal in tests, ALWAYS normalize the model first via model_dump(). assert model == {"a": 1} fails even when the model's fields exactly match the dict — Pydantic v2 models compare equal to other models, not to mappings. The same applies to lists of models: compare [c.model_dump() for c in coll] == [{...}, {...}] (see the sketch at the end of this entry).
Why: First green-then-red moment for the resolver test suite. The data shapes were correct (DataSchemaColumn(name='value', type='number') matches {"name": "value", "type": "number"} field-for-field), but the assertion failed with AssertionError: assert ([DataSchemaColumn(name='value', type='number')] == [{'name': 'value', 'type': 'number'}]). Pydantic v1 implemented __eq__ against any Mapping; Pydantic v2 dropped that behavior because it caused subtle bugs (e.g. dict-with-extra-keys appeared equal to a model). The repo runs Pydantic v2 throughout. Same trap waits for any future test that expects a Pydantic-typed list field on DataResolverResponse / WidgetSpec / MetricDefinition to dict-compare.
How to apply:
- For Pydantic fields under test, normalize before comparing: [c.model_dump() for c in coll] or result.model_dump(by_alias=True) for the whole response (use by_alias when the field has a Pydantic alias= like schema_ aliasing schema).
- Defensive variant for mixed lists where some elements may already be dicts: [c.model_dump() if hasattr(c, "model_dump") else c for c in coll].
- Diagnostic shape: if an AssertionError shows parens-wrapped Pydantic reprs ([ModelName(field=...)]) on the left and plain dicts on the right, this is the bug. Don't waste time on the data — fix the comparison.
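A sketch of the normalization pattern in test form; the fixture and field names (resolver_response, .columns) are illustrative, not the actual test file's:

```python
def test_schema_columns_match_expected(resolver_response):
    expected = [{"name": "value", "type": "number"}]

    # Wrong: a Pydantic v2 model never compares equal to a plain mapping,
    # so this fails even when every field matches.
    # assert resolver_response.columns == expected

    # Right: dump to plain dicts first; add by_alias=True when a field uses
    # alias= (e.g. schema_ aliased to "schema").
    assert [c.model_dump() for c in resolver_response.columns] == expected
```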
Reasoning-tier latency busts the SQL-gen budget — probe before committing¶
Context: backend/app/sql_gen/generator.py:537 (the SQL gen Bedrock call), config/sql_generator.yaml (generation_timeout_s: 5), ADR-011.
Rule: When promoting a call site to a higher tier (per ADR-011), measure end-to-end latency against the call site's existing timeout budget BEFORE flipping the tier in code. The right tier for a call site is min(quality_required, latency_budget_allows) — not just "the smartest model that has tool-use access."
Why: First pass at ADR-011 routed SQL gen to the reasoning tier (Opus 4.7) on the principle "this is multi-step reasoning over a multi-table dictionary slice." Live probe on 2026-05-06 showed Opus 4.7 clocked 11–19s on the same dictionary slice that Sonnet 4.6 handles in ~3–4s, against a 5s generation_timeout_s budget tuned for the demo loop. Result: every Databricks-routed widget on the dashboard rendered an amber bedrock_unavailable SourceBadge for healthy infrastructure — a regression with zero quality upside, because today's single Databricks-routed metric (claim_volume_l3_asurion, a COUNT(DISTINCT claim_id) over 30 days) is well within Sonnet 4.6's reach. The safety layer in app.sql_gen.safety is the actual correctness guarantee for SQL — not the model — so the marginal reasoning gain on simple cases doesn't justify breaking the budget.
How to apply:
- Before flipping get_llm("fast") → get_llm("reasoning") (or vice versa), run a one-off probe inside the api container that calls the same code path with both tiers and measures wall-clock time. Compare against the call site's timeout_s (whether it's widget_llm_timeout_s, generation_timeout_s, or a per-call override). A probe sketch follows this list.
- If reasoning-tier latency blows the budget AND the call site has a flat schema with a strong post-LLM safety net (e.g. SQL gen's app.sql_gen.safety, or any call site whose output is structurally validated by Python), prefer the fast tier. Document the trade-off in the ADR's call-site mapping table — don't just leave the call site as fast without context.
- If reasoning-tier quality is genuinely required (deeply nested schemas, no post-LLM safety net), relax the budget instead. config/sql_generator.yaml is the right knob for SQL gen; widget_llm_timeout_s for the Clarifier.
- Treat the post-LLM safety layer as evidence about model needs. A call site with a strong validator (Pydantic + custom checks) tolerates a smaller model; a call site whose output is consumed verbatim (codegen TSX, prose, decisions) needs the bigger one.
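A hedged probe sketch, assuming get_llm returns a LangChain-style client with invoke(); the prompt payload is a placeholder for whatever the real call site sends:

```python
# Run inside the api container before flipping a call site's tier.
import time
from app.widgets.llm import get_llm

def probe(tier: str, prompt: str, runs: int = 3) -> list[float]:
    llm = get_llm(tier)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        llm.invoke(prompt)  # assumed LangChain-style interface
        timings.append(time.perf_counter() - start)
    return timings

prompt = "..."  # the same dictionary slice / prompt the real call site sends
for tier in ("fast", "reasoning"):
    print(tier, [f"{t:.1f}s" for t in probe(tier, prompt)])
# Compare against the call site's timeout_s before committing the tier change.
```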
Reroute moments need a pre-existing column on the destination side¶
Context: PRD v2.1 §C.6.1 (the "reroute an existing tile from Postgres to Databricks with one YAML change" demo moment), backend/scripts/databricks_mock_data/l3_asurion_generators.py, data-dictionary/column_dictionary.csv, backend/app/metrics/seed.py (the cost_avoided_mtd row), config/metric_routing.yaml.
Rule: A "live reroute" demo moment for any metric requires four pieces to all exist BEFORE curtain: (1) the destination warehouse table has the column the metric needs, (2) the canonical data dictionary lists it (so make validate-dictionary stays green), (3) the metrics_catalog.source_query for the metric uses that column, (4) the Postgres-side path keeps its hardcoded SQL allowlist entry untouched. Skipping any of the four turns the on-stage YAML edit into a runtime failure (boot validator, SQL gen 503, or zero-row result that looks like a bug).
Why: The first attempt at the §C.6.1 reroute path for cost_avoided_mtd was narrative-only — the metric was Postgres-routed, and the plan was to "just flip the YAML for the demo." But the metric's source_query was empty (no Databricks-side template), the l3_asurion.ev_claim table didn't have a cost_avoided_usd column at all, and Bedrock would have happily generated SQL referencing a non-existent column → safety-layer rejection or a runtime PARSE_SYNTAX_ERROR. The actual reroute needs cost_avoided_usd in the seeder + dictionary + catalog source_query, all landing together in the same commit. Time cost when caught at first live curl: ~10 minutes (re-seed Databricks). Time cost if caught on stage: demo aborted.
How to apply:
- For any metric named in PRD §C.6 / §C.6.1 as a "reroute candidate", treat the four-step add as a single atomic change. Do NOT land the catalog source_query without also landing the seeder column and the dictionary entry, or vice versa.
- Pre-validation gate: before declaring the reroute path ready, run both curls back-to-back — backend: postgres should return the v1 synthetic figure in <300ms; backend: databricks (after make up) should return real Databricks rows with source: bedrock in the cache-warm budget. Capture both response JSONs as receipts.
- The Postgres allowlist in data_resolver._POSTGRES_QUERIES is the OTHER source of truth for the metric. Editing the catalog's default_filter.window does NOT change the Postgres path — that path's window is hardcoded in the allowlist's WHERE clause. Document the windowing on each side of the reroute (e.g. "Postgres = MTD via date_trunc; Databricks = trailing-30d") so the audience-visible value swap doesn't look like a bug.
- After the four pieces are in place, exercise both --reset paths once: make seed-databricks-l3 ARGS=--reset AND TRUNCATE metrics_catalog && docker compose restart api (so seed_if_empty re-applies your new source_query).
Dual source of truth: metrics_catalog.source_query AND the Postgres allowlist¶
Context: backend/app/widgets/data_resolver.py _POSTGRES_QUERIES, backend/app/metrics/seed.py SEED_METRICS, config/metric_routing.yaml.
Rule: Any metric that can be routed to either Postgres or Databricks (today: just cost_avoided_mtd, but the §C.6.1 pattern generalizes) lives in two places — the catalog's source_query column (Databricks template the LLM elaborates from) and _POSTGRES_QUERIES in data_resolver.py (the hardcoded Postgres SQL the resolver runs verbatim). Edits to one path do not affect the other; both must stay aligned in shape (single-row value column for KPIs, etc.) so the renderer code path is identical regardless of routing.
Why: The Postgres path deliberately bypasses the SQL generator — it's the fast (<300ms p95) path that doesn't pay the Bedrock + sqlglot tax. A metric routed with backend: postgres in metric_routing.yaml therefore reads ZERO bytes of metrics_catalog.source_query; the resolver looks up the metric.name in the in-memory allowlist instead (see the sketch at the end of this entry). This is by design — it preserves the v1 dashboard's tight latency budget — but it means a "fix the catalog and forget" mental model is a footgun. When cost_avoided_mtd's catalog row gained a Databricks source_query, the Postgres allowlist needed NO change; the Postgres path kept returning the v1 synthetic $1.41M because the allowlist's hardcoded SUM(cost_avoided) FROM outcomes WHERE recorded_at >= date_trunc('month', now()) is the literal source of truth there.
How to apply:
- When adding a new dual-routed metric, change BOTH places in the same commit and call out the dual-source-of-truth in the metric's definition text so a future reader doesn't try to "consolidate" them.
- When the response shape needs to change (e.g. add a delta column), update both sides in lockstep — the renderer can't tolerate divergent shapes from the two backends.
- Diagnostic: if a Postgres-routed widget still returns the OLD value after you "updated the metric", you almost certainly only edited the catalog. Check _POSTGRES_QUERIES next.
- Future generalization: if more than one metric needs both paths, consider promoting the Postgres-side SQL into a column on metrics_catalog (e.g. postgres_query) so the dual-source-of-truth collapses. Until then, the in-code allowlist is the practical convention.
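A condensed sketch of why the two paths never read each other's SQL; the run_* helpers and the resolve signature are illustrative stubs, and the Postgres entry mirrors the allowlist SQL quoted above:

```python
_POSTGRES_QUERIES = {
    # Hardcoded and run verbatim; metrics_catalog.source_query is never read here.
    "cost_avoided_mtd": (
        "SELECT SUM(cost_avoided) AS value "
        "FROM outcomes "
        "WHERE recorded_at >= date_trunc('month', now())"
    ),
}

def run_postgres(sql: str): ...        # fast path: no Bedrock, no sqlglot (stub)
def run_databricks(sql: str): ...      # generated-SQL path (stub)
def generate_sql(metric) -> str: ...   # elaborates from metrics_catalog.source_query (stub)

def resolve(metric, backend: str):
    if backend == "postgres":
        return run_postgres(_POSTGRES_QUERIES[metric.name])
    return run_databricks(generate_sql(metric))
```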
seed_if_empty is INSERT-only — TRUNCATE before re-seeding catalog edits¶
Context: backend/app/metrics/seed.py seed_if_empty (SELECT COUNT(*) FROM metrics_catalog; if 0: seed), backend/app/seed.py truncate_all (does NOT touch metrics_catalog).
Rule: Edits to SEED_METRICS rows in backend/app/metrics/seed.py (definition, formula, source_query, etc.) DO NOT propagate to a dev DB on make up alone. The lifespan's seed_if_empty only runs when the catalog table is empty; warm dev DBs already have rows from previous boots, so the new seed values silently sit in code without being applied. To pick up the edit: docker exec 2026-hackathon-db-1 psql -U cmdcenter -d cmdcenter -c "TRUNCATE metrics_catalog;" then docker compose restart api so seed_if_empty fires on a now-empty table.
Why: Discovered during the §C.6.1 reroute path build — added a Databricks source_query to the cost_avoided_mtd row in seed.py, ran make up, queried the catalog, found source_query IS NULL. The seed_catalog function iterates SEED_METRICS and calls create_metric ONLY when get_metric_by_name(conn, body.name) is None (line 303), so an existing row with the same name is skipped entirely. make demo-reset doesn't help either — truncate_all lists outcomes, ai_recommendations, alerts, events, claims, issue_sessions, devices, customers, operations_context, plans, widgets but not metrics_catalog, because the catalog is supposed to be stable across demo iterations.
How to apply:
- Workflow for catalog-row edits during development: docker exec 2026-hackathon-db-1 psql -U cmdcenter -d cmdcenter -c "TRUNCATE metrics_catalog;" && docker compose restart api. The api boots, sees the empty table, runs seed_if_empty, applies the new SEED_METRICS verbatim. Verify via docker exec ... psql -c "SELECT name, source_query FROM metrics_catalog WHERE name = '...';".
- For a clean slate (drops widgets too, since they reference metric_id): docker exec 2026-hackathon-db-1 psql -U cmdcenter -d cmdcenter -c "TRUNCATE metrics_catalog, widgets RESTART IDENTITY CASCADE;" then restart. Widgets must be rebuilt via Add Widget Clarifier — same flow as a fresh-DB demo.
- For a CI / fresh-machine path the issue doesn't appear because seed_if_empty runs against an empty table by construction. The trap is purely a warm-dev-DB pitfall.
- Generalization: any "INSERT-on-empty" seeder pattern hides the same bug. If the seeder grows to >5 rows OR stops being purely additive (e.g. value updates on existing rows), promote it to a deterministic UPSERT (ON CONFLICT (name) DO UPDATE SET ...).
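If the seeder does get promoted, the UPSERT would look roughly like this; it assumes a unique constraint on name, psycopg-style named parameters, and a reduced column list (the real SEED_METRICS rows carry more fields):

```python
UPSERT_SQL = """
    INSERT INTO metrics_catalog (name, definition, source_query)
    VALUES (%(name)s, %(definition)s, %(source_query)s)
    ON CONFLICT (name) DO UPDATE SET
        definition   = EXCLUDED.definition,
        source_query = EXCLUDED.source_query;
"""

def seed_catalog_upsert(conn, seed_metrics: list[dict]) -> None:
    # Deterministic: edits to SEED_METRICS land on warm dev DBs too,
    # without the TRUNCATE-then-restart dance.
    with conn.cursor() as cur:
        for row in seed_metrics:
            cur.execute(UPSERT_SQL, row)
    conn.commit()
```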
Inspector panel needs a clickable fallback once the live rec is consumed¶
Context: backend/app/mock_recs.py, backend/app/dashboard_state.py (customer_device_for), backend/app/main.py (/v1/feedback/outcome), frontend/src/components/RecommendationsPanel.tsx, frontend/src/components/CustomerDevicePanel.tsx.
Rule: Any inspector that pivots on a "selected live thing" (the Customer & Device Detail panel pivots on selected_issue_session_id) must keep at least one row in the source list interactive after the live data is consumed — otherwise the operator approves the only live recommendation and stares at an empty Select an issue to inspect card with no way back. The honest fix is to make the existing visual-stub rows clickable, bound to synthetic detail payloads, with a "Sample data" pill on the destination panel so the user always sees they're looking at a stub. Do NOT silently substitute mock data into the live recommendations list — that violates Mocks must be opt-in.
Why: First post-Approve dashboard state on 2026-05-06 had recs:active = [] (the seeded pending rec flipped to approved), so the recommendations panel rendered only the four visual-stub cards. Those cards were intentionally inert (isLive = !rec.mock gated the click handler), so every click was swallowed and the right panel stayed at Select an issue to inspect permanently — until a fresh make demo-reset. Asking an operator to run a make target to re-populate the demo is not a viable recovery path on stage. The fix is the same shape as the "Sample data" pill on the recommendations panel header: make the mocks first-class clickable surfaces, embed a synthetic Customer & Device payload alongside each mock_* recommendation in mock_recs.py (with a stable mock_session_* id), and short-circuit the customer_device_for lookup when the session id has the mock prefix. The destination panel grows its own "Sample data" pill so the synthetic Priya/Marcus/Alicia/Jordan record can never be confused with a real customer.
How to apply:
- For any future panel of shape "list-on-the-left + detail-on-the-right driven by a selected id", inventory the dead-end paths early: what happens when the list is empty? When all live rows are consumed? When no row is selected on cold boot? Each path needs an honest fallback or a recovery affordance.
- Keep mocks honest — the "Sample data" pill on RecommendationsPanel is mirrored on CustomerDevicePanel by inspecting the mock_session_ prefix of the session id. ADR-008 is the rule; the prefix-detection is a stable convention because mock session ids never collide with Postgres UUIDs.
- Backend short-circuits live in two places: dashboard_state.customer_device_for returns the synthetic card for any mock_session_* id without touching Postgres, and /v1/feedback/outcome returns {"status": "noop"} for any mock_* rec id so the Approve flow can be rehearsed without a UUID-cast SQL error (see the sketch after this list).
- The frontend selection handler (onSelect in App.tsx) drops the if (rec.mock || !rec.issue_session_id) return guard — mock recs are now selectable as long as they ship a stable issue_session_id. The "visual stub" subtext on the card and the "Sample data" pill on the detail panel preserve honesty.
- Generalization: this pattern works for any "demo-only inspector" where the live data is scarce (single seeded row) but the UX needs a continuous interactive experience. Inert mocks are a UX liability if they coexist with consumable live data; clickable mocks with destination-side honesty cues are the better default.
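A backend-side sketch of the two short-circuits, with the lookup table and persistence helpers stubbed; the real functions take connections and request models and live in dashboard_state.py and main.py:

```python
MOCK_SESSION_PREFIX = "mock_session_"
MOCK_CUSTOMER_DEVICE_CARDS: dict[str, dict] = {}  # shipped alongside mock_recs (stub)

def load_from_postgres(session_id: str) -> dict: ...  # real lookup (stub)
def persist_outcome(rec_id: str) -> dict: ...         # real UUID-keyed write (stub)

def customer_device_for(session_id: str) -> dict:
    if session_id.startswith(MOCK_SESSION_PREFIX):
        # Synthetic card: never touches Postgres, always flagged "Sample data" in the UI.
        return MOCK_CUSTOMER_DEVICE_CARDS[session_id]
    return load_from_postgres(session_id)

def record_outcome(rec_id: str) -> dict:
    if rec_id.startswith("mock_"):
        return {"status": "noop"}  # Approve flow rehearsable without a UUID cast error
    return persist_outcome(rec_id)
```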
Provenance pills belong in the metadata footer, not as absolute overlays¶
Context: frontend/src/widgets/WidgetPreview.tsx, frontend/src/widgets/MyWidgetsRail.tsx, frontend/src/widgets/SourceBadge.tsx.
Rule: Provenance affordances (the green "Postgres · live" / purple "Databricks · live" / amber "Mock · live data unavailable" pill from SourceBadge) MUST render in the document flow inside the card's metadata footer — never as absolute left-2 bottom-2 overlays on top of the variable-height tile body. The tile body's content (KPI value + delta + caption, chart axes, table rows, custom-component output) is not under our control across all four widget types, and an absolute overlay at a fixed offset will eventually clip a caption, a label, or an axis tick. The footer bar (border-t border-ink-100 px-3 py-1.5) already exists for raw_input quote + "model assumed" — it is the metadata strip and the badge belongs there.
Why: First implementation of SourceBadge (Prompt 5) wrapped it in <div className="pointer-events-auto absolute left-2 bottom-2 z-10"> inside WidgetPreview. On the Cost avoided (MTD) KPI tile that sits in MY WIDGETS, KpiTile renders the small "vs last month" caption near the bottom-left of the tile body — exactly where the pill landed. Result: the green pill sat ON TOP of the caption (the "Cost avoided (MTD) — pre-validation widget for §C.4 1-minute path" text was hidden behind "Postgres · live"). The user reported it on 2026-05-06 with a screenshot that made the overlap unmistakable. Lifting the pill into the existing MyWidgetsRail footer bar — the same one that already carried the raw_input quote and the "custom" component pill — eliminated the overlap and grouped all metadata together; the rail's flex flex-wrap items-center gap-x-2 already accommodates a new chip without further layout work.
How to apply:
- For any new "I want a small contextual pill on a widget tile" requirement: render it in the MyWidgetsRail footer bar (same pattern the SourceBadge now uses), NOT as an absolute overlay inside WidgetPreview. The footer is the metadata strip — provenance, source, refresh hints, etc., all belong there together.
- Reserved exception: MetricInfoBadge (the "i" trigger that opens the metric definition popover) intentionally stays as absolute right-2 top-2 on WidgetPreview. It's a discoverability affordance for the metric definition, not a status indicator, and the popover is portal-rendered (see MetricInfoBadge.tsx — uses createPortal to escape overflow-hidden). Anything else should use the footer. If a future affordance needs both a status meaning AND constant visibility, still prefer the footer pattern; requiring the user to scroll to see metadata is acceptable.
- Architectural note: the cleanest split is that WidgetPreview renders the content (KPI/chart/table/custom plus the metric-info-badge popover trigger, with the title on the body), while the card chrome (the footer with raw_input + provenance + assumptions) is owned by MyWidgetsRail. Don't push provenance back into WidgetPreview because the builder-modal preview path renders WidgetPreview standalone and has no live source — the provenance shape is rail-only by construction.
- The useWidgetData hook returns {data, source, freshnessSeconds, liveDataUnavailable, generatedSql}. Pass data down to WidgetPreview as liveData, and render the rest in the footer adjacent to raw_input. One hook call per WidgetCard — keep the hook count stable per render.
Clarifier and SQL generator need a shared column naming contract¶
Context: backend/app/widgets/nodes/spec_synthesizer.py, backend/app/widgets/data_resolver.py, backend/app/sql_gen/generator.py
Rule: When the metric has a source_query (Databricks-routed), extract the projected column names and inject them as a hard constraint into the spec_synthesizer prompt. The LLM must use these exact names in x_axis.field, series[].field, columns[].field, and mock_data keys.
Why: The Clarifier invented abstract names ("status", "count") while the SQL generator faithfully used data dictionary names ("claim_status_code", "claim_count"). Recharts uses spec.x_axis.field as dataKey — row["status"] returned undefined, rendering empty bars despite 7 valid rows from Databricks. The data dictionary was only used during SQL generation; the Clarifier never saw it. Two independent LLM processes with no shared naming contract will always diverge.
How to apply:
- Any new LLM-to-LLM pipeline where one process generates a schema and another generates data must share the column/field naming contract explicitly. Don't assume two independent LLM calls will agree on names.
- The fix: _extract_source_columns() in spec_synthesizer.py parses the metric's source_query via sqlglot.parse_one(dialect='databricks') and injects the column names as a mandatory contract section into the synthesizer prompt. Rule #7 in spec_synthesizer.md enforces this (a sketch of the extraction follows this list).
- For Postgres-only metrics (no source_query), the existing behavior is fine — they use hardcoded _POSTGRES_QUERIES with intentional column names like AS value.
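An approximate shape of the extraction, assuming the projected names can be read straight off a top-level SELECT; the real _extract_source_columns may handle more cases (CTEs, star expansion):

```python
import sqlglot

def extract_source_columns(source_query: str) -> list[str]:
    """Projected column names the spec_synthesizer prompt must reuse verbatim."""
    parsed = sqlglot.parse_one(source_query, dialect="databricks")
    return [projection.alias_or_name for projection in parsed.selects]

# "SELECT claim_status_code, COUNT(DISTINCT claim_id) AS claim_count FROM ev_claim ..."
# -> ["claim_status_code", "claim_count"]
```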