ADR-008: Mocks are opt-in, never silent fallback

Status: Accepted
Source: prd.md §19

Context

ADR-002 introduced MockLlm so the demo could run on a laptop with no AWS credentials. ADR-005 and ADR-007 inherited that pattern through get_llm() plus a try: bedrock; except LlmError: MockLlm() fallback in the Clarifier nodes. During the metric-aware Clarifier work, this fallback silently masked six distinct real failures in a row: tool-use schema rejection, deeply-nested-field truncation by Haiku 4.5, ~/.aws mounted read-only by Docker, the wrong AWS profile, env-var drift across docker compose up invocations, and marketplace access not granted for the configured BEDROCK_MODEL_ID. In every case the persisted spec carried the deterministic display_name: GeneratedMockCustomCard and assumptions: ["Offline MockLlm — deterministic placeholder…"], but the operator (and sometimes the agent) saw only "the widget shipped" and moved on. The fallback turned every Bedrock infrastructure failure into a Potemkin success. That is exactly the failure mode docs/lessons-learned.md had warned about, and it recurred six times in 48 hours.

Decision

  1. BUILDER_MODE is the single switch. backend/app/settings.py gains builder_mode: Literal["live", "offline"] = "live" and a resolved_builder_mode() helper that also honors the legacy USE_BEDROCK env var: either BUILDER_MODE=offline or USE_BEDROCK=false resolves the mode to offline. live is the default; missing or broken AWS configuration surfaces as an error rather than a silent fallback.
  2. get_llm() returns MockLlm only in offline mode. backend/app/widgets/llm.py: if mode == "offline": return MockLlm(). In live mode, Bedrock init failures raise a new BuilderModeError (a subclass of LlmError) and never return MockLlm(). The per-node except LlmError: MockLlm().generate_json(...) stanzas in intent_extractor.py and spec_synthesizer.py are deleted.
  3. The runner classifies and forwards exceptions. backend/app/widgets/runner.py _classify_exc translates BuilderModeError → kind: "builder_unavailable", LlmError → kind: "llm_error", else → "unknown", attaches the resolved builder_mode, and emits the SSE event: error payload. The result snapshot also carries builder_mode so the frontend can branch UX.
  4. The frontend renders a clear failure UI. frontend/src/widgets/useWidgetClarifier.ts tracks errorKind and builderMode alongside error. frontend/src/widgets/WidgetBuilderModal.tsx renders a rose error banner ("LLM unavailable (Bedrock)" + actionable copy + make up-offline instruction) when errorKind === "builder_unavailable", instead of silently rendering a placeholder spec.
  5. The header always shows the resolved mode. frontend/src/components/Header.tsx renders an amber OFFLINE MODE pill when dashboard_state.builder_mode === "offline" so the operator can never confuse the two paths. dashboard_state exposes builder_mode via backend/app/dashboard_state.py.
  6. make exposes both paths explicitly. make up runs live (default BUILDER_MODE=live, USE_BEDROCK=true); make up-offline runs offline (BUILDER_MODE=offline USE_BEDROCK=false). Demo reviewers picking the offline path get the badge and the deterministic mock; everyone else gets real Bedrock or a hard error.
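
Decisions 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code in backend/app/settings.py or backend/app/widgets/llm.py: it reads a plain environment mapping instead of the real settings object, the accepted spellings of a false USE_BEDROCK are an assumption, and the bedrock_factory parameter is a hypothetical stand-in for whatever builds the real Bedrock client.

```python
import os
from typing import Callable, Literal, Mapping, Optional

BuilderMode = Literal["live", "offline"]


class LlmError(Exception):
    """Base class for LLM failures (name taken from the ADR)."""


class BuilderModeError(LlmError):
    """Live mode was requested but Bedrock could not be initialised."""


class MockLlm:
    """Placeholder for the deterministic offline mock."""


def resolved_builder_mode(env: Mapping[str, str] = os.environ) -> BuilderMode:
    # Assumption: either switch can force offline; live is the default.
    builder_mode = env.get("BUILDER_MODE", "live").strip().lower()
    if builder_mode not in ("live", "offline"):
        raise ValueError(f"invalid BUILDER_MODE: {builder_mode!r}")
    # Assumption: these spellings count as "false" for the legacy variable.
    use_bedrock = env.get("USE_BEDROCK", "true").strip().lower()
    if builder_mode == "offline" or use_bedrock in ("false", "0", "no"):
        return "offline"
    return "live"


def get_llm(
    env: Mapping[str, str] = os.environ,
    bedrock_factory: Optional[Callable[[], object]] = None,  # hypothetical hook
):
    # The mock is returned only when offline was chosen explicitly.
    if resolved_builder_mode(env) == "offline":
        return MockLlm()
    try:
        if bedrock_factory is None:
            raise RuntimeError("no Bedrock client factory configured")
        return bedrock_factory()
    except Exception as exc:
        # Fail loudly: never substitute MockLlm() in live mode.
        raise BuilderModeError(f"boto3 init failed: {exc}") from exc
```

The point of the shape is that MockLlm() has exactly one construction site, guarded by the resolved mode, so no except branch can reach it.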

Consequences

  • AWS misconfiguration is now visible in seconds instead of after a debugging session. The end-to-end test of breaking AWS_PROFILE on purpose now produces a clean rose banner saying "boto3 init failed: The config profile (does-not-exist) could not be found" — exactly the message you need to fix the problem.
  • ADR-002's offline-demo guarantee is preserved (make up-offline still produces a working preview/persist flow against MockLlm), but the silent-substitution failure mode is gone.
  • Six lessons-learned entries about Bedrock infra (model access, SSO cache, env-var drift, etc.) move from "footgun you have to remember" to "footgun the system tells you about". They remain in the doc as historical context, but the new entry "Mocks must be opt-in, never silent fallback" governs.
  • One small backwards-incompatible change for local dev: USE_BEDROCK=false alone resolves the mode to offline. It always did, but the UI now signals it. Anyone who had USE_BEDROCK=false in a local env file and relied on the previous behavior ("Bedrock disabled but no offline pill") should set BUILDER_MODE=offline explicitly and accept the badge. This is the intended outcome.
  • The fail-loud discipline of this ADR extends to the newer gates: ADR-PROTO-002 (free-text SQL rejection), the Databricks health endpoint (RFC 7807 503), and the backend/app/sql_gen/routing.py boot validator (PRD v2.1 §C.5.3) all mirror this shape.
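
To make the first consequence concrete, the sketch below shows how a broken AWS_PROFILE might travel from the runner to the banner. It is an illustration of the classification described in Decision 3, not the actual _classify_exc in backend/app/widgets/runner.py: the payload field name message and the helper sse_error_event are assumptions, and only the kind values and the builder_mode field come from this ADR.

```python
import json
from typing import Literal

BuilderMode = Literal["live", "offline"]


class LlmError(Exception):
    """Base class for LLM failures (name taken from the ADR)."""


class BuilderModeError(LlmError):
    """Live mode was requested but Bedrock could not be initialised."""


def classify_exc(exc: BaseException, builder_mode: BuilderMode) -> dict:
    # Check the subclass first; BuilderModeError is also an LlmError.
    if isinstance(exc, BuilderModeError):
        kind = "builder_unavailable"
    elif isinstance(exc, LlmError):
        kind = "llm_error"
    else:
        kind = "unknown"
    # "message" is a hypothetical field name for the human-readable text.
    return {"kind": kind, "message": str(exc), "builder_mode": builder_mode}


def sse_error_event(payload: dict) -> str:
    # One Server-Sent Events frame: event name, JSON data, blank-line terminator.
    return f"event: error\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    exc = BuilderModeError(
        "boto3 init failed: The config profile (does-not-exist) could not be found"
    )
    print(sse_error_event(classify_exc(exc, "live")), end="")
```

Because the frontend branches on kind rather than parsing the message, the message text stays free to carry whatever boto3 reported.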

Cross-references