ADR-011: Two-tier Bedrock model selection (fast vs reasoning)

Status: Accepted
Source: new (driven by 2026-05-06 model-access probe + accumulated lessons-learned)
Date: 2026-05-06
Context
Until today, the entire backend used a single `BEDROCK_MODEL_ID` env var, resolved at import time in `app.settings.Settings`. That coupled four very different LLM workloads to one model:
- Clarifier intent extractor (`backend/app/widgets/nodes/intent_extractor.py:111`) — single-pass parse from a user prompt to a flat structured payload (`type`, `mode`, `metric_id_guess`, dimensions, time window).
- Clarifier data spec synthesizer (`backend/app/widgets/nodes/spec_synthesizer.py:265`) — composes a deeply nested `WidgetSpec` (kpi/chart/table) with an internally consistent `metric` block, axes, series, columns, `mock_data`, and `data_intent`.
- Custom-widget codegen (`backend/app/widgets/nodes/spec_synthesizer.py:557`) — generates ~100 lines of TSX that must compile under Babel, render under React, type-check against a generated `Props` interface, and respect the no-imports contract from ADR-006.
- SQL generator (`backend/app/sql_gen/generator.py:537`) — translates a `MetricDefinition` plus a dictionary slice (potentially several tables, dozens of columns, plus AI guidelines) into a single safe `SELECT` that passes the safety layer in `app.sql_gen.safety`.
Two anchor pieces of evidence forced the rethink:
- `docs/lessons-learned.md` § "Haiku 4.5 silently drops deeply-nested fields — split into stages" — the codegen path required a Python-side staging workaround because the only model with tool-use access (Haiku 4.5) truncated nested required fields.
- A 2026-05-06 model-access probe against `hackathon-async`:

| Model | Tool-use status |
| --- | --- |
| `us.anthropic.claude-sonnet-4-20250514-v1:0` | LEGACY — AccessDenied |
| `us.anthropic.claude-sonnet-4-5-20250929-v1:0` | AWS Marketplace subscription not enabled |
| `us.anthropic.claude-sonnet-4-6` | Works ✓ |
| `us.anthropic.claude-opus-4-5-20251101-v1:0` | Works ✓ |
| `us.anthropic.claude-opus-4-7` | Works ✓ |
| `us.anthropic.claude-haiku-4-5-20251001-v1:0` | Works (drops nested fields) |

The original lesson was written when Sonnet 4 + Haiku 4.5 were the only options. That world no longer exists.
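A probe of this shape can be sketched as follows. This is a hedged reconstruction, not the actual probe script: the helper names (`classify`, `probe`) are hypothetical, it assumes `boto3` plus a configured `hackathon-async` AWS profile, and it only checks basic `converse` access — a full tool-use probe would additionally pass a `toolConfig`.

```python
# Sketch of a Bedrock model-access probe (hypothetical helper names;
# assumes boto3 and a configured AWS profile with Bedrock access).

STATUS_BY_ERROR = {
    # Map known ClientError codes to the short statuses used in the table
    # above; unknown codes fall through unchanged.
    "AccessDeniedException": "AccessDenied",
}


def classify(error_code: str) -> str:
    """Translate a Bedrock ClientError code into a short probe status."""
    return STATUS_BY_ERROR.get(error_code, error_code)


def probe(model_id: str, profile: str = "hackathon-async") -> str:
    """Send a minimal converse() call and report whether the model answers."""
    import boto3  # lazy import: only needed for a live probe
    from botocore.exceptions import ClientError

    client = boto3.Session(profile_name=profile).client("bedrock-runtime")
    try:
        client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": "ping"}]}],
            inferenceConfig={"maxTokens": 1},
        )
        return "Works"
    except ClientError as exc:
        return classify(exc.response["Error"]["Code"])
```

Running `probe()` over the candidate model ids and printing the results yields a table like the one above.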
A single global model id forces a global trade-off: pick fast/cheap and the reasoning surfaces (spec synth, codegen, SQL gen) regress; pick top-tier and the cheap fan-out paths (intent extractor) become 5–10× more expensive than they need to be; and a future surface may need a third option entirely.
Decision
- Introduce two model tiers as first-class concepts, each backed by an independent setting on `app.settings.Settings`:
  - `bedrock_model_id` (env: `BEDROCK_MODEL_ID`) — fast tier, default `us.anthropic.claude-sonnet-4-6`.
  - `bedrock_reasoning_model_id` (env: `BEDROCK_REASONING_MODEL_ID`) — reasoning tier, default `us.anthropic.claude-opus-4-7`.
- Extend the LLM factory `app.widgets.llm.get_llm` with a typed `purpose: Literal["fast", "reasoning"] = "fast"` argument. Tier-to-id resolution is the factory's job, encapsulated in `_resolve_model_id(purpose)`. Call sites name the intent, never the model id directly.
- Annotate each existing call site with its tier choice and a short justification:

| File:line | Tier | Why |
| --- | --- | --- |
| `backend/app/widgets/nodes/intent_extractor.py:111` | fast | Single-pass parse; downstream nodes catch misclassification |
| `backend/app/widgets/nodes/spec_synthesizer.py:265` | reasoning | Multi-field `WidgetSpec` with internally consistent nested objects |
| `backend/app/widgets/nodes/spec_synthesizer.py:557` | reasoning | Custom-widget TSX codegen — the heaviest reasoning surface in the stack |
| `backend/app/sql_gen/generator.py:537` | fast | Flat tool-input schema (`{ sql, tables_used, explanation }`) plus a 5s `generation_timeout_s`; Sonnet 4.6 lands in ~3–4s against today's single-metric workload, while Opus 4.7 clocked 11–19s on the same slice on 2026-05-06 (see Consequences). The safety layer in `app.sql_gen.safety` is the actual correctness guarantee, not the model. Phase 2 multi-table joins can opt into reasoning per call site or relax the budget in `config/sql_generator.yaml`. |

- The default for `get_llm()` (no argument) stays `fast`, so existing callers behave the same after the refactor — reasoning-heavy callers opt in explicitly.
- Offline mode (`BUILDER_MODE=offline`) continues to bypass tiering entirely — `MockLlm` is purpose-agnostic by design (the canned payloads do not depend on the model). The factory drops the `purpose` label on the floor in offline mode, without error or warning.
- Per the PRD, `_DEFAULT_BEDROCK_MODEL` in `backend/app/sql_gen/generator.py:103` is removed. It was dead code masquerading as configuration — the model came from `settings.bedrock_model_id` via `get_llm()` regardless of that constant's value. The module docstring is updated to reflect the tier choice.
- Pin both tiers in `.env`, plus `AWS_PROFILE=hackathon-async`. Without this, a `make up` from a fresh shell silently reverts to the docker-compose default, and a stale-profile SSO cache produces `ExpiredTokenException` at first invoke. Documented in `docs/lessons-learned.md` § "Watch the live env vars on `make up`".
Consequences
- Spec synth and codegen quality jump sharply. The reasoning tier on Opus 4.7 retires the Haiku 4.5 "silently drops deeply-nested fields" failure class. The Python-side staging workaround in `spec_synthesizer._custom_synth` stays in place defensively (cheap insurance — if a future demo throttles Opus quota, the fallback still produces a valid spec).
- SQL gen stays on the fast tier intentionally. A 2026-05-06 live probe routed SQL gen to Opus 4.7 and observed 11–19s end-to-end against the existing `generation_timeout_s: 5` budget in `config/sql_generator.yaml`. The widget data resolver therefore degraded to `bedrock_unavailable` and rendered an amber `SourceBadge` for healthy infrastructure — a regression with no quality upside on today's single-metric workload (`COUNT(DISTINCT claim_id)` over 30 days). Sonnet 4.6 lands in ~3–4s on the same slice with full tool-use access and no dropped-field failures (the safety layer in `app.sql_gen.safety` is what actually enforces correctness, not the model). Escalation paths for Phase 2 multi-table joins / window functions: (a) opt the call site into the reasoning tier locally with `get_llm("reasoning")`, (b) relax `generation_timeout_s` to ~25s in the YAML, or (c) split the generator into a routing pass (fast) plus a hard-case pass (reasoning). Until Phase 2 produces a metric where Sonnet 4.6 actually fails, none of these are warranted.
- Per-token cost rises ~3–10× on the reasoning surfaces. This is negligible at the prototype's call volume (one Clarifier round + one spec synth + one SQL gen per widget creation, plus cached resolver hits on the dashboard) — well under $0.05 per widget end-to-end.
- Two env vars instead of one. The lessons-learned discipline of verifying with `docker exec ... env | grep BEDROCK_` after every `make up` now requires both vars in the grep. `.env` carries both as committed defaults, so any shell's `make up` stays consistent.
- Tier escalation is a one-line change. A future surface that wants the reasoning tier just calls `get_llm("reasoning")`; a future tier (e.g. `"vision"` for a screenshot-grounded synthesizer) is a new branch in `_resolve_model_id` plus a new setting — not a refactor.
- Out of scope: per-call-site model overrides (e.g. codegen always using Opus 4.7 even if the reasoning tier is downgraded to Sonnet). Two tiers cover today's needs; if a future demand wants finer granularity, add a third tier rather than overriding per call site.
Cross-references
- ADR-002 — original Bedrock client introduction.
- ADR-005 — Add Widget Clarifier (LangGraph) where three of the four call sites live.
- ADR-006 — custom-widget renderer + no-imports contract that codegen must respect.
- ADR-008 — mocks-as-opt-in. The `purpose` argument is dropped on the floor in offline mode by design.
- Implementation: `backend/app/settings.py` (tier settings), `backend/app/widgets/llm.py` (factory), `backend/tests/test_llm_factory.py` (contract tests).
- Probe receipts + model-access history: `docs/lessons-learned.md` § "Bedrock model access is per-feature, not per-account".