
ADR-011: Two-tier Bedrock model selection (fast vs reasoning)

Status: Accepted
Source: new (driven by the 2026-05-06 model-access probe + accumulated lessons-learned)
Date: 2026-05-06

Context

Until today the entire backend used a single BEDROCK_MODEL_ID env var, resolved at import time in app.settings.Settings. That coupled four very different LLM workloads to one model:

  1. Clarifier intent extractor (backend/app/widgets/nodes/intent_extractor.py:111) — single-pass parse from a user prompt to a flat structured payload (type, mode, metric_id_guess, dimensions, time window).
  2. Clarifier data spec synthesizer (backend/app/widgets/nodes/spec_synthesizer.py:265) — composes a deeply-nested WidgetSpec (kpi/chart/table) with an internally-consistent metric block, axes, series, columns, mock_data, and data_intent.
  3. Custom-widget codegen (backend/app/widgets/nodes/spec_synthesizer.py:557) — generates ~100 lines of TSX that must compile under Babel, render under React, type-check against a generated Props interface, and respect the no-imports contract from ADR-006.
  4. SQL generator (backend/app/sql_gen/generator.py:537) — translates a MetricDefinition plus a dictionary slice (potentially several tables, dozens of columns, plus AI guidelines) into a single safe SELECT that passes the safety layer in app.sql_gen.safety.

Two anchor pieces of evidence forced the rethink:

  • docs/lessons-learned.md § "Haiku 4.5 silently drops deeply-nested fields — split into stages" — the codegen path required a Python-side staging workaround because the only model with tool-use access (Haiku 4.5) truncated nested required fields.
  • A 2026-05-06 model-access probe against hackathon-async:

    | Model | Tool-use status |
    | --- | --- |
    | us.anthropic.claude-sonnet-4-20250514-v1:0 | LEGACY — AccessDenied |
    | us.anthropic.claude-sonnet-4-5-20250929-v1:0 | AWS Marketplace subscription not enabled |
    | us.anthropic.claude-sonnet-4-6 | Works |
    | us.anthropic.claude-opus-4-5-20251101-v1:0 | Works |
    | us.anthropic.claude-opus-4-7 | Works |
    | us.anthropic.claude-haiku-4-5-20251001-v1:0 | Works (drops nested fields) |

    The original lesson was written when Sonnet 4 + Haiku 4.5 were the only options. That world no longer exists.

A single global model id forces a global trade-off: pick fast/cheap and the reasoning surfaces (spec synth, codegen, SQL gen) regress; pick top-tier and the cheap fan-out paths (intent extractor) become 5–10× more expensive than they need to be; and a future surface may need a third option entirely.

Decision

  1. Introduce two model tiers as first-class concepts, each backed by an independent setting on app.settings.Settings:

    • bedrock_model_id (env: BEDROCK_MODEL_ID) — fast tier, default us.anthropic.claude-sonnet-4-6.
    • bedrock_reasoning_model_id (env: BEDROCK_REASONING_MODEL_ID) — reasoning tier, default us.anthropic.claude-opus-4-7.
  2. Extend the LLM factory app.widgets.llm.get_llm with a typed purpose: Literal["fast", "reasoning"] = "fast" argument. Tier-to-id resolution is the factory's job, encapsulated in _resolve_model_id(purpose). Call sites name the intent, never the model id directly (see the sketch after this list).

  3. Annotate each existing call site with its tier choice and a short justification:

    | File:line | Tier | Why |
    | --- | --- | --- |
    | backend/app/widgets/nodes/intent_extractor.py:111 | fast | Single-pass parse, downstream nodes catch misclassification |
    | backend/app/widgets/nodes/spec_synthesizer.py:265 | reasoning | Multi-field WidgetSpec with internally-consistent nested objects |
    | backend/app/widgets/nodes/spec_synthesizer.py:557 | reasoning | Custom-widget TSX codegen — heaviest reasoning surface in the stack |
    | backend/app/sql_gen/generator.py:537 | fast | Flat tool-input schema ({ sql, tables_used, explanation }) + 5s generation_timeout_s; Sonnet 4.6 lands in ~3–4s against today's single-metric workload, Opus 4.7 clocked 11–19s on the same slice on 2026-05-06 (see Consequences). Safety layer in app.sql_gen.safety is the actual correctness guarantee, not the model. Phase 2 multi-table joins can opt into reasoning per call site or relax the budget in config/sql_generator.yaml. |
  4. The default for get_llm() (no argument) stays fast so existing callers behave the same after the refactor — reasoning-heavy callers opt in explicitly (usage sketched after this list).

  5. Offline mode (BUILDER_MODE=offline) continues to bypass tiering entirely — MockLlm is purpose-agnostic by design (the canned payloads do not depend on the model). In offline mode the factory silently ignores the purpose label, with no error or warning.

  6. Per-PRD _DEFAULT_BEDROCK_MODEL in backend/app/sql_gen/generator.py:103 is removed. It was dead code masquerading as configuration — the model came from settings.bedrock_model_id via get_llm() regardless of that constant's value. Module docstring updated to reflect the tier choice.

  7. Pin both tiers in .env plus AWS_PROFILE=hackathon-async. Without this, a make up from a fresh shell silently reverts to the docker-compose default, and a stale-profile SSO cache produces ExpiredTokenException at first invoke. Documented in docs/lessons-learned.md § Watch the live env vars on make up.
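
A minimal sketch of the shape described in items 1, 2, and 5 above, assuming a pydantic-settings style Settings class; MockLlm and _build_bedrock_chat_client here are placeholder stand-ins for whatever app.widgets.llm and the offline mock already do, not the real implementations:

```python
from typing import Literal

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Fast tier (env: BEDROCK_MODEL_ID)
    bedrock_model_id: str = "us.anthropic.claude-sonnet-4-6"
    # Reasoning tier (env: BEDROCK_REASONING_MODEL_ID)
    bedrock_reasoning_model_id: str = "us.anthropic.claude-opus-4-7"
    # Offline switch (env: BUILDER_MODE); simplified here
    builder_mode: str = "live"


settings = Settings()

Purpose = Literal["fast", "reasoning"]


class MockLlm:
    """Stand-in for the existing offline mock client (purpose-agnostic)."""


def _build_bedrock_chat_client(model_id: str):
    """Placeholder for however the real factory constructs its Bedrock chat client."""
    ...


def _resolve_model_id(purpose: Purpose) -> str:
    """Tier-to-id resolution is the factory's job; call sites never see model ids."""
    if purpose == "reasoning":
        return settings.bedrock_reasoning_model_id
    return settings.bedrock_model_id


def get_llm(purpose: Purpose = "fast"):
    # Offline mode bypasses tiering: the purpose label is accepted and ignored,
    # because MockLlm's canned payloads do not depend on the model.
    if settings.builder_mode == "offline":
        return MockLlm()
    return _build_bedrock_chat_client(_resolve_model_id(purpose))
```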
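
Call sites then name intent rather than model ids. A usage sketch for the two ends of the table in item 3 (the surrounding node code is elided):

```python
# Intent extractor (cheap fan-out path): no argument, identical behaviour to
# the pre-refactor single-model world.
llm = get_llm()

# Spec synthesizer / custom-widget codegen: opts into the reasoning tier explicitly.
llm = get_llm("reasoning")
```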

Consequences

  • Spec synth and codegen quality jumps sharply. The reasoning tier on Opus 4.7 retires the "Haiku 4.5 silently drops deeply-nested fields" failure class. The Python-side staging workaround in spec_synthesizer._custom_synth stays in place defensively (cheap insurance — if Opus quota gets throttled during a future demo, the fallback still produces a valid spec).
  • SQL gen stays on the fast tier intentionally. A 2026-05-06 live probe routed SQL gen to Opus 4.7 and observed 11–19s end-to-end against the existing generation_timeout_s: 5 budget in config/sql_generator.yaml. The widget data resolver therefore degraded to bedrock_unavailable and rendered an amber SourceBadge for healthy infrastructure — a regression with no quality upside on today's single-metric workload (COUNT(DISTINCT claim_id) over 30 days). Sonnet 4.6 lands in ~3–4s on the same slice with full tool-use access and no dropped-field failures (the safety layer in app.sql_gen.safety is what actually enforces correctness, not the model). Escalation paths for Phase 2 multi-table joins / window functions: (a) opt the call site into the reasoning tier locally with get_llm("reasoning"), (b) relax generation_timeout_s to ~25s in the YAML, or (c) split the generator into a routing pass (fast) plus a hard-case pass (reasoning), sketched after this list. Until Phase 2 produces a metric where Sonnet 4.6 actually fails, none of these are warranted.
  • Per-token cost rises ~3–10× on the reasoning surfaces. Negligible at the prototype's call volume (one Clarifier round + one spec synth + one SQL gen per widget creation, plus cached resolver hits on the dashboard) — well under $0.05 per widget end-to-end.
  • Two env vars instead of one. The lessons-learned discipline of verifying with docker exec ... env | grep BEDROCK_ after every make up now requires both vars in the grep. .env carries both as committed defaults so any shell's make up stays consistent.
  • Tier escalation is a one-line change. A future surface that wants the reasoning tier just calls get_llm("reasoning"); a future tier (e.g. "vision" for a screenshot-grounded synthesizer) is a new branch in _resolve_model_id plus a new setting — not a refactor.
  • Out of scope: per-call-site model overrides (e.g. codegen always uses Opus 4.7 even if the reasoning tier is downgraded to Sonnet). Two tiers cover today's needs; if a future surface demands finer granularity, add a third tier rather than overriding per call site.
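
If Phase 2 ever warrants escalation path (c), the routing pass could be as small as the sketch below. The helper name, its inputs, and the thresholds are hypothetical; nothing like it exists in the generator today:

```python
from typing import Literal

Purpose = Literal["fast", "reasoning"]


def choose_sql_gen_tier(table_count: int, uses_window_functions: bool) -> Purpose:
    """Hypothetical routing pass: escalate only the hard cases to the reasoning tier.

    The conditions are illustrative; today's single-metric workload would always
    route "fast", which is exactly the behaviour this ADR locks in.
    """
    if table_count > 1 or uses_window_functions:
        return "reasoning"
    return "fast"


# The generator would then request its client per call, e.g.:
#   llm = get_llm(choose_sql_gen_tier(table_count, uses_window_functions))
```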

Cross-references