
ADR-011: Two-tier Bedrock model selection (fast vs reasoning)

Status: Accepted
Source: new (driven by the 2026-05-06 model-access probe + accumulated lessons-learned)
Date: 2026-05-06

Context

Until today the entire backend used a single BEDROCK_MODEL_ID env var, resolved at import time in app.settings.Settings. That coupled four very different LLM workloads to one model:

  1. Clarifier intent extractor (backend/app/widgets/nodes/intent_extractor.py:111) — single-pass parse from a user prompt to a flat structured payload (type, mode, metric_id_guess, dimensions, time window).
  2. Clarifier data spec synthesizer (backend/app/widgets/nodes/spec_synthesizer.py:265) — composes a deeply-nested WidgetSpec (kpi/chart/table) with an internally-consistent metric block, axes, series, columns, mock_data, and data_intent.
  3. Custom-widget codegen (backend/app/widgets/nodes/spec_synthesizer.py:557) — generates ~100 lines of TSX that must compile under Babel, render under React, type-check against a generated Props interface, and respect the no-imports contract from ADR-006.
  4. SQL generator (backend/app/sql_gen/generator.py:537) — translates a MetricDefinition plus a dictionary slice (potentially several tables, dozens of columns, plus AI guidelines) into a single safe SELECT that passes the safety layer in app.sql_gen.safety.

Two anchor pieces of evidence forced the rethink:

  • docs/lessons-learned.md § "Haiku 4.5 silently drops deeply-nested fields — split into stages" — the codegen path required a Python-side staging workaround because the only model with tool-use access (Haiku 4.5) truncated nested required fields.
  • A 2026-05-06 model-access probe against hackathon-async:

    | Model | Tool-use status |
    | --- | --- |
    | us.anthropic.claude-sonnet-4-20250514-v1:0 | LEGACY — AccessDenied |
    | us.anthropic.claude-sonnet-4-5-20250929-v1:0 | AWS Marketplace subscription not enabled |
    | us.anthropic.claude-sonnet-4-6 | Works |
    | us.anthropic.claude-opus-4-5-20251101-v1:0 | Works |
    | us.anthropic.claude-opus-4-7 | Works |
    | us.anthropic.claude-haiku-4-5-20251001-v1:0 | Works (drops nested fields) |

    The original lesson was written when Sonnet 4 + Haiku 4.5 were the only options. That world no longer exists.

A single global model id forces a global trade-off: pick fast/cheap and the reasoning surfaces (spec synth, codegen, SQL gen) regress; pick top-tier and the cheap fan-out paths (intent extractor) become 5–10× more expensive than they need to be; and a future surface may need a third option entirely.

Decision

  1. Introduce two model tiers as first-class concepts, each backed by an independent setting on app.settings.Settings:

    • bedrock_model_id (env: BEDROCK_MODEL_ID) — fast tier, default us.anthropic.claude-sonnet-4-6.
    • bedrock_reasoning_model_id (env: BEDROCK_REASONING_MODEL_ID) — reasoning tier, default us.anthropic.claude-opus-4-7.
  2. Extend the LLM factory app.widgets.llm.get_llm with a typed purpose: Literal["fast", "reasoning"] = "fast" argument. Tier-to-id resolution is the factory's job, encapsulated in _resolve_model_id(purpose). Call sites name the intent, never the model id directly (see the sketch after this list).

  3. Annotate each existing call site with its tier choice and a short justification:

    | File:line | Tier | Why |
    | --- | --- | --- |
    | backend/app/widgets/nodes/intent_extractor.py:111 | fast | Single-pass parse, downstream nodes catch misclassification |
    | backend/app/widgets/nodes/spec_synthesizer.py:265 | reasoning | Multi-field WidgetSpec with internally-consistent nested objects |
    | backend/app/widgets/nodes/spec_synthesizer.py:557 | reasoning | Custom-widget TSX codegen — heaviest reasoning surface in the stack |
    | backend/app/sql_gen/generator.py:537 | fast | Flat tool-input schema ({ sql, tables_used, explanation }) + 5s generation_timeout_s; Sonnet 4.6 lands in ~3–4s against today's single-metric workload, Opus 4.7 clocked 11–19s on the same slice on 2026-05-06 (see Consequences). Safety layer in app.sql_gen.safety is the actual correctness guarantee, not the model. Phase 2 multi-table joins can opt into reasoning per call site or relax the budget in config/sql_generator.yaml. |
  4. The default for get_llm() (no argument) stays fast so existing callers behave the same after the refactor — reasoning-heavy callers opt in explicitly (usage sketched after this list).

  5. Offline mode (BUILDER_MODE=offline) continues to bypass tiering entirely — MockLlm is purpose-agnostic by design (the canned payloads do not depend on the model). In offline mode the factory silently ignores the purpose label, with no error or warning.

  6. Per-PRD _DEFAULT_BEDROCK_MODEL in backend/app/sql_gen/generator.py:103 is removed. It was dead code masquerading as configuration — the model came from settings.bedrock_model_id via get_llm() regardless of that constant's value. Module docstring updated to reflect the tier choice.

  7. Pin both tiers in .env plus AWS_PROFILE=hackathon-async. Without this, a make up from a fresh shell silently reverts to the docker-compose default, and a stale-profile SSO cache produces ExpiredTokenException at first invoke. Documented in docs/lessons-learned.md § Watch the live env vars on make up.
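
A minimal sketch of the shape described in items 1, 2, and 5 above, assuming a pydantic-settings style Settings class; MockLlm and _build_bedrock_chat_client here are placeholder stand-ins for whatever app.widgets.llm and the offline mock already do, not the real implementations:

```python
from typing import Literal

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Fast tier (env: BEDROCK_MODEL_ID)
    bedrock_model_id: str = "us.anthropic.claude-sonnet-4-6"
    # Reasoning tier (env: BEDROCK_REASONING_MODEL_ID)
    bedrock_reasoning_model_id: str = "us.anthropic.claude-opus-4-7"
    # Offline switch (env: BUILDER_MODE); simplified here
    builder_mode: str = "live"


settings = Settings()

Purpose = Literal["fast", "reasoning"]


class MockLlm:
    """Stand-in for the existing offline mock client (purpose-agnostic)."""


def _build_bedrock_chat_client(model_id: str):
    """Placeholder for however the real factory constructs its Bedrock chat client."""
    ...


def _resolve_model_id(purpose: Purpose) -> str:
    """Tier-to-id resolution is the factory's job; call sites never see model ids."""
    if purpose == "reasoning":
        return settings.bedrock_reasoning_model_id
    return settings.bedrock_model_id


def get_llm(purpose: Purpose = "fast"):
    # Offline mode bypasses tiering: the purpose label is accepted and ignored,
    # because MockLlm's canned payloads do not depend on the model.
    if settings.builder_mode == "offline":
        return MockLlm()
    return _build_bedrock_chat_client(_resolve_model_id(purpose))
```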
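
Call sites then name intent rather than model ids. A usage sketch for the two ends of the table in item 3 (the surrounding node code is elided):

```python
# Intent extractor (cheap fan-out path): no argument, identical behaviour to
# the pre-refactor single-model world.
llm = get_llm()

# Spec synthesizer / custom-widget codegen: opts into the reasoning tier explicitly.
llm = get_llm("reasoning")
```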

Consequences

  • Spec synth and codegen quality jumps sharply. The reasoning tier on Opus 4.7 retires the "Haiku 4.5 silently drops deeply-nested fields" failure class. The Python-side staging workaround in spec_synthesizer._custom_synth stays in place defensively (cheap insurance — if Opus quota gets throttled during a future demo, the fallback still produces a valid spec).
  • SQL gen stays on the fast tier intentionally. A 2026-05-06 live probe routed SQL gen to Opus 4.7 and observed 11–19s end-to-end against the existing generation_timeout_s: 5 budget in config/sql_generator.yaml. The widget data resolver therefore degraded to bedrock_unavailable and rendered an amber SourceBadge for healthy infrastructure — a regression with no quality upside on today's single-metric workload (COUNT(DISTINCT claim_id) over 30 days). Sonnet 4.6 lands in ~3–4s on the same slice with full tool-use access and no dropped-field failures (the safety layer in app.sql_gen.safety is what actually enforces correctness, not the model). Escalation paths for Phase 2 multi-table joins / window functions: (a) opt the call site into the reasoning tier locally with get_llm("reasoning"), (b) relax generation_timeout_s to ~25s in the YAML, or (c) split the generator into a routing pass (fast) plus a hard-case pass (reasoning), sketched after this list. Until Phase 2 produces a metric where Sonnet 4.6 actually fails, none of these are warranted.
  • Per-token cost rises ~3–10× on the reasoning surfaces. Negligible at the prototype's call volume (one Clarifier round + one spec synth + one SQL gen per widget creation, plus cached resolver hits on the dashboard) — well under $0.05 per widget end-to-end.
  • Two env vars instead of one. The lessons-learned discipline of verifying with docker exec ... env | grep BEDROCK_ after every make up now requires both vars in the grep. .env carries both as committed defaults so any shell's make up stays consistent.
  • Tier escalation is a one-line change. A future surface that wants the reasoning tier just calls get_llm("reasoning"); a future tier (e.g. "vision" for a screenshot-grounded synthesizer) is a new branch in _resolve_model_id plus a new setting — not a refactor.
  • Out of scope: per-call-site model overrides (e.g. codegen always uses Opus 4.7 even if the reasoning tier is downgraded to Sonnet). Two tiers cover today's needs; if a future surface demands finer granularity, add a third tier rather than overriding per call site.
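
If Phase 2 ever warrants escalation path (c), the routing pass could be as small as the sketch below. The helper name, its inputs, and the thresholds are hypothetical; nothing like it exists in the generator today:

```python
from typing import Literal

Purpose = Literal["fast", "reasoning"]


def choose_sql_gen_tier(table_count: int, uses_window_functions: bool) -> Purpose:
    """Hypothetical routing pass: escalate only the hard cases to the reasoning tier.

    The conditions are illustrative; today's single-metric workload would always
    route "fast", which is exactly the behaviour this ADR locks in.
    """
    if table_count > 1 or uses_window_functions:
        return "reasoning"
    return "fast"


# The generator would then request its client per call, e.g.:
#   llm = get_llm(choose_sql_gen_tier(table_count, uses_window_functions))
```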

Cross-references