19. Error Visibility Policy

Status

Accepted — 2026-04-26

Context

The project has accumulated error-handling patterns organically across frontend components, Cloud Functions, and infrastructure. Some are well-surfaced (Ernest AI maps distinct HttpsError codes to distinct COPY keys). Others are swallowed (LinkedIn dispatch silently disables itself when Remote Config fails, returning HTTP 200 to Cloud Scheduler with no UI indication).

The problem has two audiences:

  1. Users need to know when their action failed — with enough context to retry or give up. A contact form that swallows a rate-limit error and shows nothing is worse than one that says “You’ve hit the daily limit.”

  2. The operator needs to know when something broke under the hood — with enough context to diagnose without reading Cloud Run logs. A function that catches an error, logs it to stderr, and returns 200 leaves the operator blind unless they’re actively watching logs.

These are different channels with different tools, but the same policy: no silent failures. Every catch block must answer both questions — “does the user find out?” and “does the operator find out?” — and silence on either channel must be a deliberate, documented choice, not an oversight.

The project already has partial enforcement: the beacon agent audits frontend error surfacing, and Cloud Monitoring (see ADR 0012, infra/monitoring.tf) alerts on backend severity>=ERROR logs. But neither tool enforces the full policy, and the policy itself has never been written down. New code inherits whatever pattern the author happened to copy from.

Decision

Every failure path has exactly two responsibilities: tell the user and tell the operator. Neither is optional unless explicitly exempted.

Error severity tiers

| Tier | User channel | Operator channel | Example |
|------|--------------|------------------|---------|
| User-recoverable | UI feedback (toast, inline error, form field) | None required | Bad input, network retry |
| User-blocking | UI feedback | Sentry captureException | Post won’t load, form won’t submit |
| System failure | UI feedback if user-facing; none if background | Cloud Monitoring alert (structured log at severity>=ERROR) | Function crash, API quota exhausted |
| Silent degradation | None | Structured log at severity>=WARNING | Config fallback, cache miss, optional feature disabled |

The default is the strictest tier that applies. Downgrading (e.g., treating a user-blocking error as user-recoverable) requires a code comment explaining why.
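
A minimal sketch of the comment such a downgrade requires, built around a hypothetical avatar loader (the function names and fallback value are illustrative, not from the codebase):

```typescript
// Hypothetical avatar loader; fetchAvatar and the fallback value are illustrative.
async function loadAvatar(
  fetchAvatar: (id: string) => Promise<string>,
  userId: string,
): Promise<string> {
  try {
    return await fetchAvatar(userId);
  } catch {
    // Downgrade: user-blocking -> user-recoverable (ADR 0019 tier table).
    // A missing avatar degrades to initials; the page stays usable and a
    // reload retries the fetch. Deliberately not calling captureException.
    return 'fallback-initials';
  }
}
```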

Frontend rules

  1. User-facing error strings live in copy.json. No inline strings. Import via COPY from @app/content.
  2. Distinct HttpsError codes produce distinct user messages. Collapsing resource-exhausted, invalid-argument, and failed-precondition into one generic string is a defect (see the sketch after this list).
  3. Caught errors that block the user must call captureException. The global SentryErrorHandler only catches uncaught errors. Anything in a catch block is invisible to Sentry unless explicitly captured.
  4. Error state must clear on retry or navigation. A spinner that never stops or an error banner that outlives its cause is a stale-state bug.
  5. No catch(() => {}). Every catch must do at least one of: update UI state, call captureException, or carry a comment explaining why silence is intentional.
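
The sketch below exercises rules 1 through 5 in a single handler: no branch is silent, distinct codes get distinct messages, and error state is cleared on retry. It assumes captureException is re-exported from the project's sentry.ts and that the COPY keys shown exist in copy.json; the handler shape and key names are illustrative:

```typescript
import { COPY } from '@app/content';           // rule 1: no inline strings
import { captureException } from './sentry';   // assumed re-export of Sentry's captureException
import { FirebaseError } from 'firebase/app';

interface FormState {
  pending: boolean;
  error: string | null;
}

// Hypothetical submit handler; COPY keys are illustrative.
async function submitContactForm(send: () => Promise<void>, ui: FormState): Promise<void> {
  ui.pending = true;
  ui.error = null;                             // rule 4: clear stale error state on retry
  try {
    await send();
  } catch (err) {
    const code = err instanceof FirebaseError ? err.code : 'unknown';
    switch (code) {                            // rule 2: distinct codes, distinct messages
      case 'functions/resource-exhausted':
        ui.error = COPY.contact.rateLimited;   // user-recoverable: UI feedback only
        break;
      case 'functions/invalid-argument':
        ui.error = COPY.contact.invalidInput;
        break;
      default:
        ui.error = COPY.contact.genericFailure; // user-blocking tier, so:
        captureException(err);                 // rule 3: catch blocks are invisible to Sentry otherwise
    }
  } finally {
    ui.pending = false;                        // rule 4: the spinner must not outlive its cause
  }
}
```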

Backend rules

  1. Structured JSON logging for alertable failures. console.error(JSON.stringify({ event: '...', ... })) with a stable event field that Cloud Monitoring can filter on (see the sketch after this list).
  2. Every structured error event must have a matching Cloud Monitoring alert policy in infra/monitoring.tf — or an explicit comment explaining why it’s log-only.
  3. Functions that return success to schedulers must actually have succeeded. Returning HTTP 200 after swallowing an error prevents Cloud Scheduler retry and suppresses alerting. If the operation was skipped or degraded, the function should either return non-200 or write a status record that surfaces in the admin UI.
  4. Secrets and config failures are not silent-degradation tier. If a function can’t read its configuration, that’s a system failure — log at ERROR, not a quiet fallback to defaults.
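
A sketch of rules 1, 3, and 4 in a scheduler-invoked function, assuming hypothetical helper functions and event name; rule 2 would additionally require a matching alert policy in infra/monitoring.tf filtering on jsonPayload.event:

```typescript
import { onRequest } from 'firebase-functions/v2/https';

// Hypothetical helpers; signatures are illustrative.
declare function loadDispatchConfig(): Promise<{ token: string }>;
declare function dispatchPosts(config: { token: string }): Promise<void>;

export const linkedInDispatch = onRequest(async (_req, res) => {
  let config: { token: string };
  try {
    config = await loadDispatchConfig();
  } catch (err) {
    // Rule 4: a config failure is system-failure tier, not silent degradation.
    // Rule 1: structured JSON with a stable `event` field for Cloud Monitoring
    // to filter on; Cloud Logging parses the JSON line and honors `severity`.
    console.error(JSON.stringify({
      severity: 'ERROR',
      event: 'linkedin_dispatch_config_failure',
      message: err instanceof Error ? err.message : String(err),
    }));
    // Rule 3: non-200 so Cloud Scheduler records the failure and retries,
    // instead of a swallowed error masquerading as success.
    res.status(500).send('config unavailable');
    return;
  }
  await dispatchPosts(config);
  res.status(200).send('ok');
});
```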

Enforcement

Two agents enforce the policy in their respective domains:

  • Beacon (.claude/agents/beacon.md) — frontend. Audits components for swallowed catches, opaque messages, unhandled async, missing COPY keys, and stale error state. Reads copy.json, sentry.ts, and app.config.ts before reporting.
  • Canary (.claude/agents/canary.md) — backend. Audits Cloud Functions for unstructured error logging, missing Cloud Monitoring alert coverage, silent scheduler returns, and catch blocks that log without emitting a structured event. Reads functions/src/, infra/monitoring.tf, and firestore.rules before reporting.

Both agents flag and recommend — neither implements fixes. The developer builds the fix; the agent verifies it.

Relationship to issue #33 (notification/toast convention)

This ADR establishes the policy: “user-blocking errors must reach the user.” Issue #33 establishes the mechanism: a NotificationService or toast convention that provides the consistent UI surface. Until #33 lands, error surfacing remains per-component. That’s acceptable but fragile — the ADR applies regardless of whether a shared toast exists.

Consequences

Positive:

  • New code has a clear contract: every catch block answers two questions.
  • Beacon and canary provide automated enforcement during code review.
  • Cloud Monitoring coverage is explicitly tied to structured log events, not ad-hoc.
  • The severity tier table gives a shared vocabulary for triage.

Negative:

  • Existing code has gaps. This ADR does not retroactively fix them — it provides the standard for new code and a backlog for cleanup.
  • The “every structured event needs a monitoring alert” rule adds a Terraform step to backend error work. This is intentional friction.
  • Two agents means two audit passes. This is the cost of domain separation — frontend and backend error surfaces live in fundamentally different frameworks with different tools.

Neutral:

  • The no-toast-service gap (#33) is acknowledged but not blocked on. The ADR is valid with or without a shared notification component.