Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough (venturebeat.com)
<p><i>“When you get a demo and something works 90% of the time, that’s just the first nine.” — </i><a href="https://www.dwarkesh.com/p/andrej-karpathy"><i><u>Andrej Karpathy</u></i></a></p><p>The “<a href="https://www.superagent.sh/blog/the-march-of-nines">March of Nines</a>” frames a common production reality: you can reach the first 90% of reliability with a strong demo, but each additional nine often requires comparable engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.</p><h2>The compounding math behind the March of Nines</h2><p><i>“Every single nine is the same amount of work.” — Andrej Karpathy</i></p><p>Agentic workflows compound failure. A typical enterprise flow might include: intent parsing, context retrieval, planning, one or more tool calls, <a href="https://venturebeat.com/security/when-ai-lies-the-rise-of-alignment-faking-in-autonomous-systems">validation</a>, formatting, and <a href="https://venturebeat.com/orchestration/shadow-mode-drift-alerts-and-audit-logs-inside-the-modern-audit-loop">audit logging</a>. If a workflow has <i>n</i> steps and each step succeeds with probability <i>p</i>, end-to-end success is approximately <i>p^n</i>.</p><p>In a 10-step workflow, even small per-step failure rates compound into a large end-to-end failure rate. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.</p><table><tbody><tr><td><p><b>Per-step success (p)</b></p></td><td><p><b>10-step success (p^10)</b></p></td><td><p><b>Workflow failure rate</b></p></td><td><p><b>At 10 workflows/day</b></p></td><td><p><b>What this means in practice</b></p></td></tr><tr><td><p>90.00%</p></td><td><p>34.87%</p></td><td><p>65.13%</p></td><td><p>~6.5 interruptions/day</p></td><td><p><b>Prototype territory. 
Most workflows get interrupted.</b></p></td></tr><tr><td><p>99.00%</p></td><td><p>90.44%</p></td><td><p>9.56%</p></td><td><p>~1 every day</p></td><td><p><b>Fine for a demo, but interruptions are still frequent in real use.</b></p></td></tr><tr><td><p>99.90%</p></td><td><p>99.00%</p></td><td><p>1.00%</p></td><td><p>~1 every 10 days</p></td><td><p><b>Still feels unreliable because misses remain common.</b></p></td></tr><tr><td><p>99.99%</p></td><td><p>99.90%</p></td><td><p>0.10%</p></td><td><p>~1 every 3.3 months</p></td><td><p><b>This is where it starts to feel like dependable enterprise-grade software.</b></p></td></tr></tbody></table><h2>Define reliability as measurable SLOs</h2><p><i>“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — </i><a href="https://singjupost.com/andrej-karpathy-software-is-changing-again/"><i><u>Andrej Karpathy</u></i></a></p><p>Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. 
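</p><p>The table above follows directly from <i>p^n</i>; a few lines of Python (assuming independent steps and the table’s 10 runs per day) reproduce its numbers:</p>

```python
def nines(p: float, n: int = 10, runs_per_day: float = 10.0):
    """End-to-end success, failure rate, and expected days between interruptions."""
    success = p ** n                                # end-to-end success ~ p^n
    failure = 1.0 - success                         # chance one run is interrupted
    days_between = 1.0 / (failure * runs_per_day)   # expected days per interruption
    return success, failure, days_between

for p in (0.90, 0.99, 0.999, 0.9999):
    s, f, d = nines(p)
    print(f"p={p:.2%}  p^10={s:.2%}  fail={f:.2%}  ~1 interruption per {d:.1f} days")
```

<p>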
Start with a small set of SLIs that describe both model behavior and the surrounding system:</p><ul><li><p>Workflow completion rate (success or explicit escalation).</p></li><li><p>Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.</p></li><li><p>Schema-valid output rate for every structured response (JSON/arguments).</p></li><li><p>Policy compliance rate (PII, secrets, and security constraints).</p></li><li><p>p95 end-to-end latency and cost per workflow.</p></li><li><p>Fallback rate (safer model, cached data, or human review).</p></li></ul><p>Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.</p><h2>Nine levers that reliably add nines</h2><h4>1) Constrain autonomy with an explicit workflow graph</h4><p>Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.</p><ul><li><p>Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate.</p></li><li><p>Persist state with idempotency keys so retries are safe and debuggable.</p></li></ul><h4>2) Enforce contracts at every boundary</h4><p>Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.</p><ul><li><p>Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes.</p></li><li><p>Use enums and canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI).</p></li></ul><h4>3) Layer validators: syntax, semantics, business rules</h4><p>Schema validation catches formatting. 
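</p><p>A minimal sketch of such a gate, using plain Python in place of a schema library; the tool contract and field names are illustrative:</p>

```python
class ValidationError(Exception):
    """Raised when a structured output violates its contract."""

# Illustrative contract for one tool call; in production this would be
# a JSON Schema or protobuf definition, validated server-side.
CONTRACT = {"ticket_id": str, "priority": str, "due": str}
PRIORITIES = {"low", "medium", "high"}  # enum instead of free text

def check_schema(payload: dict) -> dict:
    for field, expected in CONTRACT.items():
        if field not in payload:
            raise ValidationError(f"missing field: {field}")
        if not isinstance(payload[field], expected):
            raise ValidationError(f"{field}: expected {expected.__name__}")
    if payload["priority"] not in PRIORITIES:
        raise ValidationError(f"unknown priority: {payload['priority']!r}")
    return payload  # only validated payloads ever reach a tool
```

<p>An output with <i>"priority": "urgent"</i> is rejected before any write occurs, instead of silently corrupting downstream state.</p><p>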
Semantic and business-rule checks prevent plausible answers that break systems.</p><ul><li><p>Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.</p></li><li><p>Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.</p></li></ul><h4>4) Route by risk using uncertainty signals</h4><p>High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.</p><ul><li><p>Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.</p></li><li><p>Gate risky steps behind stronger models, additional verification, or human approval.</p></li></ul><h4>5) Engineer tool calls like distributed systems</h4><p>Connectors and dependencies often dominate failure rates in agentic systems.</p><ul><li><p>Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.</p></li><li><p>Version tool schemas and validate tool responses to prevent silent breakage when APIs change.</p></li></ul><h4>6) Make retrieval predictable and observable</h4><p>Retrieval quality determines how grounded your application will be. 
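</p><p>Grounding stays measurable only if every query is recorded. A minimal tracker for two such signals, empty-retrieval rate and hit rate on labeled queries (names are illustrative):</p>

```python
class RetrievalStats:
    """Tracks empty-retrieval rate and hit rate on labeled queries."""
    def __init__(self):
        self.queries = self.empty = self.labeled = self.hits = 0

    def record(self, doc_ids, expected_id=None):
        self.queries += 1
        if not doc_ids:
            self.empty += 1          # nothing retrieved: the answer is ungrounded
        if expected_id is not None:  # query has a known-correct document
            self.labeled += 1
            self.hits += int(expected_id in doc_ids)

    def empty_rate(self):
        return self.empty / self.queries if self.queries else 0.0

    def hit_rate(self):
        return self.hits / self.labeled if self.labeled else 0.0
```

<p>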
Treat it like a versioned data product with coverage metrics.</p><ul><li><p>Track empty-retrieval rate, document freshness, and hit rate on labeled queries.</p></li><li><p>Ship index changes behind canaries so regressions surface on a small slice of traffic before full rollout.</p></li><li><p>Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.</p></li></ul><h4>7) Build a production evaluation pipeline</h4><p>The later nines depend on finding rare failures quickly and preventing regressions.</p><ul><li><p>Maintain an incident-driven golden set from production traffic and run it on every change.</p></li><li><p>Run shadow mode and A/B canaries with automatic rollback on SLI regressions.</p></li></ul><h4>8) Invest in observability and operational response</h4><p>Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.</p><ul><li><p>Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy.</p></li><li><p>Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation.</p></li></ul><h4>9) Ship an autonomy slider with deterministic fallbacks</h4><p>Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. 
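</p><p>A sketch of such a dial; the levels and return values are illustrative, not a specific library’s API:</p>

```python
from enum import Enum

class Autonomy(Enum):
    READ_ONLY = 0       # safe default: no writes at all
    CONFIRM_WRITES = 1  # writes pause for explicit approval
    AUTONOMOUS = 2      # writes allowed within policy

def gate(is_write: bool, level: Autonomy, approved: bool = False) -> str:
    """Decide whether one action runs, waits for approval, or is blocked."""
    if not is_write:
        return "run"                 # reads are reversible, always allowed
    if level is Autonomy.READ_ONLY:
        return "blocked"
    if level is Autonomy.CONFIRM_WRITES and not approved:
        return "needs_approval"
    return "run"
```

<p>During an incident, flipping a tenant back to <i>READ_ONLY</i> becomes a configuration change rather than a redeploy.</p><p>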
Treat <a href="https://venturebeat.com/orchestration/vibe-coding-with-overeager-ai-lessons-learned-from-treating-google-ai-studio">autonomy</a> as a knob, not a switch, and make the safe path the default.</p><ul><li><p>Default to read-only or reversible actions; require explicit confirmation (or approval workflows) for writes and irreversible operations.</p></li><li><p>Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.</p></li><li><p>Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents.</p></li><li><p>Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key.</p></li></ul><h2>Implementation sketch: a bounded step wrapper</h2><p>A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.</p><pre><code>def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all retries under one span
    span = start_span(name)
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can't stall the workflow
            with deadline(timeout_s):
                out = attempt_fn()
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry once in "safer" mode (lower temp / stricter
            # prompt) and re-validate before trusting the result
            span.log({"attempt": attempt, "err": str(e)})
            try:
                out = attempt_fn(mode="safer")
                validate_fn(out)
                metric("step_success", name, attempt=attempt)
                return out
            except (ValidationError, TimeoutError, UpstreamError):
                pass  # counts as a failed attempt; loop again
    # fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
</code></pre><p>(Pseudocode: <i>start_span</i>, <i>deadline</i>, <i>metric</i>, and <i>jittered_backoff</i> stand in for your tracing, timeout, metrics, and retry helpers.)</p><h2>Why enterprises insist on the later nines</h2><p>Reliability gaps translate into business risk. <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"><u>McKinsey’s 2025 global survey</u></a> reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. 
These outcomes drive demand for stronger measurement, guardrails, and operational controls.</p><h2>Closing checklist</h2><ul><li><p>Pick a top workflow, define its completion SLO, and instrument terminal status codes.</p></li><li><p>Add contracts + validators around every model output and tool input/output.</p></li><li><p>Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).</p></li><li><p>Route high-impact actions through higher-assurance paths (verification or approval).</p></li><li><p>Turn every incident into a regression test in your golden set.</p></li></ul><p>The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.</p><p><a href="https://mungel.com"><i><u>Nikhil Mungel</u></i></a><i> has been building distributed systems and AI teams at SaaS companies for more than 15 years. </i></p>