Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough (venturebeat.com)
<p><i>“When you get a demo and something works 90% of the time, that’s just the first nine.” — </i><a href="https://www.dwarkesh.com/p/andrej-karpathy"><i><u>Andrej Karpathy</u></i></a></p><p>The “<a href="https://www.superagent.sh/blog/the-march-of-nines">March of Nines</a>” frames a common production reality: you can reach the first 90% of reliability with a strong demo, but each additional nine often requires comparable engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.</p><h2>The compounding math behind the March of Nines</h2><p><i>“Every single nine is the same amount of work.” — Andrej Karpathy</i></p><p>Agentic workflows compound failure. A typical enterprise flow might include: intent parsing, context retrieval, planning, one or more tool calls, <a href="https://venturebeat.com/security/when-ai-lies-the-rise-of-alignment-faking-in-autonomous-systems">validation</a>, formatting, and <a href="https://venturebeat.com/orchestration/shadow-mode-drift-alerts-and-audit-logs-inside-the-modern-audit-loop">audit logging</a>. If a workflow has <i>n</i> steps and each step succeeds with probability <i>p</i>, end-to-end success is approximately <i>p^n</i>.</p><p>In a 10-step workflow, even small per-step failure rates compound into a large end-to-end failure rate. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.</p><table><tbody><tr><td><p><b>Per-step success (p)</b></p></td><td><p><b>10-step success (p^10)</b></p></td><td><p><b>Workflow failure rate</b></p></td><td><p><b>At 10 workflows/day</b></p></td><td><p><b>What this means in practice</b></p></td></tr><tr><td><p>90.00%</p></td><td><p>34.87%</p></td><td><p>65.13%</p></td><td><p>~6.5 interruptions/day</p></td><td><p><b>Prototype territory. 
Most workflows get interrupted.</b></p></td></tr><tr><td><p>99.00%</p></td><td><p>90.44%</p></td><td><p>9.56%</p></td><td><p>~1 every day</p></td><td><p><b>Fine for a demo, but interruptions are still frequent in real use.</b></p></td></tr><tr><td><p>99.90%</p></td><td><p>99.00%</p></td><td><p>1.00%</p></td><td><p>~1 every 10 days</p></td><td><p><b>Still feels unreliable because misses remain common.</b></p></td></tr><tr><td><p>99.99%</p></td><td><p>99.90%</p></td><td><p>0.10%</p></td><td><p>~1 every 3.3 months</p></td><td><p><b>This is where it starts to feel like dependable enterprise-grade software.</b></p></td></tr></tbody></table><h2>Define reliability as measurable SLOs</h2><p><i>“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — </i><a href="https://singjupost.com/andrej-karpathy-software-is-changing-again/"><i><u>Andrej Karpathy</u></i></a></p><p>Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. 
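</p><p>The table above follows directly from <i>p^n</i>; a few lines of Python (assuming independent steps and the table’s 10 runs per day) reproduce its numbers:</p>

```python
def nines(p: float, n: int = 10, runs_per_day: float = 10.0):
    """End-to-end success, failure rate, and expected days between interruptions."""
    success = p ** n                                # end-to-end success ~ p^n
    failure = 1.0 - success                         # chance one run is interrupted
    days_between = 1.0 / (failure * runs_per_day)   # expected days per interruption
    return success, failure, days_between

for p in (0.90, 0.99, 0.999, 0.9999):
    s, f, d = nines(p)
    print(f"p={p:.2%}  p^10={s:.2%}  fail={f:.2%}  ~1 interruption per {d:.1f} days")
```

<p>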
Start with a small set of SLIs that describe both model behavior and the surrounding system:</p><ul><li><p>Workflow completion rate (success or explicit escalation).</p></li><li><p>Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.</p></li><li><p>Schema-valid output rate for every structured response (JSON/arguments).</p></li><li><p>Policy compliance rate (PII, secrets, and security constraints).</p></li><li><p>p95 end-to-end latency and cost per workflow.</p></li><li><p>Fallback rate (safer model, cached data, or human review).</p></li></ul><p>Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.</p><h2>Nine levers that reliably add nines</h2><h4>1) Constrain autonomy with an explicit workflow graph</h4><p>Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.</p><ul><li><p>Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate.</p></li><li><p>Persist state with idempotency keys so retries are safe and debuggable.</p></li></ul><h4>2) Enforce contracts at every boundary</h4><p>Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.</p><ul><li><p>Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes.</p></li><li><p>Use enums and canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI).</p></li></ul><h4>3) Layer validators: syntax, semantics, business rules</h4><p>Schema validation catches formatting. 
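</p><p>A minimal sketch of such a gate, using plain Python in place of a schema library; the tool contract and field names are illustrative:</p>

```python
class ValidationError(Exception):
    """Raised when a structured output violates its contract."""

# Illustrative contract for one tool call; in production this would be
# a JSON Schema or protobuf definition, validated server-side.
CONTRACT = {"ticket_id": str, "priority": str, "due": str}
PRIORITIES = {"low", "medium", "high"}  # enum instead of free text

def check_schema(payload: dict) -> dict:
    for field, expected in CONTRACT.items():
        if field not in payload:
            raise ValidationError(f"missing field: {field}")
        if not isinstance(payload[field], expected):
            raise ValidationError(f"{field}: expected {expected.__name__}")
    if payload["priority"] not in PRIORITIES:
        raise ValidationError(f"unknown priority: {payload['priority']!r}")
    return payload  # only validated payloads ever reach a tool
```

<p>An output with <i>"priority": "urgent"</i> is rejected before any write occurs, instead of silently corrupting downstream state.</p><p>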
Semantic and business-rule checks prevent plausible answers that break systems.</p><ul><li><p>Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.</p></li><li><p>Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.</p></li></ul><h4>4) Route by risk using uncertainty signals</h4><p>High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.</p><ul><li><p>Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.</p></li><li><p>Gate risky steps behind stronger models, additional verification, or human approval.</p></li></ul><h4>5) Engineer tool calls like distributed systems</h4><p>Connectors and dependencies often dominate failure rates in agentic systems.</p><ul><li><p>Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.</p></li><li><p>Version tool schemas and validate tool responses to prevent silent breakage when APIs change.</p></li></ul><h4>6) Make retrieval predictable and observable</h4><p>Retrieval quality determines how grounded your application will be. 
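</p><p>Grounding stays measurable only if every query is recorded. A minimal tracker for two such signals, empty-retrieval rate and hit rate on labeled queries (names are illustrative):</p>

```python
class RetrievalStats:
    """Tracks empty-retrieval rate and hit rate on labeled queries."""
    def __init__(self):
        self.queries = self.empty = self.labeled = self.hits = 0

    def record(self, doc_ids, expected_id=None):
        self.queries += 1
        if not doc_ids:
            self.empty += 1          # nothing retrieved: the answer is ungrounded
        if expected_id is not None:  # query has a known-correct document
            self.labeled += 1
            self.hits += int(expected_id in doc_ids)

    def empty_rate(self):
        return self.empty / self.queries if self.queries else 0.0

    def hit_rate(self):
        return self.hits / self.labeled if self.labeled else 0.0
```

<p>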
Treat it like a versioned data product with coverage metrics.</p><ul><li><p>Track empty-retrieval rate, document freshness, and hit rate on labeled queries.</p></li><li><p>Ship index changes behind canaries so regressions surface on a small slice of traffic before full rollout.</p></li><li><p>Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.</p></li></ul><h4>7) Build a production evaluation pipeline</h4><p>The later nines depend on finding rare failures quickly and preventing regressions.</p><ul><li><p>Maintain an incident-driven golden set from production traffic and run it on every change.</p></li><li><p>Run shadow mode and A/B canaries with automatic rollback on SLI regressions.</p></li></ul><h4>8) Invest in observability and operational response</h4><p>Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.</p><ul><li><p>Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy.</p></li><li><p>Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation.</p></li></ul><h4>9) Ship an autonomy slider with deterministic fallbacks</h4><p>Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. 
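</p><p>A sketch of such a dial; the levels and return values are illustrative, not a specific library’s API:</p>

```python
from enum import Enum

class Autonomy(Enum):
    READ_ONLY = 0       # safe default: no writes at all
    CONFIRM_WRITES = 1  # writes pause for explicit approval
    AUTONOMOUS = 2      # writes allowed within policy

def gate(is_write: bool, level: Autonomy, approved: bool = False) -> str:
    """Decide whether one action runs, waits for approval, or is blocked."""
    if not is_write:
        return "run"                 # reads are reversible, always allowed
    if level is Autonomy.READ_ONLY:
        return "blocked"
    if level is Autonomy.CONFIRM_WRITES and not approved:
        return "needs_approval"
    return "run"
```

<p>During an incident, flipping a tenant back to <i>READ_ONLY</i> becomes a configuration change rather than a redeploy.</p><p>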
Treat <a href="https://venturebeat.com/orchestration/vibe-coding-with-overeager-ai-lessons-learned-from-treating-google-ai-studio">autonomy</a> as a knob, not a switch, and make the safe path the default.</p><ul><li><p>Default to read-only or reversible actions; require explicit confirmation (or approval workflows) for writes and irreversible operations.</p></li><li><p>Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.</p></li><li><p>Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents.</p></li><li><p>Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key.</p></li></ul><h2>Implementation sketch: a bounded step wrapper</h2><p>A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.</p><pre><code>def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all retries under one span
    span = start_span(name)
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can't stall the workflow
            with deadline(timeout_s):
                out = attempt_fn()
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry once in "safer" mode (lower temp / stricter
            # prompt) and re-validate before trusting the result
            span.log({"attempt": attempt, "err": str(e)})
            try:
                out = attempt_fn(mode="safer")
                validate_fn(out)
                metric("step_success", name, attempt=attempt)
                return out
            except (ValidationError, TimeoutError, UpstreamError):
                pass  # counts as a failed attempt; loop again
    # fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
</code></pre><p>(Pseudocode: <i>start_span</i>, <i>deadline</i>, <i>metric</i>, and <i>jittered_backoff</i> stand in for your tracing, timeout, metrics, and retry helpers.)</p><h2>Why enterprises insist on the later nines</h2><p>Reliability gaps translate into business risk. <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"><u>McKinsey’s 2025 global survey</u></a> reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. 
These outcomes drive demand for stronger measurement, guardrails, and operational controls.</p><h2>Closing checklist</h2><ul><li><p>Pick a top workflow, define its completion SLO, and instrument terminal status codes.</p></li><li><p>Add contracts + validators around every model output and tool input/output.</p></li><li><p>Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).</p></li><li><p>Route high-impact actions through higher-assurance paths (verification or approval).</p></li><li><p>Turn every incident into a regression test in your golden set.</p></li></ul><p>The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.</p><p><a href="https://mungel.com"><i><u>Nikhil Mungel</u></i></a><i> has been building distributed systems and AI teams at SaaS companies for more than 15 years. </i></p>