· craft · ai-agents · specifications · gartner

The next spec I write has to allow the agent to be wrong

When the system you are specifying can be wrong on purpose, the old acceptance-criteria discipline breaks. Six rough spec classes I now use.

Last Monday I sat for a long time staring at the cell on a spec where the normal “expected output” line should go. The feature called an LLM somewhere in the middle of the request path. The team needed something in that cell. I did not have it.

This is becoming a regular Monday.

The people I work with all week are business analysts and systems analysts, and almost none of us have a clean answer for what an acceptance criterion looks like when the system on the other end of the spec can give a slightly wrong answer on purpose. My job is to specify that percentage and what happens when it gets hit. Pretending the percentage is zero stopped working a while ago.

When I learned to program I had an МК-61. 105 bytes of memory, and you typed the program one keystroke at a time, and the keystrokes themselves cost memory - so optimization started before the algorithm did. The thing was deterministic in the boring, total sense. Same inputs, same outputs, forever. A defect was binary - either the trace reproduced or the bug was in your own head.

Twenty-something years later I was at Alfa-Bank writing API contracts, and Roman Oshurkevich taught me how to write a spec a developer could implement without asking a single question - server path, full URL, request example with the actual payload, response example, the empty-string corner case with the exact standard validation error code. I am still genuinely grateful to him for that. It is a real skill and rarer than people think. The recipient had changed - a front-end developer instead of a calculator - but the contract underneath was the same. If I wrote that GET /v1/accounts/{id} returns 404 when the id is unknown, the developer wrote code that returned 404. We argued about edge cases, never about what 404 meant.

Then Axioma, the flower exchange. The first spec I wrote there was twenty pages for a feature smaller than what I had been shipping at the bank in three. Ablyakim, the team lead - solid engineer, I learned a lot from him - walked over to my desk and said “это слишком подробно, не надо так”. I took it personally for about five minutes. He was right. The senior team there had seen those empty-string cases a hundred times in production and would handle them sensibly without me holding their hand. The spec stayed deterministic. I just stopped over-specifying, and the time to draft one dropped by about 30%.

Three different recipients across three decades and the shape of the work underneath was the same. I described a system that, once built, would respond to a given input with one correct answer.

I want to be careful with the next sentence, because I keep writing a lazy version of it.

The lazy version says the new systems are probabilistic and the old ones were deterministic. That is not actually true, and any senior ML engineer would call it out. Recommendation engines, fraud scoring, OCR confidence, ranker models, the medical-imaging classifiers I was reading papers about back around 2014 - production has been shipping probabilistic components for at least a decade. Acceptance criteria for those existed. They were uncomfortable, threshold-based, sometimes statistical, but they existed and I have written them.

What is actually new with agents is something I keep failing to name properly in front of stakeholders. The loss of the explicit execution graph.

With a fraud classifier I still knew which function called which, what features went in, what the threshold was, what the downstream branch did with a “fraud” verdict. With an agentic system the execution path is non-local. The model decides which tool to call, in what order, based on a reasoning trace I cannot fully inspect. The objective itself has semantic ambiguity - “schedule the meeting” can mean five different actions depending on context the model invents in flight. The system has bounded autonomy under incomplete constraints, and the constraints I write are one of several inputs the model weighs against its training, the conversation so far, and whatever the most recent tool call returned. The flowchart is not mine to draw anymore.

That is the new thing. Not probability - I have been living with probability since the M.Video era of pre-agent BA work. The new thing is that I no longer get to draw the execution graph and hand it over.

I have been forced to build a working taxonomy for these specs. Six rough classes, none of them clean yet:

  • Deterministic assertions. Exact output for cases that have an exact answer. These survive. Most agent systems still have a non-agent core that takes structured input and must return structured output, and the old Alfa-Bank discipline applies untouched there.
  • Behavioral constraints. Tool X must be called before tool Y in this situation. The test is a trace assertion against the agent’s tool-call log, not an output match. Order of operations as a first-class spec item.
  • Bounded autonomy. When the agent’s confidence on a routing decision drops below a threshold, it hands off to a human queue rather than guess. The threshold is a number in the spec. The queue size is a release-gate.
  • Governance assertions. Any tool call that mutates state without a logged justification gets rejected by a governance layer before it reaches the downstream system. The agent is allowed to attempt. The system is not allowed to comply blindly.
  • Observability assertions. Every decision the agent makes is traceable end-to-end - which tool, which inputs, which model version, which reasoning trace. Without this the rest of the criteria are unfalsifiable.
  • Degradation contracts. When the model is uncertain, what does the system do. Fall back to a deterministic path. Ask the user. Refuse. The fallback is part of the spec, not an afterthought added after the first production incident.

Most of my actual recent specs hit two or three of these and quietly pretend the rest do not apply. Honestly, I would not put those specs in front of Roman without flinching.

The thing I keep underestimating is how much of this is political rather than technical. Legal wants deterministic accountability and reads “bounded autonomy” as “we accept some rate of being wrong on purpose”, which they do not love. Product wants flexibility, because flexibility is what the demo sells. Engineering wants autonomy, because constrained agents underperform on the benchmarks the engineers get measured against. The AI vendor sells a capability distribution and is not the one signing the SLA. The steering-committee meeting where those four views collide is where my next decade of spec-writing actually happens, and my Alfa-era template does not help me run that meeting.

Now to the Gartner number, because I probably would not be writing this on a Wednesday afternoon if it had not landed in my feed on Monday. Gartner says global AI spend hits $2.59 trillion in 2026, up 47% year-over-year, and John-David Lovelock, the Distinguished VP Analyst on the release, calls this the inflection year (source). Over 45% of that spend goes into AI infrastructure. AI-optimized servers triple over the next five years. The models layer alone grows 110% and adds $6 billion in spending this year (source).

The piece of that I cannot stop thinking about is not “AI is big now”. It is that the budget for systems whose acceptance criteria most of us do not yet know how to write is forty-seven percent bigger this year than last year. The chips and racks will get bought either way. Whether the BRDs and the SADs catch up before the systems hit production is the open question, and the answer is not going to come from the vendor or the model. The steering committee will assume I had the answer six months ago.

I spent two decades specifying systems that were supposed to be correct. What I am specifying now is a system that is useful in aggregate, wrong a knowable fraction of the time, and accountable for what it does when it is wrong. I kept wanting to call that the old discipline with new templates bolted on. After this Gartner read I do not think that anymore. It is a different discipline, and most of us are still learning the first version of it in production.

How are you handling this in your specs? When there is an LLM somewhere in the request path, what does the acceptance criterion look like in your team?

← All articles