Why I am writing acceptance criteria four times for the same agent

Last week I sat down to write what “done” means for a tiny agent in my own side project. The agent reads inputs, suggests a next action, then waits for me to approve or reject. Standard Act-with-Approval shape - I see the suggestion, I press one button, the system either does it or learns it should not have suggested that. Nothing in production, no real users, just me arguing with my own code on a Saturday.

Worth being upfront before this goes any further. I do not have commercial AI-agent integration experience yet. The project I just described is a personal sandbox. So please read what follows as me thinking out loud about a craft problem I am genuinely working on, not as a battle-tested playbook.

What pushed me to write this down was a Gartner press release on 26 May 2026 (source). Shiva Varma, Senior Director Analyst at Gartner, framed the core problem in one sentence: “Enterprises are treating AI agent governance as binary, either locked down or fully trusted, and that is the root cause of failure.” (source) Gartner’s prediction is that by 2027, 40% of enterprises will demote or decommission autonomous AI agents because of governance gaps that only get noticed after a production incident. (source)

What Gartner proposes instead is a ladder of four autonomy levels - Observe, Advise, Act with Approval, Act Autonomously. (source) The thing that hooked me is not the ladder itself but the way the acceptance criteria I write as an analyst have to change shape at every step on it. One uniform governance section copy-pasted across all four levels is exactly the failure mode Gartner is describing, and I think that lands on the BA’s desk first, before anyone in security or compliance gets a vote.

Imagine a single agent that classifies incoming support tickets and proposes an action - route to team X, ask the customer Y, escalate. The same agent can sit at any of the four levels depending on how the business wants to use it.

At the Observe level the agent looks at tickets and writes its suggestion into a log. Nothing happens to the ticket itself. My acceptance criteria here are not about whether the answer is right - they are about visibility. Every suggestion has to be written together with the inputs the model saw, there has to be a stable identifier I can pull six weeks later to reconstruct what the agent thought, and the log writes must not block ticket handling when the model is slow or down. Most of what I write in this section ends up being about traceability and graceful degradation, almost nothing about model quality.
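A minimal Python sketch of what those Observe criteria could look like as code, assuming an in-memory bounded buffer that some separate writer drains to storage. The names `record_suggestion` and `LOG_BUFFER` and the record fields are illustrative assumptions, not something taken from the side project.

```python
import queue
import uuid
from datetime import datetime, timezone

# Hypothetical bounded buffer; a separate writer would drain it to storage.
LOG_BUFFER = queue.Queue(maxsize=1000)


def record_suggestion(ticket_id, inputs, suggestion, buffer=LOG_BUFFER):
    """Build a traceable log record and enqueue it without blocking.

    Returns (record, accepted). accepted is False when the buffer is full,
    in which case the record is dropped instead of delaying the ticket.
    """
    record = {
        "suggestion_id": str(uuid.uuid4()),  # stable id to look up weeks later
        "ticket_id": ticket_id,
        "inputs": dict(inputs),              # copy of what the model saw
        "suggestion": suggestion,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        buffer.put_nowait(record)            # never waits on a slow consumer
        return record, True
    except queue.Full:
        return record, False
```

Dropping the record when the buffer is full is one possible degradation policy. A real spec would have to say whether a lost log line is acceptable or whether it should raise an alert.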

At the Advise level the agent’s suggestion appears next to the ticket in the support console. A human still routes the ticket. The agent just makes the first guess louder. My acceptance criteria shift toward presentation - how the suggestion is shown, how easy it is to dismiss without extra clicks getting in the way, what happens when the model fails to return within the time available, and whether the human can always tell at a glance that this is a suggestion and not a decision. Model quality starts to matter here, but only because a bad suggestion costs a human a few seconds of attention. It still does nothing to the ticket on its own.
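The "model fails to return in time" criterion can be sketched like this in Python, assuming the model call is an ordinary function that can be run on a worker thread. `suggestion_for_console`, the payload fields and the 0.5 second default are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool for model calls.
_executor = ThreadPoolExecutor(max_workers=4)


def suggestion_for_console(classify, ticket, timeout_s=0.5):
    """Return a console payload; never raise, never wait longer than timeout_s."""
    future = _executor.submit(classify, ticket)
    try:
        text = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The model was too slow: show the ticket without a suggestion.
        return {"kind": "suggestion", "status": "timed_out", "text": None}
    except Exception:
        # The model call itself failed: same fallback, different status.
        return {"kind": "suggestion", "status": "failed", "text": None}
    # "kind" is always "suggestion" so the UI can never render it as a decision.
    return {"kind": "suggestion", "status": "ready", "text": text}
```

Note that a timed-out call keeps running on its worker thread in this sketch. Whether that is acceptable, or whether the call must be cancelled, is itself something the criteria would need to state.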

At the Act with Approval level - this is the one I am living in on my side project - the agent prepares the action and I approve. The criteria add a whole layer. The approval queue cannot grow unbounded. Approvals cannot expire silently in a way nobody notices. Both “approve” and “reject” produce an audit trail that says who, when, and ideally why. The agent must not start a second action while the first one is still pending. On my side project the corner case that ate most of my Saturday was simple to describe and not simple to spec: what does the system do when the agent suggests B before A has finished? I queued B behind A and re-ran the suggestion after A landed. In a real product that one paragraph is a separate spec section with its own owner.
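The criteria in that paragraph can be written down as a small Python sketch, assuming a single-process, in-memory queue and caller-supplied timestamps. The `ApprovalQueue` class, its method names and its defaults are hypothetical, and the "B waits behind A" rule is modelled as "no decision while an action is in flight", which is only one reading of it.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ApprovalQueue:
    max_pending: int = 10          # the queue cannot grow unbounded
    ttl_s: float = 3600.0          # proposals older than this expire
    pending: deque = field(default_factory=deque)  # (action, proposed_at)
    in_flight: object = None       # at most one approved action at a time
    audit: list = field(default_factory=list)

    def propose(self, action, now):
        if len(self.pending) >= self.max_pending:
            self._log("rejected_queue_full", action, "system", now, "back-pressure")
            return False
        self.pending.append((action, now))
        return True

    def expire(self, now):
        """Drop stale proposals loudly: every expiry lands in the audit trail."""
        kept = deque()
        for action, proposed_at in self.pending:
            if now - proposed_at > self.ttl_s:
                self._log("expired", action, "system", now, "ttl exceeded")
            else:
                kept.append((action, proposed_at))
        self.pending = kept

    def decide(self, approver, approve, now, reason=""):
        """Decide on the oldest proposal; returns the action to run, or None."""
        if self.in_flight is not None:
            raise RuntimeError("an action is still in flight")
        if not self.pending:
            return None
        action, _ = self.pending.popleft()
        self._log("approved" if approve else "rejected", action, approver, now, reason)
        if approve:
            self.in_flight = action
            return action
        return None

    def complete(self, now):
        self._log("completed", self.in_flight, "system", now, "")
        self.in_flight = None

    def _log(self, event, action, who, when, why):
        self.audit.append(
            {"event": event, "action": action, "who": who, "when": when, "why": why}
        )
```

Each branch maps to one sentence of the criteria: `max_pending` is the bound, `expire` makes expiry visible, `_log` is the who/when/why trail, and `in_flight` keeps a second action from starting early.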

At the Act Autonomously level the agent just does the thing. My acceptance criteria are mostly about boundaries and reversal now. What ranges can it act inside without escalating? What does “an incident” look like, not in the operational sense but in the business sense? How fast can a human stop it, and how cleanly does the system roll back the half-completed action? Shiva Varma’s other line in that release - that when agents act autonomously, “actions are executed at a scale and speed that can outpace human oversight” (source) - is the part I find genuinely scary, because if I have written the boundaries wrong, by the time I notice it is already a Monday morning incident review.
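A rough Python sketch of "boundaries and reversal", assuming each action can be split into steps that carry their own undo, in the spirit of compensating actions. The function `run_autonomously`, the single numeric `limit` and the `KILL_SWITCH` event are hypothetical simplifications. A real boundary is rarely one number.

```python
import threading

# Hypothetical stop flag that a human operator can set at any time.
KILL_SWITCH = threading.Event()


def run_autonomously(steps, amount, limit, kill_switch=KILL_SWITCH):
    """Run (do, undo) steps in order inside a boundary.

    Returns "escalated" if the request is outside the boundary,
    "rolled_back" if it was halted or a step failed, else "completed".
    """
    if amount > limit:
        return "escalated"            # outside the range: hand over to a human
    done = []                         # undo callbacks for the steps that ran
    try:
        for do, undo in steps:
            if kill_switch.is_set():  # checked before every step
                raise RuntimeError("halted by operator")
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):   # compensate in reverse order
            undo()
        return "rolled_back"
    return "completed"
```

The kill switch here is only checked between steps, so "how fast can a human stop it" is bounded by the slowest single step. That is exactly the kind of number the criteria would have to state.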

Four different stacks of criteria for the same business function, with the same prompt going to the same model and the same downstream system on the other end, and the four stacks looked nothing like each other. If I had written one uniform governance section and dropped it across all four, the Observe level would have been comically over-engineered (why do I need approval flows for a logger?) and the Autonomous level would have been silently under-engineered (where is the kill switch in this paragraph?). That mismatch is, near as I can tell, exactly the binary that Shiva Varma is pointing at.

What that walk-through told me is that the autonomy level is the line you write down first. The rest of the spec (model name, prompt, integration shape) lands downstream of that one decision.

A fair pushback I keep hearing in my own head when I write these out. A lot of what I am calling acceptance criteria is really closer to non-functional requirements, operational controls, governance asks. Audit trail, kill switch, approval-queue back-pressure - strictly speaking those live in different sections of a traditional spec, and a strict BA Lead would be right to call that out. I think the split still helps me classify but stops helping me decide once an agent is in the document. When the same page has to talk about how the agent reasons, what happens when it is wrong, who approves it, and how fast a human can stop it, the AC/NFR/governance carving tells me less than the autonomy level does. So the split I actually use now is the ladder. The traditional one is still in the section headers.

There is a second pushback I owe an honest answer to. Maybe four separate spec stacks is the wrong frame and the right one is a single governance matrix where the same controls are listed once and their severity, thresholds, and required gates flip per level. The matrix is more compact on paper, and it stops a company from quadrupling its review surface. The four-stack version is easier to hand to four different review committees, because each committee owns one column. I do not know yet which one wins in practice. On my side project I have neither. I have notes, so this question is still open for me.
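To make the two framings concrete, here is a small Python sketch, assuming a matrix of four made-up controls with made-up settings per level. The control names, the settings and the helper `stack_for` are illustrative assumptions, not Gartner's or anyone's recommended values. It only shows that a per-level stack can be read off as one column of the matrix.

```python
from enum import Enum


class Level(Enum):
    OBSERVE = "Observe"
    ADVISE = "Advise"
    ACT_WITH_APPROVAL = "Act with Approval"
    ACT_AUTONOMOUSLY = "Act Autonomously"


O, A, W, X = (Level.OBSERVE, Level.ADVISE,
              Level.ACT_WITH_APPROVAL, Level.ACT_AUTONOMOUSLY)

# Hypothetical controls and settings: each control is listed once,
# and its requirement flips per level.
MATRIX = {
    "audit_trail":   {O: "required", A: "required", W: "required", X: "required"},
    "approval_gate": {O: "n/a", A: "n/a", W: "required", X: "n/a"},
    "kill_switch":   {O: "n/a", A: "n/a", W: "recommended", X: "required"},
    "rollback":      {O: "n/a", A: "n/a", W: "recommended", X: "required"},
}


def stack_for(level):
    """Project one column of the matrix: the per-level 'stack' view."""
    return {
        control: row[level]
        for control, row in MATRIX.items()
        if row[level] != "n/a"
    }
```

For example, `stack_for(Level.OBSERVE)` keeps only the audit trail, while `stack_for(Level.ACT_AUTONOMOUSLY)` picks up the kill switch and rollback and drops the approval gate. Whether a review committee would rather read the whole matrix or one projected column is the open question from the paragraph above.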

Years ago at Alfa-Bank a front-end lead called Roman Oshurkevich taught me to write specs a developer could implement without asking a single question. Real skill, rarer than people think. The instinct he drilled in was: name the corner cases up front, name the expected behavior under each, never let the developer guess. That instinct still works cleanly for the lower tiers of this ladder. Observe and Advise are basically traditional integration specs with a model in the middle, with the same enumerative discipline and the same “for every input X, the expected output is Y” shape. It starts to bend at Act with Approval and breaks down for me at Act Autonomously. What goes non-deterministic at these upper tiers is not really the model’s output anymore but the timing around it: queueing, partial failure between “I approved” and “the action completed”, the fact that the action can keep happening while I am thinking about the next one. And the opposite trap is just as easy to fall into - my old Axioma team lead Ablyakim once picked up a 20-page spec of mine and said “this is too detailed, don’t do it like this”. He was right then. The risk now is writing those same 20 pages four times instead of once. The right move is the opposite of that: each level gets its own tight, level-specific set of criteria, and most of the words end up in Advise and Act with Approval, where humans and agents share the decision.

There is something else I keep circling. The Saga pattern works as a label - I can say “this is a Saga” and a developer pictures the compensating actions, the local transactions, the eventual-consistency story, the whole shape. We do not yet have a label like that for “this is an Act with Approval-class agent”. I tried a different cut of this problem ten days ago - a six-class taxonomy of agent specs sitting on a different axis from the Gartner ladder - and the exercise mostly confirmed that one BA writing notes on a Saturday is not how shared vocabulary gets built. I think we need a real label. Honestly the BA/SA community is unlikely to be the place it gets invented. Saga did not come from us. Neither did CAP or BPMN. The vocabulary will come from cloud vendors, platform vendors, standards bodies. What we can do, and what I think we should do, is help adapt the patterns as they land, write the first concrete spec templates per level, and refuse to keep shipping one-size-fits-all governance sections in the meantime. I do not have those templates yet. The article I really want to publish next is the one where I have at least the first one drafted.

The next time a product owner says “let us add an AI agent that does X”, the first conversation I am going to have is about which of the four levels we are starting at, and what the trigger would be for moving up the ladder. The model and the prompt come up much later. The autonomy level shapes everything I have to write down after that conversation. And if I am honest with myself, that is also the conversation I have been avoiding on my own side project, because I already know that the moment I push it from Act with Approval to Act Autonomously, I have a lot more to write than I currently have.

A question for the BAs and SAs reading this. When you write down criteria for a feature that touches an AI agent, do you implicitly write at one level - usually whichever level the team is comfortable with - or do you make the autonomy level a first-class decision in the spec itself? I am genuinely curious where people sit on this. Especially if you are already in production, because I am not, and I want to learn from you.