Most posts about “AI in agile” are still about workflows — letting Cursor finish a function, having Claude rewrite a docstring. That story is yesterday. The story this year is different: a chunk of your team is now AI agents that pick up tickets, ship code, and close PRs. Not assistants. Teammates, with throughput you can measure.
If you run sprints, this changes the meeting. It changes velocity. It changes what a retro is even for. Most teams are still running their sprint cycle as if every contributor were human, and the seams are starting to show.
This is the practical version: what actually changes about sprint planning, velocity, code review, and retros when half your throughput is non-human, and the four mistakes teams make in the first three sprints with AI teammates.
What an AI teammate actually is
For this post, an AI teammate is a Claude Code / Cursor agent / Devin / Aider session that:
- Picks tickets from a backlog (manually triggered or scheduled)
- Runs tests, opens PRs, addresses review comments
- Closes out small and medium tickets on its own, without a human in the inner loop
It is not the autocomplete assistant in your editor — that’s still useful, but it doesn’t change planning. The teammate is the one that completes a ticket end-to-end while you sleep.
If your team has zero of these yet, the rest of this post is preview. If you have one, you’re already feeling the friction. If you have three or more, you’ve probably realized your sprint board doesn’t model the world anymore.
What changes in sprint planning
Three concrete shifts.
1. You plan two queues, not one.
Human work and AI-teammate work do not interleave well. Humans pick deep, complex, ambiguous work; agents pick small, well-specified, mechanically defined work. If you put both in one sprint backlog, two things happen:
- Humans pick up the small tickets first (“quick wins”) and stall on the deep work.
- Agents try to pick up the deep work and produce nine PRs that are 60% right and need re-doing.
The fix: tag work as human or agent. They go in the same sprint, but they’re separate queues. Plan capacity for each. (See: story splitting, which becomes load-bearing — the splitter decides whether a ticket goes to humans or agents.)
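A minimal sketch of the split, assuming a hypothetical ticket structure with an owner tag set during refinement (field names and ticket keys are made up; your tracker’s will differ):

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    key: str
    title: str
    owner: str  # "human" or "agent" -- decided at refinement, not at pickup

backlog = [
    Ticket("PAY-101", "Redesign retry logic for failed webhooks", owner="human"),
    Ticket("PAY-102", "Rename charge_id to payment_id across modules", owner="agent"),
    Ticket("PAY-103", "Add missing type hints in billing/", owner="agent"),
]

# Same sprint, two queues, planned separately.
human_queue = [t for t in backlog if t.owner == "human"]
agent_queue = [t for t in backlog if t.owner == "agent"]
```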
2. The estimation conversation is different.
For human tickets: the usual relative sizing.
For agent tickets: the question isn’t “how big” — it’s “is the spec good enough that an agent can finish it without a human round-trip.” That’s a yes/no, not a points number. Tickets that fail this check go back to refinement; they don’t go to agents half-specified.
Most teams underestimate the specification cost of agent work. A ticket that takes a senior engineer 2 hours might take 30 minutes of agent runtime, but it needs 45 minutes of spec-writing sharp enough that the agent doesn’t go off the rails, plus a careful review pass at the end. Net human time saved: closer to 45 minutes than 2 hours. Real, but not the 10× win some claim.
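One way to keep that check binary is a short refinement-time gate. This is a sketch with made-up criteria, not a standard list; the point is that any “no” sends the ticket back to refinement rather than to an agent:

```python
def agent_ready(ticket: dict) -> bool:
    """Binary spec-quality gate: every check must pass before an agent can pick the ticket up."""
    checks = [
        bool(ticket.get("acceptance_criteria")),   # testable definition of done
        bool(ticket.get("files_or_modules")),      # scope is named, not "somewhere in the codebase"
        bool(ticket.get("test_command")),          # the agent can verify its own work
        not ticket.get("open_product_questions"),  # nothing a human must decide mid-run
    ]
    return all(checks)
```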
3. Capacity is fuzzy on the agent side.
Human capacity is well-bounded — focus factor × person-hours. Agent capacity is bounded by:
- Spec-writing capacity (how fast the team can produce agent-runnable tickets)
- Review capacity (how fast humans can review agent PRs)
- Compute budget (real money, sometimes the binding constraint)
In practice, review capacity is usually the bottleneck. A team with 4 humans and 3 active agents will typically max out at “the reviews humans can do” — which is well below what the agents could ship. Plan around the bottleneck. (See: sprint capacity calculator.)
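As a back-of-the-envelope model (numbers are illustrative, not from any tool), effective agent capacity is the minimum of the three constraints, and review is usually the one that binds:

```python
# Per-sprint constraints on the agent queue -- all figures invented for illustration.
spec_capacity    = 20  # agent-runnable tickets the team can spec
review_capacity  = 12  # agent PRs humans can review carefully
compute_capacity = 30  # tickets the compute budget covers

# The bottleneck sets the plan, not what the agents could ship.
agent_capacity = min(spec_capacity, review_capacity, compute_capacity)
print(agent_capacity)  # 12 -> plan the agent queue around review throughput
```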
What changes in velocity
Velocity gets weird, in three ways.
Total velocity goes up — sometimes a lot. A team that was shipping 32 points/sprint can hit 50+ once they’ve figured out which tickets agents can take. That’s real, not a chart artifact.
Velocity gets noisier. Agent ticket throughput depends heavily on how well the spec was written, which varies week to week. Expect a bigger sprint-to-sprint swing on agent-side velocity than humans ever produced. Don’t react to a single noisy sprint.
Comparing this team to its old self stops working. Your historical velocity from before agent teammates isn’t comparable to your current number. Reset the baseline explicitly when you cross over. Don’t tell stakeholders “we’re 60% faster now” without naming the substrate change — they’ll plan around the new number, but they should know it includes a non-human team that costs money to run.
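If your tool can’t split velocity by owner type yet, a rough cut over exported tickets is enough to report the two numbers separately and mark the new baseline (field names here are hypothetical):

```python
shipped = [
    {"key": "PAY-101", "points": 5, "owner": "human"},
    {"key": "PAY-102", "points": 2, "owner": "agent"},
    {"key": "PAY-103", "points": 3, "owner": "agent"},
]

human_velocity = sum(t["points"] for t in shipped if t["owner"] == "human")
agent_velocity = sum(t["points"] for t in shipped if t["owner"] == "agent")
# Report both; the combined figure starts a new baseline, it doesn't extend the old one.
print(f"human: {human_velocity}, agent: {agent_velocity}, total: {human_velocity + agent_velocity}")
```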
If your velocity drops after agent teammates joined, the diagnosis playbook is mostly the same as in “Velocity dropped — here’s the actual playbook”, with two extra causes to check:
- Review backlog — agents shipping faster than humans can review.
- Spec drift — early agent tickets had clear specs because the team wrote them carefully; recent ones got sloppy because “agents can figure it out.” They can’t.
What changes in code review
Three new failure modes.
1. Reviewer fatigue from too many similar PRs.
Five agent PRs with similar shape (“rename foo to bar across the codebase,” each in a different module) are exhausting to review carefully. Reviewers either skim and approve, or they batch-reject without specific feedback. Both are bad.
The fix: collapse mechanically-similar PRs into one before review. If your agent shipped five rename PRs, ask it to consolidate them. Review one PR carefully, not five lazily.
2. Subtle correctness bugs in confidently-written code.
Agent-written code looks polished — clean variable names, full type hints, consistent style. The bugs are usually in places humans don’t look: the off-by-one at a boundary, the silent fallback when an API returns null, the test that mocks the thing it should be testing.
The fix: review agent PRs with higher scrutiny than human PRs in the early sprints. Trust calibrates over time as you see what the agent gets wrong. Don’t start at high trust.
3. Review comments that don’t get learned.
A human teammate who gets the same review comment three times changes their behavior. An agent doesn’t, unless you put the lesson into a system prompt or skill file. Without that, you’re paying the same review cost every sprint.
The fix: every recurring review comment becomes a CLAUDE.md / skill / rules file entry. The team’s review knowledge gets compiled into the agent’s context. After 4-5 sprints of this, agent PR quality moves up noticeably.
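As an illustration, a hypothetical fragment of such a rules file (the exact filename and format depend on your agent; the entries are invented):

```markdown
## Review lessons (compiled from sprint retros)

- Never silently swallow null responses from external APIs; raise and let the caller decide.
- Every boundary gets an explicit test: first item, last item, empty input.
- Tests must exercise the real module; never mock the unit under test.
- Mechanical renames ship as one consolidated PR per concern, not one PR per module.
```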
What changes in retrospectives
This is the surprising one. The retro is more important with agent teammates, not less. Two reasons:
The system has more moving parts. With humans-only, a retro mostly looks at people, process, and tools. With agents, you also retro on: spec quality, review throughput, agent failure modes, compute cost, the rules files. The honest action items are usually about the agent setup, not about people.
Agent failures are silent. A human who gets stuck pings the team. An agent that’s stuck either runs in a loop, opens a wrong PR, or quietly stops. None of these show up in standup. They show up in a retro question — “what tickets did we not ship this sprint that we expected to?” — and then you find them.
The retro format that works best for hybrid teams isn’t the standard Start/Stop/Continue. It’s the 4Ls (Liked/Learned/Lacked/Longed-for) — because “Learned” is where the team captures what went wrong with agent runs without making it a person problem, and “Lacked” surfaces the spec/rules gaps that need closing.
The four mistakes teams make in the first three sprints
In rough order of how often we see them:
1. Treating agent capacity as free. It isn’t. Compute costs money, review costs human time, spec-writing costs senior-engineer time. When teams fail to budget for these, agent throughput plateaus and the team feels frustrated with no clear cause.
2. Mixing agent and human tickets in one sprint queue. This causes the quick-win stall and mis-picked deep work described above. Tag and separate. Sub-sprints if you have to.
3. Skipping the spec-quality gate. “We’ll just throw this at the agent” produces a PR you spend 90 minutes fixing. A 30-minute spec-quality conversation upstream saves hours downstream. This is the single highest-ROI discipline change.
4. Not retro-ing the agent setup. Teams retro on people-process-tools and forget the agent-process-rules. The agent setup is half your throughput now. Retro it.
What this means for your sprint board
Most modern sprint tools — including SprintFlint — let you tag tickets, assign them to non-human “agents,” and view burndown filtered by who shipped what. If your tool can’t do that, the board is going to lie to you within a couple of sprints.
What you actually need on the board:
- An “owner” type that includes humans and agents (different colour, same status flow)
- Filters on agent vs human velocity
- Review-queue visibility (because review is now the bottleneck, not write)
- A spec-quality gate before the agent-pickable column
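Sketched as data, the list above amounts to something like this (owner values and column names are illustrative, not any tool’s schema):

```python
from enum import Enum

class Owner(Enum):
    HUMAN = "human"  # planned against focus factor x person-hours
    AGENT = "agent"  # planned against review throughput and compute budget

# One status flow for both owner types. The spec-quality gate is a column,
# not a judgment call at pickup time; agents only pick from "Agent-ready".
COLUMNS = ["Backlog", "Refined", "Agent-ready", "In progress", "In review", "Done"]
AGENT_PICKS_FROM = "Agent-ready"
```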
If your current tracker pretends every contributor is human, you’ll feel it most at planning, when the team tries to forecast and the chart is wrong at the substrate level.
(SprintFlint’s agent integrations — Cursor, Claude Code, Aider, Windsurf, Zed, Continue, Cline, Codex CLI — read tickets directly from the editor and update them as the agent ships. Your sprint board reflects what shipped, regardless of who shipped it.)
The honest summary
AI teammates aren’t the same as faster humans. They’re a different unit of throughput with different capacity constraints, different failure modes, and different review costs. The teams who get the most out of them treat them as teammates with structural needs, not as productivity boosters.
Three changes that usually unlock the next 50% of throughput:
- Tag and separate human and agent tickets at planning.
- Spec-quality gate before agents can pick a ticket.
- Retro the agent setup every sprint, with rules-file updates as action items.
Run those for four sprints. The numbers move.
Related reading:
- AI in agile workflows: 5 ways teams use AI to improve sprints — the workflow-level view.
- Velocity dropped — here’s the actual playbook — covers the agent-specific velocity-drop causes.
- Agile estimation techniques compared — how estimation changes when half the work is agent-driven.
Tools:
- Retro Format Picker — recommends 4Ls for hybrid human/agent teams.
- Sprint Capacity Calculator — bound capacity by review throughput, not just headcount.
- Story Splitter — splits drive whether a ticket goes to humans or agents.