Most sprint forecasts miss because they’re built on the wrong number. The team computes an average cycle time, multiplies by the number of stories, calls it a date. Half the time the date is late. Sometimes very late.
The fix isn’t a fancier method. It’s a different statistic. Forecast at p85 — the 85th percentile of historical cycle time — and your dates start landing.
This is the practical version: what p85 actually means, why the average lies, the rough math, the message you give stakeholders, and the four anti-patterns teams hit when they try to switch.
Why the average lies
Cycle times almost always have a long tail. A few stories take 2x, 5x, 10x the median because of blocked dependencies, hidden complexity, sick days, scope ambiguity. Those stories don’t change the median much, but they wreck individual dates.
The mean tells you what an average ticket looks like in aggregate. It says nothing about what this ticket will do.
A team with median 3 days and a few 20-day outliers might have:
- Median (p50): 3 days
- Mean: 5 days
- p85: 8 days
- p95: 18 days
If you forecast at the mean and ship 12 stories, the math says ~60 days. But 15% of stories historically take more than 8 days. With 12 stories, the chance that zero of them hit the tail is 0.85^12, roughly 14%. So 86% of the time, at least one ticket pulls the date out.
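That 14% is easy to verify, assuming (simplistically) that tickets land in the tail independently of each other:

```python
# Probability that all 12 tickets finish at or under p85,
# given each has an 85% chance of doing so independently.
p_none_in_tail = 0.85 ** 12
print(f"no ticket in the tail:    {p_none_in_tail:.2f}")      # ~0.14
print(f"at least one in the tail: {1 - p_none_in_tail:.2f}")  # ~0.86
```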
Forecasting at p85 builds the tail into the budget. Forecasting at the mean pretends the tail isn’t there.
What p85 means
p85 is the value that 85% of your historical cycle times fall under. If your p85 is 8 days, then 85 out of 100 past tickets finished in 8 days or fewer.
It’s not a target. It’s a description of how your team actually delivers right now.
You can pick a different percentile. Some teams use p70 for early-stage work where speed matters more than predictability. Some use p95 for regulated work where missed dates have legal cost. p85 is the default because it captures most of the tail without being so conservative that the forecast is useless.
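Computing these takes a few lines once you have the history. A minimal sketch, with invented cycle times chosen so the stats match the example above:

```python
import numpy as np

# Invented history: cycle times (days) for the last 30 shipped tickets.
cycle_times = [2, 3, 3, 1, 4, 3, 2, 5, 3, 8, 2, 3, 4, 3, 20,
               3, 2, 18, 3, 4, 2, 3, 8, 3, 5, 2, 18, 3, 4, 8]

print(f"mean: {np.mean(cycle_times):.1f} days")
for p in (50, 85, 95):
    print(f"p{p}: {np.percentile(cycle_times, p):.0f} days")
```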
The rough math
Forecasting one ticket: the date is “started date + p85 days”. 85% of the time you’re right or early.
Forecasting a sprint of N tickets: it's not just N × p85, because stacking the tail onto every ticket over-budgets it (it's unlikely all of them hit the tail). Two practical approaches:
Simple version: median × N, plus one or two tail allowances of (p85 - median) for safety. For 12 tickets at median 3 days and p85 8 days: 12 × 3 + 2 × (8 - 3) = 46 days. This is rough but better than the mean.
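As a function (the names are mine, the formula is the one above):

```python
def simple_sprint_forecast(n_tickets, median, p85, tail_buffers=2):
    """median x N plus one or two tail allowances of (p85 - median)."""
    return n_tickets * median + tail_buffers * (p85 - median)

print(simple_sprint_forecast(12, 3, 8))  # 46
```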
Monte Carlo: simulate the sprint 10,000 times. Each ticket draws a random cycle time from your historical distribution. Take the 85th percentile of the simulated total. This is what /tools/sprint-forecaster does. Use this when stakes are high or you want defensible numbers.
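A minimal sketch of that loop, assuming tickets run one after another and draw independently from the history. It is not the /tools/sprint-forecaster implementation, just the shape of the idea:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_sprint_p85(history, n_tickets, n_sims=10_000):
    """Resample historical cycle times to simulate n_sims sprints,
    then take the 85th percentile of the simulated totals.
    Assumes tickets are independent and worked sequentially."""
    draws = rng.choice(history, size=(n_sims, n_tickets), replace=True)
    return np.percentile(draws.sum(axis=1), 85)

# Reusing the invented 30-ticket history from earlier.
history = [2, 3, 3, 1, 4, 3, 2, 5, 3, 8, 2, 3, 4, 3, 20,
           3, 2, 18, 3, 4, 2, 3, 8, 3, 5, 2, 18, 3, 4, 8]
print(f"p85 sprint total: {simulate_sprint_p85(history, 12):.0f} days")
```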
For most teams, the Monte Carlo number is 5-15% higher than naive median-times-N and 5-20% lower than mean-plus-buffer guesses. It’s the smallest forecast you can defend.
The stakeholder message
Switching from “ETA: end of month” to “p85 ETA: end of month, p50 ETA: mid-month” feels like hedging. It isn’t.
Two messages that work:
“We forecast at p85 — the date by which 85% of similar work has historically completed. The actual date may be earlier; the chance it’s later than this is about 15%.”
“Our planning date is X. The optimistic case is Y. We’re sharing both because the gap between them tells you how predictable this kind of work is for us.”
Stakeholders don’t push back on these because they include both numbers. The mistake is sharing only the optimistic one — which is what the average essentially does.
The conversation that earns trust: explain the gap. A wide p50-to-p85 gap signals work that’s hard to estimate (a research-heavy sprint, lots of unfamiliar code). A narrow gap signals stable, well-understood work. Stakeholders learn to read the gap as a confidence signal.
What to actually measure
Track three numbers per ticket type:
- Cycle time: from when the work started moving (in progress) until it shipped.
- Lead time: from when the work was committed (sprint start, or backlog acceptance) until it shipped.
- Wait time: lead time minus cycle time. How long the work sat before someone touched it.
Cycle time is the team’s signal. Lead time is the customer’s signal. Wait time is the planning signal — if it’s high, you have too much WIP, not too little capacity.
Compute p85 separately by ticket type. A bug-fix p85 of 2 days is not the same data as a feature p85 of 12 days. Forecasting at the wrong p85 for the wrong ticket type produces dates that look defensible but aren’t.
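A sketch of that split, with invented field names and dates. The only real requirement is three timestamps per ticket plus a type label:

```python
from collections import defaultdict
from datetime import date

import numpy as np

# Invented records: (type, committed, started, shipped).
tickets = [
    ("bug",     date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 6)),
    ("feature", date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 14)),
    ("bug",     date(2024, 3, 5), date(2024, 3, 5), date(2024, 3, 7)),
    # ... the rest of your history
]

cycles_by_type = defaultdict(list)
for kind, committed, started, shipped in tickets:
    cycle = (shipped - started).days   # the team's signal
    lead = (shipped - committed).days  # the customer's signal
    wait = lead - cycle                # the planning signal: time spent untouched
    cycles_by_type[kind].append(cycle)

for kind, cycles in sorted(cycles_by_type.items()):
    print(f"{kind}: p85 = {np.percentile(cycles, 85):.1f} days over {len(cycles)} tickets")
```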
The /tools/cycle-time-calculator does this analysis for one stream at a time. Paste the data, get p50/p85/p95 plus a tail-shape diagnosis.
Four anti-patterns when teams switch
1. Recomputing p85 every sprint
p85 is a stable property of how you work. It changes when you change something — WIP limits, story sizes, team composition. Recomputing it from a single sprint’s data adds noise without insight. Use a rolling window of 30+ tickets. Update monthly, not weekly.
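One way to keep the window honest, assuming you store cycle times in ship-date order:

```python
import numpy as np

def rolling_p85(cycle_times_by_ship_date, window=30):
    """p85 over the most recent `window` shipped tickets.
    Returns None until there is enough history to be meaningful."""
    if len(cycle_times_by_ship_date) < window:
        return None
    return np.percentile(cycle_times_by_ship_date[-window:], 85)
```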
2. Forecasting at p85 but committing at p50
Some teams compute p85 forecasts, share them with stakeholders, then commit at the optimistic p50 internally because “we should aim higher”. Result: half the sprints miss, the team is permanently behind, the p85 number gets blamed.
If p85 is the forecast, p85 is the commit. The push for ambition belongs in what you commit to, not the date math. Pick fewer stories at p85 rather than the same stories at p50.
3. Filtering out outliers before computing p85
The instinct is to remove the 60-day ticket because “that was a weird one — the dependency took forever”. But the dependency was real, and there will be other 60-day tickets for other real reasons. Outliers are the data. They’re what makes the tail a tail.
Removing outliers gives you a beautiful p85 that doesn’t survive contact with reality.
4. Reporting only p85
The opposite trap: give stakeholders only the conservative number and watch them anchor on it as the expected date. Then the team beats it consistently and stakeholders learn that p85 = sandbagging.
Always share both p50 and p85. Train stakeholders on the gap. The forecast isn’t a single number — it’s a confidence range.
When p85 isn’t enough
Some sprints have so much novelty that historical data doesn’t predict the future. New product area, new tech stack, first sprint after a reorg. Cycle time data from the old context isn’t useful.
In those cases, name it: “Our historical p85 is 8 days, but this sprint involves work we haven’t done before. We expect the actual numbers to be 30-50% higher. We’ll recalibrate after sprint 2.”
The honesty buys more trust than a precise-looking forecast that misses by a week.
The shift
Switching from mean-based to p85 forecasting feels like a major change. It's actually a small one. Same data, same teams, slightly different number. What changes is the number of forecasts that come true.
Most teams find that within two sprints of the switch, stakeholder pressure drops noticeably. Not because the dates moved later — sometimes they moved earlier — but because the dates started landing.
Forecasts only have value if they’re trusted. p85 is the cheapest path to forecasts stakeholders can rely on.