The Agile process is very good at predicting what a given team is likely to deliver in the near future, especially if that team is a few sprints into a project. We come to rely on the practice of story point estimation and establishing a team velocity (i.e., story points completed in a sprint) in order to understand when a feature or release will be completed. With these simple tools we can more reliably answer our business partner’s eternal question: “When are you going to be done?”

But what about when we are called upon to predict what an arbitrary team is likely to accomplish in an unfamiliar problem domain based on a project description of dubious quality? High-functioning Agile shops don’t have this problem because they reduce variability wherever they can. They keep their teams consistent, their problem domain is well understood and their stakeholders support and adhere to Agile values. Many of us are not so fortunate.

As consultants, even when we have a chance to keep our teams together, we will always be changing contexts and exploring new problem domains. The novelty of learning something brand new is the main reason that I choose to be a consultant. A significant drawback is that in order to win this novel work, we must be able to answer the “when will this be done” question along with the “how much will this cost” question before we ever begin and often before we even know what “this” is.

We don’t yet have a system that makes these determinations accurately every time, but we do have some practices that we use to mitigate our risks.

Wideband Delphi/Proxy Metrics

In Software Estimation: Demystifying the Black Art, author Steve McConnell describes the process and efficacy of several different estimation methods. The one that the industry seems to have the most success with is “Wideband Delphi.” This method takes many forms, but they are all based on building consensus around independently developed estimates. The most popular form seems to be Planning Poker, where team members discuss a user story, ask questions of the Product Owner, and then privately select an estimate. All estimates are revealed at the same time and then the highest and lowest estimators discuss their reasoning for the estimate they selected. The team continues to play “hands” until they come to a consensus estimate.
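
One hand of Planning Poker can be sketched in a few lines of Python. This is only an illustration of the reveal-and-discuss step described above: the function name `poker_round`, the participant names, and the rule that consensus means every vote is identical are all assumptions. Real teams often accept "close enough."

```python
from typing import Dict, List, Tuple

def poker_round(votes: Dict[str, int]) -> Tuple[bool, List[str], List[str]]:
    """Evaluate one hand of Planning Poker.

    Returns (consensus_reached, lowest_estimators, highest_estimators).
    Consensus here means every vote is identical (an assumption).
    """
    low, high = min(votes.values()), max(votes.values())
    if low == high:
        return True, [], []
    # These are the people who explain their reasoning before the next hand.
    lowest = sorted(name for name, v in votes.items() if v == low)
    highest = sorted(name for name, v in votes.items() if v == high)
    return False, lowest, highest

# Hypothetical hand: names and votes are made up for illustration.
hand = {"Ana": 3, "Ben": 5, "Caro": 13, "Dev": 5}
done, lowest, highest = poker_round(hand)
print(done, lowest, highest)  # False ['Ana'] ['Caro']
```

Ana and Caro would each explain their thinking, and the team would then play another hand.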

The Wideband Delphi method has several advantages:

  • Reduces the influence of individuals because each person develops their estimate independently.
  • Fosters communication within the team.
  • Surfaces aspects of functionality that individuals might overlook.
  • Reduces deviation between optimistic and pessimistic estimators.

Proxy Metrics often go hand-in-hand with Planning Poker and typically take the form of “t-shirt” sizing or “story points.” People are much better at determining relative size than they are at guessing actual size. We might be misled by our perceptions, but we are generally much more successful identifying the larger of two cups than we ever will be at guessing how many ounces either one holds.

This is where proxy measurements can help. By estimating in relative proxy sizes rather than hours, we avoid a level of precision that we have little real ability to achieve. It is usually trivial to compare one bit of functionality against another, decide which is the larger of the two, and assign it a value relative to the other.

The temptation, especially for traditional project managers, is to translate the proxy estimate into some sort of hour estimate. After all, that is the goal when responding to an RFP or planning a release. However, we suggest that you hold off on that translation for just a bit. This is why some teams eschew numbers altogether and use sizes like Extra Small, Small, Medium, Large, Extra Large (hence “t-shirt sizing”). We prefer to use story points because they directly translate into velocity once the sprints start. Count up the points assigned to the stories completed in a sprint and there’s your velocity. However, there is certainly validity in selecting t-shirt sizing, especially in organizations that are resistant to proxy estimates.
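
The velocity arithmetic mentioned above is simple enough to sketch. The `Story` class and the sample stories are hypothetical, and the sketch assumes that only fully completed stories count toward velocity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Story:
    title: str
    points: int
    done: bool  # True only if the story was completed within the sprint

def sprint_velocity(stories: List[Story]) -> int:
    """Sum the points of the stories completed in one sprint."""
    return sum(s.points for s in stories if s.done)

# Hypothetical sprint: two stories finished, one carried over.
sprint = [
    Story("Login form", 3, True),
    Story("Password reset", 5, True),
    Story("Audit log", 8, False),
]
print(sprint_velocity(sprint))  # 8
```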

Once the proposed backlog has been estimated in story points, we go ahead and ask a small number of our staff to provide independent bottom-up range estimates in hours (best, likely, worst) for the proposed backlog. This provides us another opportunity to discuss the outliers, discover errors in our thinking and, again, come to a consensus for the number we should propose.
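
As a rough sketch of how those range estimates might be tabulated before the discussion, the snippet below totals each estimator's (best, likely, worst) hours and flags anyone whose "likely" total sits far from the median. The 25% tolerance, the names, and the numbers are assumptions for illustration. The consensus itself still comes from conversation, not from the arithmetic.

```python
from statistics import median
from typing import Dict, List, Tuple

Range = Tuple[float, float, float]  # (best, likely, worst) in hours

def backlog_total(estimates: List[Range]) -> Range:
    """Add up one estimator's per-story ranges into a backlog-level range."""
    best = sum(e[0] for e in estimates)
    likely = sum(e[1] for e in estimates)
    worst = sum(e[2] for e in estimates)
    return (best, likely, worst)

def outliers(totals: Dict[str, Range], tolerance: float = 0.25) -> List[str]:
    """Names whose 'likely' total differs from the median by more than tolerance."""
    mid = median(t[1] for t in totals.values())
    return sorted(n for n, t in totals.items() if abs(t[1] - mid) > tolerance * mid)

# Hypothetical estimates for a two-story backlog.
estimates = {
    "Ana": [(8, 12, 20), (16, 24, 40)],
    "Ben": [(10, 14, 24), (20, 26, 44)],
    "Caro": [(12, 30, 60), (30, 50, 90)],
}
totals = {name: backlog_total(rows) for name, rows in estimates.items()}
print(totals["Ana"])     # (24, 36, 60)
print(outliers(totals))  # ['Caro']
```

Here Caro's numbers are the ones worth talking through: she may have spotted a risk the others missed, or she may be working from a different assumption.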

Historical Data

The trend of velocity makes it abundantly clear how much work a team can get done in an iteration for a given problem domain. By the time a team has completed two or three sprints, they are far enough down the cone of uncertainty that they can more reliably predict what they are likely to finish in the upcoming sprint, as long as their circumstances remain relatively constant. If the team is disrupted, the product owner changes, or a new tool is introduced, then that change will almost certainly affect velocity. This is why velocity and story points cannot be transferred between teams on different projects. The biases of each group of people are distinct to a particular working environment. There is no way to translate one team’s 50 story points into another team’s.

This is why estimating projects in the abstract, such as when we are responding to an RFP, is so fraught with error. However, because we are blessed with a fairly stable technical staff, we try to mitigate the errors in this process by using historical project data along with some statistical magic to try to help us triangulate our estimates for work that we bid on.

In Team Foundation Server (TFS), we store an original story point value for each user story work item in a project. We also store an associated task work item that contains the sum of all hours it took us to develop that story. We then calculate the standard deviation of the actual hours for all of the one-point stories, the two-point stories, the three-point stories, etc. The measured spread at each point size then becomes the basis for a Monte Carlo simulator that throws random numbers at each proposed story we are bidding on. Thanks here go to Northwest Cadence for the sweet Excel spreadsheet macro that does the heavy lifting for the simulation.
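
For readers who want to see the idea outside of Excel, here is a minimal Python sketch of that kind of simulation. It is not the spreadsheet macro itself. It assumes that actual hours for each point size are roughly normally distributed around their historical mean, clamps negative draws to zero, and uses made-up history. All function names are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Tuple

def size_stats(history: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Mean and sample standard deviation of actual hours per point size.

    Each point size needs at least two historical stories for stdev().
    """
    return {pts: (mean(hrs), stdev(hrs)) for pts, hrs in history.items()}

def simulate(backlog: List[int], stats: Dict[int, Tuple[float, float]],
             trials: int = 10_000, seed: int = 42) -> List[float]:
    """Return sorted total-hours outcomes, one per simulated project."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = []
    for _ in range(trials):
        total = 0.0
        for pts in backlog:
            mu, sigma = stats[pts]
            # Assumed normal distribution; a story cannot take negative hours.
            total += max(0.0, rng.gauss(mu, sigma))
        totals.append(total)
    return sorted(totals)

def percentile(sorted_totals: List[float], p: float) -> float:
    """Nearest-rank style percentile of an already sorted list."""
    index = min(len(sorted_totals) - 1, int(p / 100 * len(sorted_totals)))
    return sorted_totals[index]

# Hypothetical history: actual hours recorded for 1-, 2- and 3-point stories.
history = {1: [3, 4, 6, 5], 2: [8, 10, 7, 12], 3: [14, 20, 16, 18]}
backlog = [1, 1, 2, 3, 3]  # story point estimates for the proposed work
totals = simulate(backlog, size_stats(history))
for p in (10, 50, 90):
    print(f"P{p}: {percentile(totals, p):.0f} hours")
```

Reading off the 10th, 50th and 90th percentiles gives a range to set beside the bottom-up estimate rather than a single number.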

While I would not want to use this as the sole basis for an estimate, it makes for a good check against the bottom-up estimates that we develop. If there is an order of magnitude difference between them, we can ask some deeper questions about what we might be missing.

The Discovery Period

Winning the new work is only the first step of a long journey. The team still has no empirical evidence to help predict how they are going to perform in this environment. We still don’t know what we don’t know about the project: the hidden business rules, the subject matter experts who are resistant to change, the testing environments that have yet to be provisioned, etc. We all have our own list of surprises that always seems to creep past our well-considered set of binding assumptions.

Whenever possible, we structure our contracts with a discovery period to help understand the new forces of variability and get us further down the cone of uncertainty. During this “sprint zero” period, we work with the product owner to help refine the backlog; we dig into the requirements and get to know the key stakeholders so that we can get a sense for how much chaos we might be dealing with. While these activities are underway, we do everything in our power to actually build something.

The single velocity data point that comes out of the discovery period isn’t much and it will change (usually for the better), but having even this morsel of empirical evidence goes a long way toward knowing whether our velocity is sufficient to get the job done.
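
A back-of-the-envelope version of that sufficiency check might look like the sketch below. The function names and the numbers are hypothetical, and it assumes velocity stays flat, which the paragraph above already warns it will not.

```python
import math

def sprints_needed(remaining_points: int, velocity: float) -> int:
    """Whole sprints required to burn down the backlog at a constant velocity."""
    if velocity <= 0:
        raise ValueError("velocity must be positive")
    return math.ceil(remaining_points / velocity)

def is_on_track(remaining_points: int, velocity: float, sprints_left: int) -> bool:
    """True if the backlog fits in the sprints left at the observed velocity."""
    return sprints_needed(remaining_points, velocity) <= sprints_left

# Hypothetical numbers: 120 points left, 18 points finished in the first sprint.
print(sprints_needed(120, 18))  # 7
print(is_on_track(120, 18, 6))  # False
```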


Once a project is done, it is important to create a feedback loop: as part of a retrospective, take a look at the margin of error between what was predicted and what actually happened. Identify and, if possible, quantify the things that contributed to the project going over the original estimates, or staying under for that matter (but let’s face it, it’s probably going to be the former), and keep a record that the team can refer to in the future.
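
One simple way to quantify that margin of error is as a fraction of the original estimate. This is just one possible definition, and the function name and figures are made up for illustration.

```python
def estimate_error(predicted_hours: float, actual_hours: float) -> float:
    """Signed error relative to the prediction; positive means we ran over."""
    if predicted_hours <= 0:
        raise ValueError("predicted_hours must be positive")
    return (actual_hours - predicted_hours) / predicted_hours

# Hypothetical project: bid at 400 hours, delivered in 520.
print(f"{estimate_error(400, 520):+.0%}")  # +30%
```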

Now it’s time to punch it and get back to work.