On a mission to level up Engineering
In software engineering, measuring performance is crucial to understanding how effectively teams deliver value and address customer challenges. To evaluate engineering efforts, it’s important to establish meaningful metrics that go beyond completing workflow tasks. This article explores three essential metrics (speed, accuracy, and recovery) that provide valuable insight into engineering performance. By understanding and tracking these metrics, organizations can get a comprehensive view of their engineering capabilities and drive continuous improvement.
First, we must agree on what Engineering is: a tool that takes customer challenges and produces and maintains solutions for those challenges. This tool usually prioritizes work through a roadmap that both Engineering and Product contribute to, but this is just how the tool works, not why it exists.
I’ve seen my fair share of Engineers who take on a ticket, write some code, push it, and then move the ticket to the next step in the workflow, thinking their work is finished. There is nothing inherently wrong with working like this, but it does have some downsides:
- The next step in the workflow (usually Quality Assurance) might bounce the task back.
- The deployment might not go as expected.
- The code might not work under a production load.
- The product team might have expected something else and bounced the task back.
- The product team might need to iterate over this functionality to resolve a customer challenge fully.
A task should be considered done only when it’s in front of a customer and solves the challenge it was intended to solve.
Now that we have set the scene, the first metric becomes apparent: cycle time. Cycle time shows how long it takes to get a task from in progress to a solution that is in front of the customer and solves the customer's pain point. This measurement usually encompasses the work of multiple people or teams:
- How long does it take to write the code — Engineers
- How long does it take to test — Quality Assurance / Product
- How long does it take to deploy — Infrastructure
Knowing the cycle time allows you to drill into it, see where most of the time was spent, and optimize accordingly.
A short cycle time allows Engineering to produce value quickly and often. It shows the speed of Engineering.
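As a rough illustration of drilling into cycle time, here is a minimal Python sketch. The `Task` record and its four timestamps are hypothetical; in practice they would come from whatever your ticketing and deployment tools record. It also treats "live" as a stand-in for "in front of the customer", which does not by itself prove the customer's challenge is solved.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Task:
    # Hypothetical timestamps; map them to whatever your tooling records.
    started: datetime     # task moved to "in progress"
    code_done: datetime   # code handed over for testing
    tested: datetime      # testing signed off
    live: datetime        # solution is in front of the customer


def hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def cycle_time_hours(task: Task) -> float:
    """Total time from starting the task to it being live."""
    return hours(task.started, task.live)


def stage_hours(task: Task) -> dict[str, float]:
    """Split one task's cycle time into writing, testing, and deploying."""
    return {
        "code": hours(task.started, task.code_done),
        "test": hours(task.code_done, task.tested),
        "deploy": hours(task.tested, task.live),
    }


def slowest_stage(tasks: list[Task]) -> str:
    """Stage that consumed the most total time across all tasks."""
    totals: dict[str, float] = {}
    for task in tasks:
        for stage, spent in stage_hours(task).items():
            totals[stage] = totals.get(stage, 0.0) + spent
    return max(totals, key=totals.get)


# Made-up sample data, purely for illustration.
tasks = [
    Task(datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 9),
         datetime(2024, 1, 2, 15), datetime(2024, 1, 2, 16)),
    Task(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 13),
         datetime(2024, 1, 4, 13), datetime(2024, 1, 4, 15)),
]
print("median cycle time (h):", median(cycle_time_hours(t) for t in tasks))
print("slowest stage:", slowest_stage(tasks))
```

The median is used here instead of the mean so that a single unusually long task does not dominate the picture; either choice is reasonable depending on what you want to see.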
Once you know how fast you move, you should focus on how accurately you move.
This metric, called the change failure rate, shows how often changes that reach end users require rollbacks, hotfixes, or other remedies. You can use it to measure how well your workflow is set up to get code from a developer's laptop, through all quality checks, and in front of customers. A high change failure rate can mean many things:
- Engineers could use some guidance on the technical side of things.
- Quality assurance should take more aspects into consideration when testing.
- The input given to Engineering from Product might need to be refined.
Knowing this metric shows you how accurately Engineering moves, while drilling down into it helps you optimize your workflow. A low change failure rate helps with speed as well, since fewer changes need to be reworked.
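The calculation itself is a simple ratio: changes that needed a remedy divided by all changes that reached end users. A minimal Python sketch, assuming a hypothetical `Deployment` record with a flag that someone (or some tool) sets when a rollback or hotfix was required:

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; the flag would be set by your own process or tooling.
    name: str
    needed_remedy: bool  # True if a rollback, hotfix, or other remedy followed


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that needed a remedy, as a value from 0.0 to 1.0."""
    if not deployments:
        raise ValueError("no deployments in the measured period")
    failed = sum(1 for d in deployments if d.needed_remedy)
    return failed / len(deployments)


# Made-up sample data, purely for illustration.
deployments = [
    Deployment("release-1", needed_remedy=False),
    Deployment("release-2", needed_remedy=True),
    Deployment("release-3", needed_remedy=False),
    Deployment("release-4", needed_remedy=False),
]
print(f"change failure rate: {change_failure_rate(deployments):.0%}")
```

How you decide that a deployment "needed a remedy" is the hard part and is left as an assumption here; the number is only as good as that classification.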
Something will slip through the cracks regardless of how well your workflows are set up. You need to prepare for this and measure how long it takes from a problem appearing to it being solved. Problems can be an array of things: a new deployment not working, a service or database buckling under load, or that fancy new feature breaking something critical. For this, we use mean time to resolution.
A high mean time to resolution can hint at problems in different areas:
- Your infrastructure might need changes, e.g., supporting blue/green deployment or faster rollbacks.
- Perhaps your monitoring needs to be improved to catch these issues before they become critical — e.g., monitoring the load of services and databases.
- Perhaps your observability (logs, traces, and metrics) needs changes to identify problems better.
- Perhaps your deployment times need to be better aligned with people's working hours.
- Maybe the path Engineers take for hotfixes could be sped up.
A short resolution time shows how well Engineering is equipped to deal with unforeseen circumstances. It also has an added mental benefit — your team knows they can recover from issues quickly and will not cause major frustrations to your users.
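Mean time to resolution is the average of the time between a problem appearing and it being solved. A minimal Python sketch, assuming a hypothetical `Incident` record with two timestamps taken from your monitoring or incident tracker:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; timestamps would come from your own incident tracking.
    appeared: datetime  # when the problem first appeared
    resolved: datetime  # when the problem was solved


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average duration from a problem appearing to it being solved."""
    if not incidents:
        raise ValueError("no incidents in the measured period")
    total = sum((i.resolved - i.appeared for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data, purely for illustration.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
print("mean time to resolution:", mean_time_to_resolution(incidents))
```

Whether `appeared` means "when the problem started" or "when it was detected" changes what the number tells you, so it is worth picking one definition and sticking with it.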
These three metrics — speed, accuracy, and recovery — provide a solid starting point for assessing Engineering performance. While numerous other metrics are available to fine-tune processes, these initial measurements offer valuable insights into the effectiveness of Engineering practices. By focusing on speed, accuracy, and recovery, organizations can lay a foundation for continuous improvement, enabling Engineering teams to deliver optimal value to customers.