Incentivization
— Chapter 7 —
Author’s Note: This chapter is now freely available to the public. All paywalls have been removed.
Going for a promotion in software engineering is difficult work, particularly in technical organizations.
You certainly have to work very hard in order to get a good rating for performance reviews. But just hard work is not enough by itself. An important requirement for promotions and salary raises is a critical contribution to projects with high impact. High impact means the project has to succeed and preferably make the company a lot of money, or at least enable other departments and teams in the company to work better and make more money. (Projects that improve the company infrastructure would count in the latter category.) Critical contribution means you have had a critical role in the success of the project, that the project could not have succeeded without you. If you have had a leadership role that brought together a lot of different cross-functional teams, that is even better, and your contribution is regarded as even more critical.
Sounds kind of reasonable so far, although one can still have some criticisms for this process. If you are a perfectly capable and skilled engineer who ends up on a dead-end project through no fault of your own, your career opportunities would be severely stunted.
In many technical organizations, the criteria for promotions get even more difficult and, in my opinion, a lot more harmful ultimately to the organization’s health.
In a lot of companies, there is yet another important measure for the criticality of your contribution: Your solution needs to have a good amount of complexity. For the software engineers, the more complex and elaborate the software design and implementation is, the higher regard the engineer is held in for performance reviews and promotions.
In some organizations, promotions are decided by promotion-review committees that consist of engineering leaders, managers, and directors from all over the company. These people do not even know the employee whose promotion package they are reviewing. This way the promotions are thought to be conducted in a more impartial way. The promotion package includes written input from the person’s manager and from a number of coworkers they have worked with. All of the reviews have to be stellar for a successful outcome. Everyone needs to mention how critical the employee’s contributions were, what kind of an intricate and complex system they have created, and how great an impact they’ve had on the company. Naturally, it is really difficult to get a promotion under this system. You need to make all your managers and coworkers happy, and you need to convince a faceless committee of managers you will never meet that you are ready for a promotion.
Most companies pride themselves on being “metric driven”. This means the annual performance reviews need to include lots of so-called “objective” metrics that describe how effective a worker the employee is. For software engineers, these metrics might include the lines of code they produce. The number of lines of code each engineer has written could be tracked on a daily basis and used as a metric when evaluating their performance during perf reviews. Some organizations might instead track the number of feature requests implemented or bugs successfully resolved by an engineer.
And unfortunately, there are companies that require their performance reviewers to stack-rank their coworkers from best to worst in terms of how effective an employee each one is.
By the way, everything I mention here is already on the internet, one search query away. None of these should come as a big surprise to anyone.
I believe that such performance review systems have created some really bad incentives among the employees, and caused some real harm to the companies.
For starters, I find stack-ranking of employees to be an awful practice that rots the soul of an organization. Some managers and executives love to install a cut-throat competitive system so that their teams can supposedly become more efficient and work harder/better. In reality, their teams just become more efficient at backstabbing and sabotaging each other. Quality of work can never improve in such a work environment.
Next, let’s look at the perf system that incentivizes the development of complex systems and rewards engineers who write lots of lines of code. As a result of this perf system, highly complex and convoluted systems get developed in such an organization. The systems suffer from over-complexity and over-engineering. Over time, they become harder to understand, harder to maintain, fragile, and bug-prone.
It is not unheard of for some engineers to develop a complex system that just works, get their promotions, and then move on to greener pastures. A few months after getting their promotions (sometimes not even that long), they transfer to another team, leaving the mess behind for someone else to deal with. But karma eventually catches up with them. The teams they transfer to also have projects that were developed in a similar fashion, by engineers with a similar mindset who have already left. Such projects proliferate across the company, which eventually makes it impossible to escape the overwhelming software complexity and over-engineering.
The turnover rates, engineering product quality, perf results, bugs, postmortems, etc. are all consequences of the incentives software organizations create. The management must assess the true impact of their decisions by focusing on the incentives they trigger, not just the goals they want to achieve. If they cannot create the right incentives to reach their goals, then they will waste a lot of time and resources in the process.
A correct incentive system is paramount in an organization, not just for perf and promos, but also for the health of the company. The future of a company depends on how the employees are incentivized to work.
A correct incentive system also helps with the proper motivation of the employees.
The Fallacy of Metric Driven Performance Measurements
How do you measure the performance of a software engineer? What kind of metrics do you use?
That was a trick question. The correct answer is: You cannot use any metrics to measure performance, because any metric can be gamed. The only way to measure performance is to use qualified human judgment. It can only be done by tech leads and managers who are themselves knowledgeable about and experienced in software engineering. This is why it is so very important to hire managers who are actually skilled in software engineering rather than filling up your organization with technically clueless ones.
How did I reach these conclusions? Let me show you my reasoning.
Before you ask how good a software engineer is, you need to ask yourself what exactly the job function of a software engineer is. What is the main job of a software engineer?
The main job of a software engineer is to use their expertise in software engineering to help a business achieve their goals. That’s pretty much it.
A software engineer should be using their expertise to find the best solution they can come up with to help the business achieve their business goals.
And if you’ve read all the things I’ve written so far, you can guess what I mean by “best solution”: It is the simplest and cleanest solution that can meet the business goals. It should not be any more complex than is necessary. It should be easy to understand and maintain. Yet, it should be architecturally flexible enough to be extended if the business goals change. It should be open to further change and modification.
Sometimes the business needs are quite complex, so this might indeed mean that you have to come up with an elaborate solution. You might have to design a complex solution, but it should still be easy to understand and maintain. It should never get out of hand.
But sometimes the needed solution can be quite simple. A few hundred lines of code might be all it takes. The fewer lines of code that an engineer has to write to implement a solution, the better it is.
And sometimes you may even have to remove code. You may have to delete the lines of code that are no longer useful. Doing such a cleanup and removing dead code is one of the best things an engineer can do.
Whether a solution is going to take only a few lines of code or whether some code needs to be deleted are all decisions that require a good amount of software engineering expertise. This kind of skill and expertise is only accumulated with long years of experience. This is what needs to be rewarded. These are the kinds of software engineers that need to be promoted.
However, if you are only promoting the engineers who write lots of lines of code and who develop unnecessarily complex systems, then you are going to end up with over-engineered and over-complicated systems that are prone to failure. Then you will have no one else to blame other than yourself as the management.
During perf reviews, engineers should be judged by their software expertise over any other metric. The only way to do this is with qualified human judgment: By using the decision of those engineering leaders and managers who are skilled and experienced in software engineering themselves.
Metrics and the “data-driven approach” to perf reviews are absolutely overrated.
The Futile Quest to Remove Qualified Human Judgment
It is in the very nature of us engineers to trust the machines more than other humans. Let’s admit it: A lot of us are introverts, and this definitely includes me. We are introverts in the sense that dealing with other people kind of tires us. We would rather spend our time debugging an inexplicable server issue than doing any kind of negotiations with another human being. We can’t really help it. It’s just the way we are wired.
Of course to us, it would be so much better if human judgment and decision making were removed from the equation of performance ratings. It would be so much better if you could just measure some metrics, feed them into an algorithm, and have it spit out how much of a salary increase and stock grant an engineer deserves for that year, and whether or not they should be promoted.
I feel that some companies strive to be metric-driven, because a lot of the engineers working there want the system to be like that. It’s not just the managers and the executives.
It is undeniable that human judgment can be very faulty. It can be full of fallacies, unconscious biases, and cronyism. There is always a chance your manager might have some prejudice against you, and derail your career advancement. It might make sense to a lot of people that the more human judgment can be removed from the equation, the better it would be.
Yes, all of this is absolutely true. There is no way to sugarcoat it. We humans are fallible and imperfect creatures. Our judgment can be horribly skewed sometimes.
However, even in the most hardcore “metric-driven” organization, there is no way to escape human judgment. During the perf review and promotion process, the promotion committees always ask for input from your manager. If your manager doesn’t like you for some reason, they will always find a way to derail your promotion one way or another, believe me. If you are in such a situation, the best and possibly the only way to deal with it is to transfer to another team or to another company. There is usually no other way. As a side note, the best thing a company can do for the welfare and happiness of their employees is to make the internal transfer process as seamless as possible, and not allow the employee’s prior manager to harass them in their new team. (Yes, this might happen unfortunately.)
It is very, very difficult to measure one’s software expertise. Human judgment seems to be all we’ve got. Currently, there is no machine learning algorithm that can accurately measure how good an engineer an employee is. And I don’t expect there to be such an algorithm any time soon. If there were, those machines would be doing all the software design and implementation themselves, and we would be jobless right now anyway. Despite all the recent buzz about generative AI as I’m writing this, I believe our jobs as software engineers to be safe for the time being. (Famous last words maybe? We will see in 5-10 years.) I will try to go into detail about AI-driven software development and its particular fallacies in another chapter.
There is no metric that can be measured reliably, and can give a reliable answer about the proficiency of a software engineer.
That is because any such metric can be gamed.
And the act of gaming these metrics could have very destructive consequences for the organization and its employees.
Gaming the Performance Metrics
There are many different ways to game the various performance metrics.
Before I continue, I must make the disclaimer that I absolutely detest every word that I am about to write here. Just the thought of putting those words down makes my stomach turn. I absolutely do not approve of or condone any of these metric gaming schemes. However, I need to write about them here to get my point across.
For the number of lines written, the easiest way to game it is to move the code around. Just say this package needs to be refactored and it would serve everyone better if it was in a different location in the package folder structure. Specifically, if you are dealing with large packages of code, moving them around is going to do wonders for your line count.
The thing is, it is indeed sometimes very necessary to refactor the packages and move the code around. This is a very needed part of the code cleanup process to decrease the tech debt. Sometimes you really do have to move the code around. Which makes it really difficult to detect whether this metric is being gamed or not.
The biggest obstacle to gaming this metric and a lot of the other metrics is your code reviewer(s). You just have to convince them that this particular change is really necessary, so they will let you submit your changes to the code repository and increase your line count. In reality, it is not very difficult to convince the code reviewers. Don’t forget that they are also engineers who are very busy with their own work. They don’t have much time to ponder on your change request. If they have a certain level of trust in you like they do in all their coworkers, they will be convinced that this code change was necessary, and let you submit it. And as I said before, sometimes these kinds of changes are indeed really necessary. It is very hard to tell the difference.
As a side note, in a culture of “metric-driven development” where everyone starts gaming the metrics, the trust between the coworkers is going to be eroded very fast. The resulting work environment where everyone is suspicious of each other, where everyone is constantly questioning the real necessity of each other’s work is not going to be a pleasant environment to work in. Just saying.
Some companies can try to find smarter ways to measure the metrics. In this particular case, they can measure the number of lines of deleted code, and subtract it from the number of lines of added code. If a lot of code was simply moved around without any brand new code being added, the resulting difference in lines of code is not going to amount to much.
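To make the difference between the two counting schemes concrete, here is a minimal sketch in Python. The `Change` record and the numbers in it are made up for illustration; they are not taken from any real tool or repository. The sketch only assumes that a moved package shows up in a diff as an equal number of deleted and added lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One submitted change, summarized by its diff statistics (hypothetical record)."""
    description: str
    lines_added: int
    lines_deleted: int


def gross_line_count(changes):
    """Naive metric: credit every added line."""
    return sum(c.lines_added for c in changes)


def net_line_count(changes):
    """Adjusted metric: added lines minus deleted lines."""
    return sum(c.lines_added - c.lines_deleted for c in changes)


# Assumption: moving a 2,000-line package appears as 2,000 deletions plus 2,000 additions.
package_move = Change("move package to a new folder", 2000, 2000)
small_fix = Change("fix off-by-one in pagination", 12, 9)

history = [package_move, small_fix]
print(gross_line_count(history))  # 2012
print(net_line_count(history))    # 3
```

Under the naive count, the package move dominates everything else the engineer did. Under the net count, it contributes nothing, which is exactly why the adjusted metric pushes the gaming toward adding brand new code instead.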
In this case, you just have to keep adding lots of code to the code base, fast.
Instead of implementing some code in a shorter and simpler way, you can do it in a longer and more complicated way. You can define extra variables, methods, interfaces, and classes that are not really necessary. You just have to convince your code reviewer that they are indeed necessary. And again, sometimes it is really hard to tell the difference. Under some circumstances, you really do have to implement all these extra methods and classes.
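As a small illustration of this kind of padding, consider the two Python versions below. Both function names and the accumulator classes are invented for this example. The two versions return the same result, yet the second one adds many times more lines to the code base.

```python
from abc import ABC, abstractmethod


def total_price_simple(prices):
    """Direct version: a single expression."""
    return sum(prices)


# Padded version: an interface, a class, and extra variables the task does not need.
class PriceAccumulator(ABC):
    @abstractmethod
    def accumulate(self, price): ...

    @abstractmethod
    def result(self): ...


class RunningTotalAccumulator(PriceAccumulator):
    def __init__(self):
        self._running_total = 0

    def accumulate(self, price):
        updated_total = self._running_total + price
        self._running_total = updated_total

    def result(self):
        return self._running_total


def total_price_padded(prices):
    accumulator = RunningTotalAccumulator()
    for price in prices:
        accumulator.accumulate(price)
    return accumulator.result()


prices = [5, 10, 20]
print(total_price_simple(prices) == total_price_padded(prices))  # True
```

A reviewer looking at the padded version in isolation could easily accept it, because an accumulator interface is a perfectly reasonable design in a system that really does need several accumulation strategies. Whether it is needed here is a judgment call that no line counter can make.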
You can also just write code very fast without giving much consideration to its consequences. Can this software be designed differently in a simpler, cleaner, or more maintainable way? Who cares, just write it. Could this code have any insidious potential bugs that could haunt us later on? Who cares, just write it. As long as your code compiles/builds and passes all the tests, just send it for code review. No need to think too much about it.
And if there are any bugs or issues with the code, then that is even better for you. You will have plenty more opportunities to write even more code as you are fixing those bugs. Since it is your own code, it shouldn’t be too difficult to understand and make the fixes. Besides, when you successfully implement the fixes, you will be hailed as a hero by your team. Never mind that it was you who added the bugs to the codebase in the first place.
For the sake of your teammates, let’s just hope that the bugs show up before you leave the team and transfer to another team or company.
Some companies may realize that measuring lines of code is not really advantageous, and they may want to find other ways to measure their engineers’ productivity. They might want to measure the number of features delivered and the number of bugs successfully resolved, for instance. They might want to measure the mean time between code submissions.
No need to worry. All of those metrics can be gamed as well.
No two feature requests or bugs are created equal. Some features are very easy to implement. Some bugs are very easy to resolve. You just have to make sure you get to work on these things, and leave the more difficult ones to your coworkers.
It is possible, and actually even recommended, to split up difficult feature requests or bugs into smaller and more manageable components, and work on those pieces separately. However, in some cases, you run into a feature request or a bug that is simply difficult to figure out, and that cannot be split up any further. It might be a bug that requires long hours of analysis on server logs, for instance, and it might not be possible to automate this analysis process. In other words, it could be a feature request or a bug that is a huge time sink. In a “metric driven company”, you may want to avoid such things. Let others work on them and waste their time. Meanwhile you can devote your time to bugs and feature requests that are easy to resolve and implement. This is going to improve your metrics for bugs and feature requests delivered and for the mean time between your code submissions. It is even going to increase your line count, if anyone is still measuring it.
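A tiny sketch in Python shows what a ticket count hides. The engineers, tickets, and hours below are all invented for illustration. The hours column is there only to expose what the count leaves out; it is not a proposal for a better metric, since reported hours can be gamed just as easily.

```python
from collections import Counter

# (engineer, ticket, hours actually spent); all values are made up for illustration.
resolved = [
    ("alice", "typo in error message", 1),
    ("alice", "update copyright year", 1),
    ("alice", "rename config flag", 2),
    ("alice", "bump dependency version", 1),
    ("bob", "intermittent data corruption under load", 60),
]

# The metric the company tracks: tickets closed per engineer.
tickets_closed = Counter(engineer for engineer, _, _ in resolved)

# What the metric does not see: how much work each ticket took.
hours_spent = Counter()
for engineer, _, hours in resolved:
    hours_spent[engineer] += hours

print(tickets_closed.most_common())  # [('alice', 4), ('bob', 1)]
print(dict(hours_spent))             # {'alice': 5, 'bob': 60}
```

By the tracked metric, the engineer who picked the four easy tickets looks four times as productive as the one who spent weeks on the bug nobody else wanted to touch.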
If everyone is likewise avoiding those difficult bugs and issues, they might just linger around the codebase for a while. Your manager, who wants to improve the metrics of everyone under them in order to look good to their own managers/directors, might ask your team to just find some workarounds that address these difficult issues, instead of spending the time to address them properly and fix their root causes. In that case, the future development on this codebase is going to take place on a foundation of software that is full of hacks and potentially cryptic workarounds.
Metric-driven performance has the potential to rot the codebase of an organization and fill its software with unmanageable tech debt.
And worse than that, it has the potential to rot the souls of the employees who inevitably realize that it is in their best interest to start gaming all these performance metrics.
Useful Criteria in Measuring Performance
Forget all these metrics that can be measured by automated means in order to determine an engineer’s performance. Are there any other criteria that a manager can use to measure performance?
Sure there are, and they involve all the ideas and practices that I am trying to convey in this book: the quality, cleanliness, maintainability, and scalability of an engineer’s code, how good the code’s test coverage is, the quality of the design documents the engineer comes up with, the quality of their design reviews when they are reviewing others’ design docs, how much they engage in meetings and provide actual good ideas, how challenging a problem they can tackle and solve, and how much they contribute to the innovation and overall well-being of the organization.
As you can see, almost none of these things can be measured with a machine. As I said before, if machines could accurately measure these, then they would be able to do all of these things themselves, and we would all be jobless right now. It is also not possible to game these criteria. There is only one way to increase the quality of one’s code, and that is to write good quality code. Under this system of evaluation, the engineers will want to tackle the challenging software issues that haunt the organization, instead of avoiding them and passing the buck to someone else.
Measuring and evaluating all of these criteria requires qualified human judgment. They can only be assessed by managers and coworkers who are themselves skilled in software engineering, who have worked closely with the employee, and who know their work very well. A faceless committee is never going to cut it.
It would really help if the manager making these decisions is a TLM (Tech Lead & Manager), involved closely with the team projects. If they are not actually doing any coding themselves, they should at least be reviewing the engineers’ code and design docs. They should have no more than 5 people under them. This way they could better keep track of the employees under them and their performance. They could also have some time to do coding and/or code reviews.
The fairer you try to make this performance evaluation process by overcomplicating it with faceless committees, useless metrics, etc., the more unfair it is going to become. Employees are going to start gaming the metrics, and that’s going to have detrimental effects on the organization.
Allegedly, Winston Churchill once said: “Democracy is the worst form of government except for all those other forms that have been tried from time to time.”
We can similarly apply the same logic here: Qualified human judgment with engineering experience is the worst solution for evaluating engineering performance, except for all those other solutions that have been tried from time to time.
In the long term, an engineer’s work will have to speak for itself.
Code Ownership, Autonomy, and Empowerment
Promotions and salary raises are certainly some of the most important ways of incentivizing software engineers. If they are lacking, you will begin to see a lot of engineers leave your organization. But they are not the only ways. Something else needs to be there too, without which there would be no proper job satisfaction and incentivization.
That something is autonomy. Software engineers need to feel empowered in their roles. They need proper levels of autonomy.
And one of the largest contributors to a software engineer’s autonomy is code ownership. The engineer needs to have a proper sense of code ownership. They need to have a certain level of control over their own code.
This is not something that is only beneficial for the engineers. This is also extremely beneficial to the organization.
Importance of Code Ownership
I previously mentioned those extremely large classes that exist in many companies’ codebases. They are thousands of lines long and pretty much impossible to understand. I explained the mechanics of how these classes came to be: The Single Responsibility Principle was violated, there was no refactoring done, and the class was allowed to grow into that monstrosity.
I explained the “how”. Now I am going to explain the “why”. Why did these classes come to be?
Almost all of these classes started out neat and small. Most were actually very well designed at the beginning. When you look at their history in the code repository, you can clearly see that. Yet in time, over the course of many years, they grew into those monstrosities.
Without fail, pretty much all of those classes were the ones without clear ownership. That is the core reason why this degradation happened to them.
Those were the type of classes that were used by the various different modules in the project. You could say they were utility classes that contained various different functionalities. A lot of the team members would make use of the functions present in these classes when they were implementing their own code. And in quite a few cases, these classes were used by engineers from various different teams. However, no one in any of the teams had any clear ownership over these classes.
There would come a time when someone needed some more functionality, then they would just add the code to the class. They would then send the code changes to a random member of the team to be reviewed. Oftentimes, that reviewer was the engineer who happened to be on-call during that week, who did not necessarily have any in-depth knowledge about this class. The reviewer would usually approve the changes. Gradually, over time, the class would grow. Since nobody had any clear ownership over that class, nobody would stop to say “this class has grown so much, maybe we should refactor it”. Even if someone said this by any chance, a TODO comment would be added to the class code and forgotten. After all, nobody was really responsible for the class. And everyone had their own work to focus on, their own promotions and salary raises to pursue.
This is a very well known phenomenon. It even has a name: Tragedy of the Commons.1
Does this phenomenon occur with every utility class in a company codebase? Of course not. There are some utility classes that are used by the entire company, yet they stay clean and well maintained. The main reason is that those particular classes usually have a clear ownership and have people that are taking good care of them. If anyone wants to add some random functionality to any of those classes without any good reason, their change request would certainly be rejected by the owners of the classes.
Unfortunately, the Agile philosophy prefers collective code ownership. Agile extols the virtue of every team member owning every bit of code in the team project’s code repository. The entire team is responsible for the entire codebase. Agile claims that this will result in better overall quality of code.
My observation has been the opposite. When everyone is responsible for something, then no one really is. The tragedy of the commons kicks in very fast, especially for the code that is used by everyone.
Every piece of code needs to have one clear owner who is responsible for taking care of it. That owner needs to be one individual person, not a group or a team. Responsible ownership of software should be rewarded and incentivized in the organization.
Now, there are caveats and limits to what I mean by ownership. I am not saying that only one person is allowed to make changes to that code. In this regard, Agile philosophy and I are in agreement: Every team member should be allowed to make changes to any file as necessary. However, if you make any changes to a file, you need to send your change request to the owner of the file for a code review. Only if the owner determines that the changes are appropriate can they be merged into the code repository. The owner is responsible for doing rigorous code reviews for the change requests. The code reviews should also be done in a timely fashion, within a day or two at most if possible.
If it is the owner that is making changes to their own files, then they are still required to submit their changes for a code review. I believe that the owners of components that interact closely with each other should know about the codebase of each other’s components. This is best done if they are doing code reviews on each other’s codebase on a regular basis. They should even be encouraged to contribute to each other’s codebase from time to time. This way, if a person suddenly decides to leave the team one day, there is still going to be someone else ready to take over the ownership of their code. These practices ensure that there is a proper balance of ownership and sharing of code.
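The two review rules above can be sketched as a small routing function in Python. The ownership table, the names in it, and the `pick_reviewer` function are all hypothetical; this is one possible way to encode the rules under the assumption that each component records a single owner plus one peer who owns a closely related component, not a description of any particular code review tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ownership:
    owner: str  # the one individual responsible for the code
    peer: str   # owner of a closely related component; reviews the owner's own changes


# Hypothetical ownership table keyed by path prefix.
OWNERS = {
    "billing/": Ownership(owner="dana", peer="lee"),
    "invoicing/": Ownership(owner="lee", peer="dana"),
}


def pick_reviewer(path, author):
    """Return who must review a change to `path` made by `author`."""
    # Longest matching prefix wins, so a nested component can have its own owner.
    matches = [prefix for prefix in OWNERS if path.startswith(prefix)]
    if not matches:
        # Code without a recorded owner is exactly the situation to avoid.
        raise LookupError(f"no owner recorded for {path}")
    entry = OWNERS[max(matches, key=len)]
    # Anyone may change the file, but the owner reviews it;
    # the owner's own changes go to the peer instead.
    return entry.peer if author == entry.owner else entry.owner


print(pick_reviewer("billing/tax.py", "sam"))   # dana
print(pick_reviewer("billing/tax.py", "dana"))  # lee
```

Raising an error for unowned paths is a deliberate choice in this sketch: a change to code that nobody owns should stop and force someone to assign an owner, rather than fall through to whoever happens to be on-call.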
I just said that responsible ownership of code should be rewarded and incentivized. Does this mean that if there are a lot of bugs in a software component, then its owner should be penalized during performance reviews? Not necessarily. No two software components or systems are created equal. Some might be more prone to issues than others. For instance, if some software is doing distributed transactions, it might be more prone to failure due to network problems, etc. It might be more challenging to design and develop software for certain subsystems or components. Their owners shouldn’t be penalized for this extra challenge. If anything, they should be rewarded. This is where qualified human judgment comes into play again. During perf reviews, the tech lead and coworkers of the engineer should judge and determine whether the engineer was able to tackle development on such a challenging system, and whether the issues were being successfully dealt with. Again, this is not something that can be measured with a simple metric. Measuring software quality is a challenge that requires human judgment. And that judgment can only come from people with a good amount of software engineering experience.
There is another phenomenon that we should pay attention to. It is named Conway’s Law, after the computer scientist Melvin Conway. It states that “organizations design systems that mirror their own communication structure”.2 Simply put, however your team is structured, that is how your software is going to end up structured. One common example is: If there are 4 engineers on your team working on a compiler, that compiler is going to have 4 stages.
For better or worse, Conway’s Law is what it is. It should not be overlooked. In my opinion, organizations shouldn’t fight it, but instead work with it. However your software system is structured, the team should be structured accordingly. It pays to do some careful thinking and architecting of how a software system is going to be designed at the beginning of the project, and then arrange the team structure and software ownership among the engineers according to the software system architecture. And if the architecture undergoes some (hopefully minor) changes during the iterative development process, then the ownership structure should be altered matching those changes. Software engineers would be understanding of such changes, since dealing with a changing and dynamic software is part of their job description.
Software ownership enables the engineers to be proud of their work. It makes the engineers care more about the software they develop and produce. If the engineers know how the software they own is contributing to the shared goal of the organization, they would have more sense of achievement. They would be a lot more motivated to care for that software.
Conversely, if there is no sense of ownership, the engineers would be under the impression that ultimately nobody really cares for the software that they develop. If anybody can come and freely add some code to a file without the engineer’s approval, and end up increasing the bugs and tech debt in that file, then the engineer would lose all motivation to care for that software or anything else in the company for that matter.
People need a sense of control and autonomy over their own work.
People need a sense of purpose. They want to see that their hard work actually contributes to something and makes a difference somewhere.
Empowerment of Engineers
Software ownership is certainly a huge part of an engineer’s autonomy and empowerment. A software engineer should have ownership over their own code. But that is only the beginning of the engineer’s empowerment.
Engineers should also have a degree of autonomy over the decisions that affect the design and implementation of their software. They should be given autonomy to make decisions over their own product. After all, if they are given the responsibility to develop a product, they should also be given the proper authority to make decisions that affect the development of the product. That is not only appropriate and fair, it is also extremely beneficial for the organization.
Engineers, particularly those above a certain experience and skill level, should even have a degree of power to contribute to the decisions made at the organization level. They should have a say in the decisions that affect them.
Engineers should be sitting at the decision table, not sitting by the wall listening in.
If the software is being developed for an external customer, it might also be a very good idea to have the relevant engineers participate in the meetings with the customer. Handling customer expectations is a crucial part of product management. Engineers can be extremely helpful in managing the customer’s expectations. Only an engineer who is highly involved in the design and development of a project would know about the decisions and tradeoffs that affect the product. This is especially true of an experienced engineer who was previously involved with developing similar products.
As an old saying goes, with great power comes great responsibility. If an engineer is engaging in the meetings with the client and providing viewpoints on the product development planning, that engineer should also have the utmost integrity and honesty in their dealings with the client. They should provide an honest account of all the tradeoffs. And most importantly, they should never make promises to the client that they cannot keep. To give you an idea, it would be highly irresponsible to promise to finish a project in a couple of weeks, when in reality it would probably take a couple of months or even quarters to finish it. This would not only destroy the engineer’s reputation, it would also have an adverse effect on the company’s reputation.
Empowered engineers are more satisfied with their jobs. Only happy and empowered employees are motivated. And only motivated employees innovate and make amazing contributions to their organizations. The empowerment and job satisfaction of the employees improve the organization and drive innovation. If there are no such employees, the organization is not going to have any innovation. Such an organization is going to die out sooner or later.
Motivation is one of the most important things when it comes to work. In my opinion, it is even more important than skill or talent. If the employees are motivated and if they are given enough opportunities, they are going to learn and acquire all the necessary skills on the job anyway. We humans are not static creatures. We always have the capability to learn new skills and change ourselves for the better. The most important driver of this change is motivation.
There is an old quote that I always remember, commonly attributed to the French writer and aviator Antoine de Saint-Exupéry, who also happens to be the author of the book The Little Prince: “If you want to build a ship, don’t drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea.”
Incentivizing the Management
None of the things I mentioned in this chapter or in this book are going to matter if the executives and managers are not on board. They also need to be incentivized.
Of course, the rank-and-file engineers in the company have a lot of power too. They can push back against the management and have an undeniable effect on the policy decisions, even among the highest levels of executives.
However, implementing an incentive mechanism for the executives and managers would be a huge help: a mechanism that keeps them incentivized to value software quality in their company and in their teams’ projects.
I have just such a mechanism in mind.
Software engineers feel the lack of software quality most when they are on-call. In almost all teams with a customer-facing product (and this could be an internal company customer too), there are on-call lists. Each week a member of the team is designated to be on-call. If something bad happens and the server starts failing in a way that could affect multiple customers, the on-call engineer is paged by an automated system. No matter what time it is, even if it is 3 am on a Saturday, the engineer is going to receive a phone call from an automated messaging system that uses a text-to-speech synthesizer to explain the issue briefly and ask for an acknowledgement. The on-call engineer then presses a button on the phone to acknowledge the page, hangs up the phone, opens up their laptop, and gets to business. At this point they can try to quickly debug the issue, see what the problem is, see if there is any quick way to mitigate the issue, and try to get the server running again. They might also update the automated paging system to silence the pages for a certain number of hours, so they won’t be paged again with the same issue while they are working on it. When the on-call engineer gets back to work on a regular business day, they can discuss the root causes of the issue with the rest of the team, and see if they can find a more permanent solution to the problem than a quick mitigating fix.
If no one answers the page, the automated system keeps retrying periodically, such as once or twice every hour. Most on-call lists have a secondary on-call engineer in place. If the first on-call engineer fails to answer the call, the second person is going to get the same page.
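To make the retry-and-escalate behavior concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any particular paging product works: the names Responder and page_until_acknowledged are hypothetical, and each person is modeled as a simple function that says whether they would pick up on a given retry round.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Responder:
    """A person on the on-call list (hypothetical model for illustration)."""
    name: str
    # Returns True if this person acknowledges the page on the given retry round.
    answers: Callable[[int], bool]


def page_until_acknowledged(rotation: List[Responder], max_rounds: int) -> Optional[str]:
    """Page the primary, then the secondary, and repeat once per retry round.

    Returns the name of whoever acknowledged first, or None if nobody
    answered within max_rounds rounds.
    """
    for round_number in range(1, max_rounds + 1):
        for responder in rotation:  # primary first, then secondary
            if responder.answers(round_number):
                return responder.name
    return None


# Example: the primary sleeps through every call, the secondary wakes up
# on the second retry round.
rotation = [
    Responder("primary", lambda round_number: False),
    Responder("secondary", lambda round_number: round_number >= 2),
]
print(page_until_acknowledged(rotation, max_rounds=3))  # prints "secondary"
```

A real system would wait between rounds (the once-or-twice-an-hour retry) instead of looping immediately; the sketch leaves the waiting out so that it stays short and runnable.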
This is pretty much how an on-call system is implemented in a lot of tech companies that host a customer-facing server. If the project has a good level of software quality, the on-call engineer is going to be lucky, and might not even get a single page during their on-call week. However, if the software quality is severely lacking, there are possibly going to be multiple pages on most days (or nights) during the on-call week.
Thus, the software engineers who are on the on-call list bear the brunt of bad software quality firsthand. If you want to know how well a system is designed and implemented, just ask the on-call engineers on that team.
Here is my idea: Why not make the executives and managers feel this too? Why not add them to the on-call process?
We could create a separate on-call list for the executives and managers that are responsible for a particular project. Each week one of them is going to be on-call.
Just like the engineers, they are going to have to respond to the pages at the odd hours of the night. Unlike the engineers, they don’t have to actually open up a laptop and try to debug the issue. We cannot expect a VP or a director of the company to know about the fine details of the server implementation after all. (Even though that wouldn’t be such a bad thing.) All they have to do is to answer the call and acknowledge the page, that’s all.
Here is the crucial part though: The page is going to go to the executive first. Only if they answer the page properly is the actual on-call engineer going to get paged next. If the executive never responds to the page, then the on-call engineer is never going to be bothered.
If the page isn’t answered, the automated system could keep trying again periodically, a couple of times an hour, like in the regular on-call system. There could even be two levels to the executive on-call list: So if the first executive misses the page, a second one gets paged.
However, unless an executive wakes up at 3 am on a Saturday and responds to the page, no on-call engineer is going to get paged. That is the rule.
I can pretty much guarantee that this on-call mechanism is going to make the executives aware of the software quality of their teams’ projects. After a few on-call cycles, they will know whether that quality is lacking, and how badly. If they are aware of this, they might be incentivized to give software quality the priority that it deserves.
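The rule can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: route_page and first_to_acknowledge are hypothetical names, each on-call list is a list of (name, function) pairs, and the function says whether that person would acknowledge the page on a given retry round.

```python
from typing import Callable, List, Optional, Tuple

# (name, function telling whether this person acknowledges on retry round n)
OnCallList = List[Tuple[str, Callable[[int], bool]]]


def first_to_acknowledge(on_call: OnCallList, max_rounds: int) -> Optional[str]:
    """Go through the list once per retry round until somebody acknowledges."""
    for round_number in range(1, max_rounds + 1):
        for name, acknowledges in on_call:
            if acknowledges(round_number):
                return name
    return None


def route_page(
    executives: OnCallList, engineers: OnCallList, max_rounds: int = 3
) -> Tuple[Optional[str], Optional[str]]:
    """Page the executives first; page the engineers only after an executive answers.

    Returns (executive who acknowledged, engineer who acknowledged).
    """
    executive = first_to_acknowledge(executives, max_rounds)
    if executive is None:
        # The rule: no executive answered, so no engineer is ever bothered.
        return None, None
    return executive, first_to_acknowledge(engineers, max_rounds)
```

With a first executive who never picks up and a second one who does, route_page returns the second executive together with the engineer who took the page. If neither executive answers, it returns (None, None) and the engineers’ phones never ring.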
Now, I can already hear some of you readers yell at me: “You are such a hypocrite! You told us not to use any metrics, yet the on-call pages are a metric.”
First of all, it is true, I could be a bit inconsistent sometimes. After all, I am only a fallible human being, like all other humans.
But second of all, I am not saying we should use this metric to determine perf review scores, salary raises, or promotions. Not at all. That would probably be a very bad idea. The postmortems for server failures are supposed to be blameless. Measuring this metric for some particular people during perf reviews would essentially equate to assigning the blame for the server failures to those people. Also, some servers could be more prone to failures by their very nature, if they are implementing complex distributed transactions, or if they depend on other third-party services which themselves could fail, for instance. Infrequent server failures and on-call pages could be inevitable in some cases. (Infrequent should be the operative word here. Very, very infrequent.) It would probably be very detrimental to the company if anyone started to game this metric for their performance reviews.
Therefore it would be best to just stick to using this metric to page executives and managers. Nothing more.
I could argue that anyone who has ever asked a software engineer “when are you going to be done with this task?” should be added to this on-call list. Anyone other than an actual customer, that is.
With this kind of incentive mechanism in place, the management is going to get a little taste of how much quality their software has in production. They are going to have some skin in the game, so to speak.
Who knows, they might eventually get tired of getting paged too much, and perhaps even start prioritizing the software quality. Better late than never.
“Tragedy of the commons.” Wikipedia, https://en.wikipedia.org/wiki/Tragedy_of_the_commons. Accessed 21 November 2023.
Conway, Melvin E. “How Do Committees Invent?” Datamation, 1968.


