Growing Software Quality

A building over a vast hidden foundation, evoking how software quality compounds beneath the surface.

The larger an organization grows, the more challenging it becomes to maintain a high standard of quality. Features that once worked reliably start to break. Software once known for great design and ease of use starts to look as if it were laid out by a committee. And, finally, new features often fail to meet customer expectations.

Many factors contribute to a lapse in quality, and it's important to properly diagnose what's happening before taking drastic action. Despite what many businesses do, a reflexive change in software development methodology is unlikely to result in a smooth or sustainable improvement in quality. Jumping to a solution usually indicates a lack of understanding of the problem.

Leadership

The most upstream factor in quality is how the executive leadership team speaks and acts. When the people directly responsible for quality receive visible respect from the leadership team, the rest of the organization falls in line. When the executives treat quality as a cost center, blocker, or a low-status occupation, the organization also reads that signal.

When the status of your quality engineers is low (or there is no quality engineer on staff), it usually means that quality doesn't have a seat at the leadership table. When that happens, quality usually doesn't have a direct line of communication to the CEO. A quality engineer's findings can make their bosses look bad, so bad news tends to get filtered on its way up, which is why the CEO is often the last to know about an impending disaster.

I have helped numerous organizations where shipping or selling was the sole focus of executive attention until their customers revolted over poor quality. Those organizations had developed an arrogance or blindness towards quality that led them to ship disappointing releases to production repeatedly.

The quality engineers in those organizations (when they had one on staff) were reduced to simply documenting defects that would be addressed only after enough customers complained. By that point, the defects were often more challenging to isolate and fix. This often led to a degradation in shipping speed, as quality problems compounded over time and eventually spiraled out of control.

These patterns repeat. Three versions I've seen play out:

I remember one product that perpetually seemed two weeks from shipping. The director would hold weekly all-hands status meetings to evaluate and explain what remained to be done. As deadlines passed, the CEO joined the meeting. The focus was always on feature completion, and each major part of the project was seen as a milestone.

However, the underlying issue was an incredibly high defect rate. For every step forward, another defect was identified, and in many cases, created. The team was building on a shifting, unsteady foundation. Instead of fixing the foundation, they were adding new floors because that was what the director and CEO were measuring.

Under that pressure, the team eventually did ship. The shipping criteria had been reduced to basic functionality, resulting in chaos. The customers hated the release. The backlash drove the CEO to issue a public apology, and the director was ultimately fired.

It isn't an uncommon arc. Another business I worked with thought their product was "basically done." The evidence? A demo video produced by the contract firm that built the software. My evaluation quickly uncovered the truth: the software was a hollow shell with no end-to-end functionality. They scrapped and rewrote the software.

In a third example, the business shipped its software to beta users, but it crashed more often than it worked. The engineers responsible threw up their hands in defeat and moved on, forcing a new team to clean up the mess.

Middle Management Pressures

Almost every leader of a software organization puts significant thought into their features and their deadlines. They know their job is to ship new products and features. They also generally recognize that deadlines help to control costs and maintain a competitive velocity.

As with executives, managers and product managers sometimes assume a high level of quality even without their direct attention to the matter. They think of quality as an aspect of thoughtfulness and discipline rather than the result of a process. There is a tendency to blame lapses in quality on the developer who wrote the code, when the real cause is often the priorities the team was given.

There has long been the idea that software organizations can pick any two: features, quality level, and completion date.

True at one level — and the short-term cost is real: tests, CI, and refactor work slow this quarter. But quality compounds into shipping speed across quarters. You can almost always build faster on a high-quality foundation, and the tools that improve quality (tests, CI, code review) are the same tools that compress release cycles. Pick-two isn't wrong, but the three-way tradeoff is less rigid than it looks.

pay the quality tax now · ship faster once it compounds

Low quality means rework, and rework costs time and money. But that's not widely discussed at the manager level or above. At that level, they're thinking more about the product than the process of building it. And so they're concerned about features and deadlines, not the quality guardrails.

And so, when an organization prioritizes shipping over quality, the result is often less shipping. Successful CI/CD (Continuous Integration / Continuous Delivery) organizations ship fast because they've invested heavily in tooling and ops teams to keep the software working — the pipeline itself is a major part of their quality process. Even so, you'll notice that the large "move fast and break things" cultures still have downtime.

Technical debt is a budget category, not an engineering complaint. The team carrying significant debt is slow because every change touches a foundation that wasn't built for the current load — not because they're undisciplined. Treating debt paydown as a first-class line in the engineering budget — say, ten to twenty percent of capacity each sprint — is how scaling organizations avoid the foundation-vs-floors trap I described above. One caveat: sometimes "technical debt" has more to do with comfort or "not invented by me" syndrome; it's important to figure out which.

The other middle management challenge is the budget. Even when the manager recognizes a quality issue, they may not have the right team to fix it. You might have heard an engineer say something like "if you write the cleverest code you're capable of, you won't be clever enough to debug it." When your team builds something intractably complicated, it will not be cheap to fix.

I have had the job of cleaning up these complicated codebases, and it often starts with a big sheet of paper where I map out the mess and plan out a gradual transition to a maintainable design.

Just imagine the situation a manager finds themselves in. You have a large, expensive team of engineers working frantically on a project. Deadlines are being missed. Progress is difficult to show. Eventually, you conclude you need to hire a more experienced engineer (or team of engineers) to untangle this mess. In other words, you want to increase your budget (and when you miss deadlines, you've already gone over budget) by an even larger amount.

If you ask the CEO to hire someone in the deep six-figure range to fix a late project, they won't be happy. How, they'll ask, did it take you so long to see your team wasn't cut out for the work? They might also give you an ultimatum: either make do with the people you have, or let enough people go to pay for the new hire.

The cheapest fix is the one that doesn't require this hire in the first place. Engineers who care about quality reveal themselves in interviews. When you ask them what kind of work they love to do and hate to do, see where they categorize testing and tooling. In their code examples, see what automated tests and logging they included. Did their code run without warnings? Did it handle error cases, or ignore them?

It can be instructive to have candidates walk you through how they tested their code, or to ask them how they would approach fixing a bug that only impacts a small percentage of users.

The best candidates will express interest in software quality and not see it as a chore. You don't want to hire someone who has a habit of plowing through bugs without considering how their work could improve the product's overall quality and provide better insights for future debugging.

I remember working with a brilliant engineer who loved creating beautiful user interfaces, but had little interest in fixing the bugs his code sometimes created. It can be ok to have one person like that if someone always checks their work, but it's a delicate situation. It is easy to send a signal that quality isn't valued in your organization.

What is Quality in Software?

Engineers use the term "code smell" for software they don't fully trust, often before they can name why. This idea of quality as something that isn't seen, touched, or heard reflects the difficulty in measuring and communicating it.

While there are metrics you can collect to indirectly measure quality, they can be gamed like anything else. Numbers don't convey nuance, and the word "quality" means different things to different people and organizations. It's an abstract term that you and your organization need to define concretely. For some businesses, quality revolves around great design rather than functional defects. In others, the product's appearance has little value compared to functional concerns such as output quality or data safety.

Quality also starts before code is written. Product specs that omit edge cases, architecture decisions that defer hard tradeoffs, and design reviews that nobody attends all leak into the build. The cheapest defect to fix is the one prevented in the spec.

In the end, your definition needs to usefully gate software releases and inform the work going into the next release. To get there, you first need to understand what you're aiming for.

Prioritizing this list for your organization and your ideal customers will help you arrive at your own definition of quality:

Correct output
Data integrity
Workflow integrity (i.e., the ability of your customers to rely on the same instructions to get the same output even as you revise the product)
User experience
Performance and speed
Uptime or reliability
Scalability
Security
Fit for purpose (i.e., the product meets customer needs or expectations)
Sales and marketing congruity (i.e., the product matches the promise)

When an organization can't speak clearly about what quality means, it wastes resources on activities that don't matter.

For example, I've worked on projects where key customers complained about broken functionality. Meanwhile, a large amount of development time was spent on design and visual defects. It was clear at the executive level that the functionality took precedence, but the rest of the organization refused to divert resources from the visual design work so that the functionality could be repaired.

If you can't align an organization around these priorities, quality becomes a slippery concept. Write a short document outlining your quality priorities based on your customers' values. At a minimum, make it clear where the line is drawn: what kinds of quality issues are unacceptable.

This document will almost certainly evolve over time as your business evolves. Expect changes, but start training your new and existing hires on it today.

Balance

Software quality doesn't trump revenue. You are running a business after all; releases must happen.

However, when you're losing deals or churning customers because your software doesn't meet expectations, quality and revenue become tied together. Customers only tolerate a certain level of poorly made software. This minimum level of quality depends highly on your industry and competition, but it always exists.

The challenge you'll find with poor quality is that improving it takes time and effort. Software isn't like making cars, where improving how the headlights are installed might have an immediate impact on defect levels. Software is something built in layers, like a skyscraper. If the foundation starts to sink, as with the Millennium Tower in San Francisco, the fix can be costly and difficult to implement. Process improvements can only reduce future software bugs; they do nothing for the defects already built into the layers you have shipped.

So, no matter what your current relationship to software quality is, you should strive to balance quality with the speed and cost of your operation. In short, you need to establish a quality process, quality standards, and a quality budget that keep your product well above the minimum customer expectations.

The same balance applies to where customer expectations are set: in your marketing and sales. Your sales funnel shouldn't write checks your product team can't cash. This is a key reason the CEO must be the defender of the definition of quality: from the customer's perspective, a marketing exaggeration has the same negative impact as a product defect.

Quality Strategies

How can you improve product quality in a large organization?

CEOs Drive Quality

Just as the CEO drives other aspects of business strategy, they must also drive and defend the quality strategy. A company's business strategy almost always depends on matching the customer's definition of quality.

Enterprise customers, for instance, may judge an internet hosting provider by high standards for compliance and uptime. Small businesses might judge hosting by ease of use and low cost. Consumers might judge hosting by the level of engagement their site gets from other consumers.

An organization that doesn't value the quality its customers demand will not adequately respond to their needs. Measures like churn often reflect this mismatch between expected quality and delivered quality.

To keep the gap between customer expectations and product reality closed, an organization can't continually debate whether, say, visual or functional defects have higher priority. The answers to quality questions should be known to everyone in the organization — grey areas aside.

When defects are discovered, the vast majority should be prioritized without more than a moment's thought. The prioritization of most defects should get the same result if answered by the most junior software developer or the CEO themself. This can only happen with training and reinforcement.
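One way to picture "the same result from anyone" is to treat the quality policy as an ordered list and let the ordering decide. The sketch below is only an illustration: the category names and their order are hypothetical, and a real policy would be written in prose with room for grey areas.

```python
# Hypothetical quality priorities, highest first. Your own ordering will differ.
QUALITY_PRIORITIES = [
    "data integrity",
    "correct output",
    "security",
    "uptime",
    "user experience",
]


def defect_priority(category: str) -> int:
    """Return 1 for the most important category; unknown categories rank last."""
    if category in QUALITY_PRIORITIES:
        return QUALITY_PRIORITIES.index(category) + 1
    # Grey area: not covered by the policy, so it needs a human decision.
    return len(QUALITY_PRIORITIES) + 1


def triage(defects):
    """Order (title, category) pairs by the shared policy, not by who is asking."""
    return sorted(defects, key=lambda defect: defect_priority(defect[1]))


backlog = [
    ("Button misaligned on settings page", "user experience"),
    ("Saved report loses rows", "data integrity"),
    ("Logo looks dated", "branding"),
]
for title, category in triage(backlog):
    print(defect_priority(category), title)
```

Because the ordering lives in one shared place, the junior developer and the CEO get the same answer for the common cases, and only the categories the policy doesn't cover need a conversation.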

The quality policy lands in the team's day-to-day work as the Definition of Done: the per-feature acceptance criteria that answer the question "we shipped, but did we actually finish?" Without a clear Definition of Done, the same prioritization debates rehash for every release.

Organizations that don't propagate those priorities from the CEO's office down waste time rehashing and re-prosecuting the same questions over and over again. While it can sound like overkill for the CEO to publicly endorse the company's approach to quality, it's the only way to provide a definitive answer to questions about quality when someone's bonus or promotion is on the line.

The CEO doesn't have to create the quality strategy, but they do have to endorse it and reinforce it with the power of their office. They also need to be able to explain publicly, in plain language, how they think about quality.

For example, if uptime is critical to customers, the CEO might say, "I know that going from 99.9% uptime to 99.8% uptime doesn't sound big. It's measured in hours per year. But our best customers run millions in revenue per hour on our platform. That 0.1% difference in uptime will cost our customers tens of millions of dollars. The work our software and operations teams put into uptime is incredibly valuable because it keeps us competitive, and I'd much rather delay a release than take the hit of an additional hour of downtime this year."
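The arithmetic behind that statement is easy to check. Here is a minimal sketch; the revenue figure is a hypothetical stand-in for "millions in revenue per hour," not a number from any real customer.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def extra_downtime_cost(old_uptime: float, new_uptime: float,
                        revenue_per_hour: float) -> float:
    """Revenue at risk when uptime slips from old_uptime to new_uptime."""
    extra_hours = (downtime_hours_per_year(new_uptime)
                   - downtime_hours_per_year(old_uptime))
    return extra_hours * revenue_per_hour


extra_hours = downtime_hours_per_year(99.8) - downtime_hours_per_year(99.9)
print(f"Extra downtime: {extra_hours:.2f} hours per year")

# Hypothetical customer earning $2 million per hour on the platform.
cost = extra_downtime_cost(99.9, 99.8, revenue_per_hour=2_000_000)
print(f"Revenue at risk: ${cost:,.0f}")
```

A 0.1% slip works out to just under nine additional hours of downtime a year, which is how a number that "doesn't sound big" turns into tens of millions of dollars for a customer at that scale.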

Build Automatic Guardrails

We are fortunate in the software industry that the skills we use to build products can also be used to build automated systems that help drive quality. We're even more fortunate that, because duplicating software costs almost nothing, there are many free and open-source tools available to help us.

The most fundamental of these tools are build servers and CI/CD pipelines. They integrate with your version control system to run tooling automatically, either when new code is pushed or on a schedule. These systems can provide rapid, automatic feedback on every change made to your software. More importantly, these tools can make it impossible to add certain kinds of bugs to your product.

If you don't have computers automatically identifying and rejecting low-quality software changes, your software development process is behind the times.
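To make "automatically identifying and rejecting" concrete, here is a minimal sketch of the kind of check a pipeline step might run. The `ChangeReport` fields and the 80% coverage threshold are assumptions for illustration; a real pipeline would fill them from your own test runner, linter, and coverage tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeReport:
    """Results a pipeline might collect for one proposed change (hypothetical fields)."""
    failed_tests: int
    lint_warnings: int
    coverage_percent: float


def gate_violations(report: ChangeReport, min_coverage: float = 80.0) -> list:
    """Return the reasons a change should be rejected; an empty list means it passes."""
    violations = []
    if report.failed_tests > 0:
        violations.append(f"{report.failed_tests} failing test(s)")
    if report.lint_warnings > 0:
        violations.append(f"{report.lint_warnings} lint warning(s)")
    if report.coverage_percent < min_coverage:
        violations.append(
            f"coverage {report.coverage_percent:.1f}% is below {min_coverage:.1f}%"
        )
    return violations


def exit_code(report: ChangeReport) -> int:
    """CI systems generally treat a nonzero exit code as a failed step."""
    return 1 if gate_violations(report) else 0


risky_change = ChangeReport(failed_tests=2, lint_warnings=0, coverage_percent=71.5)
for reason in gate_violations(risky_change):
    print("REJECTED:", reason)
```

The point of the design is that the rule is written down once and applied to every change, so nobody has to remember to check and nobody has to argue about it under deadline pressure.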

Build Human Guardrails

First, peer code review, often through a pull request, can greatly improve quality if your engineering team's culture genuinely values giving and receiving constructive feedback. A second or third set of eyes can discover problems that might not be readily apparent from the perspective of the person making the change.

For example, it is easy to miss how a software change might call for a change to documentation, or the author might have accidentally duplicated functionality that exists in a part of the project the author wasn't aware of.

Even if no defects are found, peer code review also helps to keep colleagues up to date on how the product is changing.

Second, specifically give the power to gate releases to a quality engineer. If a release doesn't meet the company's definition of quality, they should be able to hold the release until it does (or, in a CI/CD environment, roll back a release).

This isn't a power to be used lightly. Unless the customers are design maniacs, a release shouldn't be stopped because there isn't enough whitespace. When the power is used, the reasoning should be clearly documented and reference your corporate quality document. A meeting should be held, formally or informally, to ensure that forward progress is being made on the release: either the product team fixes the release, or the quality engineer re-evaluates the impact of their observation.

The release gate isn't meant to stop every defect. Making two steps forward and one step back on a release is fine if that step back doesn't significantly impact customers or reputation. Along with your quality policy, your gate's operation must be calibrated over time. You need to strike a reasonable balance between shipping product and shipping perfection.

Build Monitoring Systems

There is a space between predicting quality issues and reacting to customer reports. Modern production observability — error tracking, real-time logging, request tracing, real-time alerting — gives the organization a view into quality lapses before they reach customer support.

Organizations serious about quality treat observability as part of the product, not an afterthought in instrumentation. For instance, I recommend that all error cases be logged, even errors you don't expect to see in production. As a system scales, the unlikely becomes inevitable. When the unexpected happens, you want to be able to find evidence.
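As a small illustration of logging the cases you don't expect, here is a sketch using Python's standard `logging` module. The order statuses and the `next_status` function are hypothetical; the habit it shows is giving the "can't happen" branch a log line instead of silence.

```python
import logging

logger = logging.getLogger("orders")

# Hypothetical order lifecycle used only for this example.
TRANSITIONS = {
    "pending": "paid",
    "paid": "shipped",
    "shipped": "delivered",
}


def next_status(current: str) -> str:
    """Advance an order to its next status, logging anything unexpected."""
    if current in TRANSITIONS:
        return TRANSITIONS[current]
    # This branch "should never run" in production. Log it anyway, so that
    # when it eventually does run at scale, there is evidence to find.
    logger.error("unexpected order status %r; leaving it unchanged", current)
    return current


print(next_status("paid"))
print(next_status("refunded-twice"))
```

With no other configuration, Python sends the error message to standard error; in production you would route the same logger to your error-tracking or log aggregation system.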

A vendor once demoed a simplified way to implement a common design pattern. They released example code, so I used it the next time I had the chance. It seemed easier to understand and passed all the tests, so it was released to production.

Guess what? That code crashed maybe 1% of the time. We had a large user base, so it quickly became the number one cause of crashes. Because I used the exact code from the vendor's example, I had no reason to expect a crash. It's a good thing we captured the data so we could fix the issue before many customers noticed.

Monitoring shouldn't be limited to crashes. You need to monitor error conditions, unexpected states, even business metrics like account creation, events that create value for the customer, and account churn.

One of my startup friends had an analytics wall in their office. It had eight monitors displaying graphs and metrics, and an array of multi-colored LED lights that blinked different colors for different events. Their team could glance at the LEDs and see the health of their product in real time based on the blend of colors.

Another startup had an internal business page that any employee could view. It showed revenue numbers from the founding of the company to yesterday's stats, with enough detail that any change in trend was easy to spot.

As you scale, you won't be able to identify quality lapses before your customers do unless you have these monitoring systems in place.

Respond When Things Break

Every organization ships defects. The reaction to those defects defines the kind of quality culture you have.

The more critical the defect, the more important the process becomes. The first priority when a severe problem is identified in production is to define its scope. Are new files larger than 100 kilobytes getting truncated? Have hackers gained access to the customer database? Knowing the scope is the first step to defining what an acceptable fix looks like.

I remember one security incident I worked on in which a product manager inserted themselves between us and the CEO and CTO. We were told that we had to do X, Y, and Z to address the security concerns in our product. X and Y, we knew how to implement.

When it came time to execute on Z, sometime around midnight, we struggled. We worked for hours, providing regular status updates to the product manager, who held periodic meetings with the CEO and CTO. A few hours before dawn, I jumped on a call with the CEO and CTO to ask more questions about Z.

"Z? You don't need to do Z," the CEO said. It turns out the product manager somehow communicated the wrong information to us, and our updates about struggling to implement Z weren't reaching the CEO. We had withheld our security fixes and given up sleeping time working on something that wasn't needed.

So it's important that, when things go terribly wrong, there are clear and direct lines of communication. In emergencies, hierarchies need to be broken down to prevent miscommunication between the people doing the work and those making the decisions.

It's also important to learn from mistakes and incorporate that wisdom into documentation and tooling that will prevent or mitigate future emergencies.

Blameless postmortems, written runbooks, and an on-call rotation that gives engineers ownership of production failures separate the organizations that learn from incidents from those that relive them.

The notions of blame and guilt are generally incompatible with a culture of quality. If employees fear punishment for honest failure analysis or for making mistakes, issues will be hidden rather than tackled.

People will make mistakes. In many companies, it is common to blame the person who made the mistake. But the system allowed the mistake to happen and propagate. For instance, deleting a production database shouldn't be easy, but some companies have inadequate guardrails for dangerous operations. When an intern accidentally drops the production database, the focus should be on the ease of deleting production data, not on the intern.
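A guardrail for a dangerous operation can be very small. The sketch below is a hypothetical example, not a real database API: `drop_database` only returns a message, and the rule it enforces (retype the database name to confirm a production deletion) is one common convention among many.

```python
from typing import Optional


class GuardrailError(Exception):
    """Raised when a dangerous operation is attempted without confirmation."""


def drop_database(name: str, environment: str,
                  confirm_name: Optional[str] = None) -> str:
    """Pretend to drop a database, refusing unconfirmed production deletions."""
    if environment == "production" and confirm_name != name:
        raise GuardrailError(
            f"refusing to drop production database {name!r}: "
            "pass confirm_name with the exact database name to proceed"
        )
    # Placeholder for the real deletion, which this sketch does not perform.
    return f"dropped {name} in {environment}"


print(drop_database("scratch", "staging"))
try:
    drop_database("customers", "production")
except GuardrailError as error:
    print("Blocked:", error)
```

The check makes the accidental path fail loudly and leaves the deliberate path open, which puts the responsibility on the system rather than on whoever happened to be at the keyboard.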

It's also important to note that while failure analysis systems like the Five Whys are useful, it is rare for any failure to have a single root cause. Often, a better approach is the fishbone analysis, which can help broaden the analysis of any failure.

Open Lines of Communication

Sometimes the CEO is the last person to know when perceived product quality has fallen off a cliff. I was an engineer at Evernote when Jason Kincaid wrote "Evernote, the bug-ridden elephant".

Designing an organization that honestly evaluates its own performance takes a lot of work. The people with the biggest paychecks aren't inclined to bring bad news to the CEO.

Just as the CEO must defend the standards of quality, they also must periodically monitor where quality is lacking.

And let's be real: no company has a perfect track record. Google, Meta, Apple, Amazon, Anthropic, and OpenAI all have outages and bugs. Car companies recall their vehicles. Pharmaceutical companies and medical device manufacturers have recalls. No product is perfect.

The implication is clear. Your product still ships with flaws even if they're hidden from you. If quality lapses are hidden, you're not in control.

To make quality visible from the CEO's office, first consider metrics for a gross (but gameable) view:

The raw number of churned customers, measured over time
The revenue churn, measured over time
The number of defects released in production over time
The number of support tickets/calls per customer
The customer LTV
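For reference, here is a minimal sketch of how the first few of these might be computed for a single period. Definitions of churn vary between businesses, so treat these formulas as one common convention rather than the standard; the sample numbers are made up.

```python
def customer_churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the customers present at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def revenue_churn_rate(recurring_revenue_at_start: float,
                       recurring_revenue_lost: float) -> float:
    """Fraction of recurring revenue lost to cancellations and downgrades in a period."""
    if recurring_revenue_at_start <= 0:
        raise ValueError("recurring_revenue_at_start must be positive")
    return recurring_revenue_lost / recurring_revenue_at_start


def tickets_per_customer(tickets_opened: int, active_customers: int) -> float:
    """Average number of support tickets opened per active customer in a period."""
    if active_customers <= 0:
        raise ValueError("active_customers must be positive")
    return tickets_opened / active_customers


# Made-up numbers for one month.
print(f"Customer churn: {customer_churn_rate(400, 10):.1%}")
print(f"Revenue churn: {revenue_churn_rate(200_000.0, 8_000.0):.1%}")
print(f"Tickets per customer: {tickets_per_customer(120, 400):.2f}")
```

Tracking the two churn rates side by side matters because they can diverge: losing a handful of your largest accounts barely moves customer churn but shows up immediately in revenue churn.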

Then create opportunities for qualitative data to reach you:

An anonymous form for employee quality concerns
A periodic review with the customer support team, where they raise their product concerns from the support perspective
A periodic review with the sales team to understand how prospects think about product quality
A periodic review with the onboarding team to discuss the challenges faced by new customers
A monthly review of escalated support tickets — the ones the front line couldn't close
NPS or CSAT surveys reviewed quarterly, with verbatim comments read, not just the score
Churn-exit interviews on departing accounts, sampled until the patterns stop changing
Periodic scheduled calls with the most important customers to hear their thoughts and concerns
Spot-check calls with random customers from different profiles and cohorts to ensure that the software meets their needs

The mix of quantitative and qualitative measures keeps the quality system balanced. Strong metrics mean nothing if customers aren't satisfied. Strong metrics also catch a slip before customers do — which is the whole point of having them.

Closing

Quality slips as the business scales because, at the start, only a few people needed to understand what quality meant for the business. The people responsible for the product didn't need to articulate those values. As the business grew, that knowledge and expectation didn't propagate to the new employees. A gap opened up. The organizations that grow without losing quality are the ones where the CEO pushes their team to keep quality policies, guardrails, and signals up to date — and defends each against the inevitable pressure to cut corners.

For the diagnostic version of this argument — why releases keep slipping and the infrastructure that catches problems before users do — see Why Your Software Team Can't Ship.

John M. P. Knox